ABSTRACT

Title of dissertation: EXTENDED ESTIMATING EQUATIONS AND EMPIRICAL LIKELIHOOD
Xuan Yao, Doctor of Philosophy, 2018
Dissertation directed by: Dr. Eric Slud, Department of Mathematics

Classic Estimating Equations (CEE) were first introduced by Godambe in [16] and have been widely used in both parametric and nonparametric settings. However, under some prominent semiparametric models, CEE cannot be used to identify certain low-dimensional parameters. We prove that under regularity conditions, for the Cox (1972) survival-time model, a CEE for the structural parameter does not exist; and under more restrictive conditions, a CEE for the structural parameter in the Accelerated Failure Time (AFT) model does not exist when lifetime is subject to random right censoring with unknown distribution. Motivated by this lack of coverage of CEE's for finite-dimensional parameters in semiparametric problems, we establish a method named Extended Estimating Equations (EEE). The EEE's relax the CEE requirement that the estimating function be a function of the independently identically distributed (i.i.d.) summands alone, and instead allow the estimating function to incorporate ratios of sums of functions depending on two of the i.i.d. arguments. To our knowledge, the broadest class of semiparametric models that can be investigated using EEE is the ϕ-transformation model class that we construct, where ϕ is a given function of covariate, structural parameter and a random error with unknown hazard rate. With different choices of ϕ, the model can represent the general transformation model, the nonlinear location-shift model, models incorporating cumulative integrated functions of times at risk, and others. Inspired by Tsiatis's work in [38], by defining a martingale structure on the residual scale, we are able to prove the asymptotic linearity of the associated EEE, which leads to the asymptotic normality of the structural estimator.
Another perspective from which to view the EEE is as a constraint in the Empirical Likelihood (EL) method. We first show that under the CEE setting, regardless of the continuity of the criterion function, there exists a neighbourhood of the true structural parameter on which there always exists a probability vector that maximizes the EL. The same conclusion can be generalized to the EEE setting with a continuous criterion function, as well as a discontinuous criterion function with the martingale structure of the ϕ-transformation model or the Cox model. A point estimator for the structural parameter can be defined by maximizing the Profile Empirical Likelihood (pEL) associated with the EEE. We show that the pEL estimator is asymptotically normal, with asymptotic variance-covariance matrix identical to that of the Z-estimator obtained by directly solving for the root of the EEE. Finally, we develop algorithms to compute and compare the Z-estimator and pEL estimator associated with the EEE, and determine the minimal sample size for the two estimators to achieve asymptotic normality under three different parametric settings. Simulation shows that a more symmetric covariate usually leads to a smaller threshold sample size, and that the Z-estimator and pEL estimator are close in value and in variance-covariance matrix. We also conclude, by comparing plots of the projections of each function, that in settings where the EEE criterion function is non-smooth, the pEL function tends to be much smoother than the EEE itself.

EXTENDED ESTIMATING EQUATION AND EMPIRICAL LIKELIHOOD

by Xuan Yao

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2018

Advisory Committee:
Dr. Eric Slud, Chair/Advisor
Dr. Shuo Chen
Dr. Benjamin Kedem
Dr. Yan Li, Dean's Representative
Dr.
Tingni Sun

© Copyright by Xuan Yao 2018

Acknowledgments

I would like to express my gratitude to everyone who has been there during my graduate life. Thank you for being part of this unforgettable and fruitful experience. I would like to thank my advisor, Dr. Eric Slud. Thank you for giving me all the challenging yet intriguing questions. He has always been patient with my questions, and has pulled me out of seemingly unsolvable situations with his advice. He has taught me how to view mathematical and statistical questions not as a student, but as a researcher. It has been an honor to learn from him. I would like to thank Dr. Benjamin Kedem, Dr. Tingni Sun, Dr. Shuo Chen and Dr. Yan Li for serving on my dissertation committee. Thank you for reviewing my work thoroughly and for all the valuable comments and suggestions. Thanks to all the other professors in the Mathematics department who have taught me. I would like to thank my parents and my husband for being consistently supportive during the past seven years. Thanks to Brook Qiu, a member who joined my family in 2014, who has been the brightest sunshine in my life since then, and who can smooth any difficulties away with her laughter. I would like to thank my peer and friend Xia Li. Thank you for running towards the finish line together with me, and for pushing me with your diligence and intelligence during the hard times. Thanks to all my other friends.

Table of Contents

Acknowledgements ii
List of Abbreviations v
1 Introduction 1
  1.1 Classic Estimating Equations 2
  1.2 Extended Estimating Equations 5
2 Non-Existence of Classic Estimating Equations 13
  2.1 Non-Existence of Classic Estimating Equations of Cox Model 15
  2.2 Non-Existence of Classic Estimating Equations of Censored AFT 17
  2.3 Some Proofs 22
    2.3.1 Proof of Theorem 2.1.1
28
    2.3.2 Proof of Theorem 2.2.1 30
3 ϕ-Transformation Model 34
  3.1 Introduction 34
  3.2 Extended Estimating Equations and Martingales 37
    3.2.1 Remarks and Sufficient Conditions for Assumptions 40
  3.3 Estimator for Structural Parameter 43
  3.4 Some Proofs 47
    3.4.1 Proof of Proposition 3.2.1 49
    3.4.2 Proof of Lemmas 3.4.2-3.4.5 50
    3.4.3 Proof of Theorem 3.3.2 58
  3.5 List of Notations 64
4 Technical Results I: Owen's Lemmas & Empirical Process 65
  4.1 Assumptions and Notations 66
  4.2 Lemmas Parallel to Owen, and Qin and Lawless 68
  4.3 Some Proofs 71
5 Technical Results II: Zero in the Convex Hull Theorems 82
  5.1 Classic Estimating Equation 84
    5.1.1 Continuous Criterion Function 84
    5.1.2 Discontinuous Criterion Function 90
  5.2 Extended Estimating Equation 93
    5.2.1 Continuous Criterion Function 93
    5.2.2 Discontinuous Criterion Function 99
  5.3 Some Proofs 104
    5.3.1 Proofs Under the CEE Setting 104
    5.3.2 Proofs Under the EEE Setting
115
6 Empirical Likelihood Applied to Extended Estimating Equations 124
  6.1 Empirical Likelihood of Extended Estimating Equation 126
  6.2 Asymptotic Normality Associated with the EEE 127
  6.3 Some Proofs 129
7 Computational Results 142
  7.1 Description of the Algorithm Associated with the EEE 143
    7.1.1 Algorithm for Computing the pEL Estimator 144
    7.1.2 Algorithm for Computing the Z-Estimator 148
  7.2 Local Continuity of EEE and pEL 149
    7.2.1 Plots of Projection of EE and EL 149
    7.2.2 Quantitative Measurement of Local Continuity of l(β) 156
  7.3 Convergence of the Estimator 159
  7.4 Asymptotic Normality of the Z-Estimator and pEL Estimator 159
    7.4.1 Covariates without Normality 160
    7.4.2 Normally Distributed Covariates 163
    7.4.3 Extreme Cases 165
8 Contribution and Future Work 167
  8.1 Original Contribution 167
  8.2 Future Work 170
A Empirical Process Theory 173
Bibliography 177

List of Abbreviations and Notations

AFT  Accelerated Lifetime Model  page 4
CEE  Classic Estimating Equation  page 1
CLT  Central Limit Theorem  page 42
EEE  Extended Estimating Equation  page 5
EL  Empirical Likelihood  page 1
GTM  General Transformation Model  page 4
i.i.d.
Independently Identically Distributed  page 1
LIL  Law of the Iterated Logarithm  page 70
LLN  Law of Large Numbers  page 40
MLE  Maximum Likelihood Estimator  page 1
Ti  Lifetime  page 6
Ci  Right-Censoring Variable  page 6
∆i = I{Ti ≤ Ci}  Non-Censored Indicator  page 6
Vi = min(Ti, Ci)  Event Time  page 6
Zi  Covariate  page 6
𝒵  Support of Z1  page 34
εi  Random Error with Unknown Hazard Rate  page 34
Xi  i.i.d. Observations  page 1
β  p-Dimensional Structural Parameter  page 1
ν  Infinite-Dimensional Nuisance Parameter  page 1
θ = (β, ν)  Parameter in Semiparametric Models  page 1
H  Infinite-Dimensional Nuisance Parameter Space  page 1
m(x, β)  Classic Estimating Function  page 2
mn(x, 𝒳, β)  Extended Estimating Function  page 6
Sn(β)  Summation of m(Xi, β) or mn(Xi, 𝒳, β)  page 2
Conv(β)  Convex hull of mn(Xi, 𝒳, β)  page 83
int(Conv(β))  Interior of Conv(β)  page 86

Chapter 1: Introduction

The method of Classic Estimating Equations (CEE), first introduced by Godambe in [16], is a powerful tool for constructing estimators of the structural parameter in a semiparametric model and has been extensively discussed in the statistical literature. Given a mean-zero estimating function depending on data and the structural parameter alone, the CEE method defines an estimator as the root of the empirical integral of the criterion function. Such an estimator is known as a Z-estimator. To give a formal definition of CEE, let us consider independently identically distributed (i.i.d.) observations

X1, . . . , Xn ~ i.i.d. Pθ,  θ = (β, ν) ∈ R^p × H,  (1.1)

where {Xi}_{i=1}^n are d-dimensional random vectors with support 𝒳, β ∈ R^p is a finite-dimensional structural parameter, and ν is a nuisance parameter in an infinite-dimensional space H such as a function space. Let β0 and ν0 respectively denote the true structural and nuisance parameter values.

1.1 Classic Estimating Equations

In our usage, a CEE sets equal to zero a summation of mean-zero functions of single independent data elements and the structural parameter only; such functions are called estimating functions.
Definition 1.1.1 (Classic Estimating Equation) Let X1, . . . , Xn be i.i.d. observations as in (1.1), and let m(x, β) : 𝒳 × R^p → R^p satisfy

Eθ {m(X1, β)} = 0, for all θ = (β, ν), ν ∈ H,  (1.2)

where β is generally a proper subvector of the whole unknown parameter θ. If there exists a set Uβ0, a neighbourhood of β0, such that β̂n is the unique solution to

Sn(β) = Σ_{i=1}^n m(Xi, β) = 0, for β ∈ Uβ0,  (1.3)

then (1.3) is called an estimating equation for β.

CEE provide consistent and asymptotically normal estimators under regularity conditions, including

Eθ0 {∇β m(X1, β)} is nonsingular for β ∈ Uβ0,  (1.4)

where Uβ0 ⊂ R^p is a neighbourhood of β0. Such conditions can be found in various places in the statistical literature, such as Theorems 5.7 and 5.23 of [41]. A well-known example of an estimator constructed by means of an estimating equation is the Maximum Likelihood Estimator (MLE) when ν is not present. Consider a simple parametric case in which the parameter is θ = β. Then the MLE maximizes Π_{i=1}^n f(Xi; θ), or equivalently Σ_{i=1}^n ln f(Xi; θ), where f(x; θ) is the density function of X1 with respect to Lebesgue measure. If f(x; θ) is differentiable with respect to θ for each fixed value x, then the MLE is a solution to

Σ_{i=1}^n l(θ; Xi) = 0, where l(θ; x) = ∂ ln f(x; θ)/∂θ.  (1.5)

Note that under regularity conditions such as those summarized in Section 3.2 of [42], l(θ; x) has mean zero. Under further regularity conditions guaranteeing the other parts of Definition 1.1.1, (1.5) is a classic estimating equation with the choice m(x, θ) = l(θ; x). The CEE is also related to the Empirical Likelihood (EL) method, as shown by Owen in [31]. Using the criterion function associated with a CEE as a constraint, under a nonparametric setting, Owen established EL ratio confidence intervals for a single functional in [32]. Both the Z-estimator and Owen's theory in [32] require the dimension of the CEE to be equal to that of the structural parameter.
By allowing the former to exceed the latter, Qin and Lawless generalized Owen's conclusion in [33]. Qin and Lawless also constructed a point estimator for the structural parameter by maximizing the EL, and showed that such an estimator is asymptotically normal with a sandwich-form asymptotic variance-covariance matrix. Despite the positive features of CEE, it has limitations in some prominent semiparametric models in survival analysis, such as the Cox model (restricted to the non-time-dependent version in this thesis) and the Accelerated Failure Time (AFT) model. The Cox model, or proportional hazards model, was proposed by Cox in [11]. It assumes that the conditional hazard rate function given covariates is proportional (as a function of time) to the nuisance or "baseline" hazard function, by a factor depending on a linear combination of the covariates. Cox proposed to estimate the structural regression-coefficient parameter by maximizing the partial likelihood in [12], and Andersen and Gill developed the essential martingale-based large-sample distributional properties in [2]. Efron discussed the efficiency of the partial likelihood estimator of the Cox model in [14]. In Section 5.2 of [39], Tsiatis showed that the structural estimator constructed by maximizing the partial likelihood is globally semiparametric efficient. However, the equation system associated with the partial likelihood does not fit the definition of CEE due to the appearance of a quotient of higher-order summations. When the Cox model does not fit a possibly right-censored sample of survival data, an important alternative model is the AFT. The AFT can be considered as a special case of a General Transformation Model (GTM). The latter assumes that the lifetime, transformed via a known monotone function, depends linearly on the covariate with unknown regression coefficient, plus an independent random error with unknown hazard rate function.
When the monotone transforming function is chosen to be the natural logarithm, the GTM becomes the AFT. An equivalent formulation of the AFT assumes that, conditionally given the covariates, the lifetime is proportional to some unknown baseline lifetime independent of the covariates, with proportionality factor equal to an exponential function of a linear combination of the covariates. The AFT has been extensively investigated by Miller [30], Buckley and James [8], Koul et al. [25], Louis [28], Wei and Gail [44], James and Smith [22], Ritov and Wellner [35], Lai and Ying [26], Wei et al. [45], and Ritov [34]. Tsiatis proposed a class of linear rank statistic estimators by constructing a martingale on the residual scale in [38]. He also showed that the "estimating equation" through which the structural parameter is defined is asymptotically linear, and he established the asymptotic normality of the structural estimator using the asymptotic linearity and the martingale central limit theorem. In [34], Ritov showed that the linear rank statistic estimator is efficient. However, like the equation related to the partial likelihood for the Cox model, the "estimating equation" in [38] for the AFT also involves a quotient of higher-order summations and again does not fit the definition of CEE. As shown by Cox and Tsiatis in [13] and [38], the structural estimators constructed under Cox and AFT model assumptions are usually defined through solving equations that are summations of non-i.i.d. summands, and this non-i.i.d. structure violates the usual assumption on CEE. In fact, we prove that under regularity conditions, a CEE does not exist for the Cox model. For the AFT, under more restrictive regularity conditions and with right-censored data, a CEE does not exist either.
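Before moving to the extended setting, the score-equation CEE (1.5) can be made concrete with a small numerical sketch. This is a hypothetical illustration, not part of the thesis: the exponential parametric model, seed and sample size are all assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0                              # true parameter (assumed for the demo)
x = rng.exponential(scale=1.0 / theta0, size=5000)

def score_sum(theta):
    # S_n(theta) = sum_i l(theta; x_i) with l(theta; x) = 1/theta - x,
    # the score of the exponential density f(x; theta) = theta * exp(-theta * x)
    return np.sum(1.0 / theta - x)

# Newton iteration for the root of S_n; here S_n'(theta) = -n / theta^2
theta = 1.0
for _ in range(50):
    theta -= score_sum(theta) / (-len(x) / theta**2)
```

The iteration converges to the closed-form MLE 1/x̄, illustrating the defining feature of a CEE: each summand l(θ; X_i) depends on a single observation only.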
1.2 Extended Estimating Equations

In order to extend the regime of CEE to cover right-censored semiparametric models, we define the class of Extended Estimating Equations (EEE) by allowing the estimating function to depend not only on single observations but also on quotients of averages, with respect to one index, of functions of the structural parameter and two observations from the sample. Recall that for the Cox model and the AFT, the estimators for β are usually constructed by solving

AFT Model: Σ_{i=1}^n ∆i { Zi − [Σ_{j=1}^n Zj I{ln(Vj) − β^tr Zj ≥ ln(Vi) − β^tr Zi}] / [Σ_{j=1}^n I{ln(Vj) − β^tr Zj ≥ ln(Vi) − β^tr Zi}] } = 0,  (1.6)

Cox Model: Σ_{i=1}^n ∆i { Zi − [Σ_{j=1}^n Zj I{Vj ≥ Vi} e^{β^tr Zj}] / [Σ_{j=1}^n I{Vj ≥ Vi} e^{β^tr Zj}] } = 0,  (1.7)

where the notations in (1.6)-(1.7) are defined on page v. Evidently, (1.6) and (1.7) do not satisfy the definition of classic estimating equations, because of the quotient of two i.i.d. summations within the curly brackets. Inspired by the formulation of the summands in (1.6) and (1.7), let

mn(Xi, 𝒳, β) = Q(Xi, β) { C(Xi) − [Σ_{j=1}^n C(Xj) k(Xi, Xj, β)] / [Σ_{j=1}^n k(Xi, Xj, β)] }.  (1.8)

For both the AFT and Cox models, Xi = (Ti, Ci, Zi, ∆i), and we can choose Q(Xi, β) = ∆i = I{Ti ≤ Ci}, C(Xi) = Zi, and

for the AFT Model: k(Xi, Xj, β) = I{ln(Vj) − β^tr Zj ≥ ln(Vi) − β^tr Zi};
for the Cox Model: k(Xi, Xj, β) = I{Vj ≥ Vi} e^{β^tr Zj}.

Then an extended estimating equation is defined in the following way.

Definition 1.2.1 (Extended Estimating Equations) Let X1, . . . , Xn be random vectors as in (1.1). Let Q(x, β) : R^d × R^p → R, k(x, y, β) : R^d × R^d × R^p → R^+ and C(x) : R^d → R^p be measurable functions. Assume that

E_{β,ν} {mn(X1, 𝒳, β)} = 0, for all θ = (β, ν), ν ∈ H,  (1.9)

where β is generally a proper subvector of the whole unknown parameter θ. If there exists a unique solution β̂n to

Sn(β) = Σ_{i=1}^n mn(Xi, 𝒳, β) = 0, β ∈ Uβ0,  (1.10)

then (1.10) is the extended estimating equation.
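For concreteness, the EEE summand (1.8) with the Cox choices of Q, C and k can be evaluated directly on simulated data. The following is a minimal sketch; the unit baseline hazard, censoring distribution, sample size and seed are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 200, 0.5
Z = rng.normal(size=n)
T = rng.exponential(scale=np.exp(-beta * Z))   # lifetimes with hazard exp(beta*z), unit baseline
C = rng.exponential(scale=2.0, size=n)         # independent right censoring
V, Delta = np.minimum(T, C), (T <= C)

def m_n(i, b):
    # EEE summand (1.8): Q = Delta_i, C(X_j) = Z_j, k = 1{V_j >= V_i} * exp(b * Z_j)
    k = (V >= V[i]) * np.exp(b * Z)
    return Delta[i] * (Z[i] - np.sum(Z * k) / np.sum(k))

S_n = sum(m_n(i, beta) for i in range(n))      # the partial-likelihood score (1.7)
```

Because each summand pools the whole sample through the ratio of sums over j, Sn(β) is not a sum of i.i.d. terms, which is exactly the feature that the EEE definition accommodates.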
In Definition 1.2.1, it is required that the estimating function mn(X1, 𝒳, β0) have mean zero, which is the same assumption as in the classic estimating equation. For the specific examples of the Cox model and the AFT, this assumption will be verified through a martingale property; we discuss this in detail in Chapter 3. In order to ensure the existence of β̂n and its consistency and asymptotic normality, more assumptions need to be made. For example, the quotient term in (1.8) cannot explode to infinity as n goes to infinity, and in order to have asymptotic normality, there should exist a neighbourhood of β0 such that for all β in this neighbourhood, ∂mn(Xi, 𝒳, β)/∂β is nonsingular. We collect this series of assumptions in Chapter 4. In this thesis, we will also be applying CEE's and EEE's in an EL estimation framework. Rather than constructing EL confidence regions, in this thesis EL is primarily an approach to computing estimators, although it could readily be further developed for its more common use in determining confidence regions. We give self-contained definitions and proofs of EL constructions in Chapters 4 and 5 using empirical process theory. In Chapter 6, we discuss the asymptotic normality of the two estimators associated with the EEE, namely the one given by directly solving the EEE and the one given by maximizing the EL, which share identical sandwich-form variance-covariance matrices under regularity conditions. One thing that we would like to point out is that the definition of EEE is different from the term "martingale estimating equation" in literature such as [4] by Bibby et al., [29] by Merkouris, or [20] by Hwang et al., whose primary interest was to construct estimating functions that estimate structural and nuisance parameters simultaneously.
To modify the CEE estimator and confidence regions defined through the EL method to become applicable to right-censored semiparametric models, Hjort, McKeague and van Keilegom proposed a "plug-in" method in [19], i.e., using an empirical estimator of the nuisance parameter. They also treated settings with slower-than-√n rates of convergence, and settings with large numbers of estimating equations compared to the sample size. In [46] and [48], Zhou extended the Wilks-type confidence region of [32] to right-censored data by replacing the unknown survival function of the right-censoring variable with the Kaplan-Meier estimator. However, after replacing the unknown nuisance parameter with its empirical estimator, the estimating functions no longer satisfy the definition of CEE. We develop a broader class of semiparametric models for which the EEE definition holds, namely the ϕ-transformation models with right-censored data. These models can be considered as generalizations of the GTM in the sense of characterizing the relation between the dependent variable and covariates by a known function ϕ, where ϕ depends on the covariates, structural parameter and an error with unknown hazard rate, and ϕ is monotone with respect to the error term. With different choices of ϕ, the ϕ-transformation model class covers a range of semiparametric models, including the AFT, the linear model in [34] and [38], and a nonlinear regression model in [39] that allows location and shape to change according to covariates. The transformation function ϕ can also have a non-analytical form. For example, it can be defined as an integral of a given wear-out rate function of structural parameter and covariate. Such models can be found in [10] and [3]. Following Tsiatis's work in [38], by constructing a martingale structure on the residual, we show that for the ϕ-transformation model, the estimating equation that defines an estimator of the structural parameter alone satisfies the definition of EEE.
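For orientation, here is how two familiar models arise from choices of the transformation, written on the residual scale. This is a hedged sketch: the exact parametrization of ϕ used in Chapter 3 may differ, and sign conventions for β vary in the literature (compare (1.6) with (2.11)); here we follow the convention of (2.11).

```latex
% GTM: a known monotone transform g of the lifetime is linear in the
% covariate, plus an error \varepsilon_i with unknown hazard rate
g(T_i) = -\beta^{\mathrm{tr}} Z_i + \varepsilon_i
% AFT: the special case g = \ln, consistent with
% T = e^{-\beta^{\mathrm{tr}} z} T_0 in (2.11), with \varepsilon_i = \ln T_{0,i}
\ln T_i = -\beta^{\mathrm{tr}} Z_i + \varepsilon_i
```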
We also prove that the EEE associated with the ϕ-transformation model is asymptotically linear, which together with the martingale Central Limit Theorem (CLT) implies the asymptotic normality of the structural estimator. As with the CEE, we can also use the EEE as a constraint in the EL method and then construct a structural estimator by maximizing the profile Empirical Likelihood (pEL). Under the CEE setting, a key element in Owen's, and Qin and Lawless's, work in [32] and [33] is specifying a neighbourhood of the structural parameter in which there exists a unique probability vector that maximizes the EL with probability approaching 1. Therefore we begin, in Chapter 5, by constructing such a neighbourhood for continuous estimating functions under the CEE setting, and then generalize the conclusion to discontinuous estimating functions. Finally, we show that for the EEE, when the estimating function is continuous with respect to the structural parameter, the local uniqueness of solutions to EL maximization is likewise guaranteed with probability approaching 1; when the criterion function is discontinuous, the same conclusion can be drawn under the martingale structure described for the ϕ-transformation model or the Cox model. Note that the martingale assumption is satisfied by all the EEE examples that we know up to now, including the ϕ-transformation model and the Cox model. After proving lemmas parallel to those of Owen, and Qin and Lawless, in [32] and [33], using empirical process theory and some classic examples of Donsker and Glivenko-Cantelli classes listed in [41], we are able to establish the asymptotic normality of the structural point estimator, whose asymptotic variance-covariance matrix is identical to the sandwich-form asymptotic variance-covariance matrix of the corresponding Z-estimator. Finally, we validate the EEE theory by simulation under the AFT model assumption with R (3.4.1).
Since the criterion function corresponding to the EEE of the AFT is discontinuous due to the appearance of the indicator function, the EEE may not have a root. Therefore we instead define the Z-estimator as the value that minimizes the Euclidean norm of the EEE. To calculate the maximum empirical likelihood estimator, we first construct the Lagrange multiplier in the maximization problem as a function of the structural parameter, then calculate the structural parameter by maximizing the pEL. Despite the lack of continuity of the EEE under the AFT model assumption, we prove that a unique solution of the EL maximization exists with probability approaching 1. The intuition is that the pEL function, or equivalently the summation of the negative logarithm of the pEL, appears very smooth for large n. We support this conjecture by plotting the projection of the pEL in randomly generated unit directions. From the pictures, we can see that for a "moderate" sample size depending on the censoring rate and the skewness of the covariates, the plots are always very smooth and almost parabolic around the location of the maximum pEL estimator. On the other hand, plots of the projections of the EEE have many jumps, even around the true structural parameter value. We also develop a quantitative way to compare the continuity pattern around the pEL estimator and around the true structural parameter, and we find that the pattern around the pEL estimator is very similar to that around the true parameter value. From simulation, we can also see that for a moderate sample size, the EEE estimator and the pEL estimator are very close, as measured by the L1 distance between the two. The difference between their variance-covariance matrices is also very small, as evidenced by the small magnitudes of its eigenvalues. Heuristically, this is because the Lagrange multiplier associated with the pEL has a very small magnitude, which makes the constraint function for the pEL "almost the same" as the EEE.
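The Lagrange-multiplier step just described can be illustrated in the simplest scalar CEE case. This is a hypothetical sketch, not the thesis algorithm: it uses plain Newton iteration, whereas a robust implementation would safeguard the steps so that 1 + λ m_i stays positive throughout.

```python
import numpy as np

def el_log_ratio(m):
    """Profile empirical log-likelihood ratio for the constraint sum_i p_i m_i = 0.

    Owen-style dual: p_i = 1 / (n (1 + lam * m_i)), where lam solves
    sum_i m_i / (1 + lam * m_i) = 0 (found here by plain Newton iteration).
    """
    n = len(m)
    lam = 0.0
    for _ in range(100):
        d = 1.0 + lam * m
        lam -= np.sum(m / d) / (-np.sum(m**2 / d**2))
    p = 1.0 / (n * (1.0 + lam * m))
    return np.sum(np.log(n * p)), p

rng = np.random.default_rng(2)
m = rng.normal(size=100)          # stand-in for criterion values m(X_i, beta)
llr, p = el_log_ratio(m)          # llr <= 0, maximized (at 0) when the sample mean is 0
```

Profiling over β then amounts to repeating this multiplier computation at each candidate β and maximizing the resulting log-EL.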
Finally, we check the asymptotic normality of the pEL estimator under three different parametric settings with right-censored data: non-normally distributed, normally distributed, and severely skewed covariates. We find that it takes a larger sample size for the pEL estimator to reach normality if the corresponding covariate is not normally distributed, and the sample size needs to be even larger as the skewness of the covariate grows.

Chapter 2: Non-Existence of Classic Estimating Equations

The Cox model and the Accelerated Failure Time (AFT) model are two semiparametric models that have been extensively used in survival analysis, especially when data may be subject to various types of censoring. The usual ways of constructing estimators for the structural parameters of the Cox and AFT models involve solving equations, as mentioned in [13], [38], and [41]. However, as far as we see from the literature, the equations through which an estimator for the structural parameter is constructed do not satisfy the definition of Classic Estimating Equations (CEE). Therefore a natural question is whether a CEE method exists for these semiparametric models. Despite the many advantages of the CEE discussed on page 1, under some circumstances a CEE for a subvector β of parameters may not exist. In this chapter, we prove the nonexistence of the classic EE under the Cox model and the randomly right-censored AFT. More specifically, let us consider a statistical model {Pθ, θ ∈ Θ}, where θ = (β, λ) consists of a structural parameter β ∈ R^p and an infinite-dimensional nuisance parameter λ ∈ H, where

H ≡ { λ(t) : λ(t) > 0 a.e. in t, ∫_0^∞ λ(t) dt = ∞ }.  (2.1)

Let β0 and λ0 denote the true parameter values.
The question this chapter aims to answer in some special cases is whether there exists a CEE that depends only on the data and the structural parameter, i.e., we would like to know if there exists a function m(X, β) : R × R^p → R^p such that for any λ0 and β0,

E_{β0,λ0} {m(X, β0)} = 0,  (2.2)

and

∇β E_{β0,λ0} {m(X, β)} = E_{β0,λ0} {∇β m(X, β)} is nonsingular for β ∈ Uβ0,  (2.3)

under assumptions allowing the passage of ∇β inside E{·}, where Uβ0 is a bounded domain in R^p that contains a neighbourhood of β0. We show that an estimating function m(X, β) satisfying (2.2) and (2.3) and the following regularity conditions does not exist for the Cox model in Section 2.1, and for the censored AFT in Section 2.2.

Assumptions

(A.1) Z is supported on a bounded set 𝒵 ⊂ R^p.

(A.2) ∇β ∫ m(t, z, β) p(z) dz = ∫ ∇β m(t, z, β) p(z) dz, for all β ∈ Uβ0 and t ∈ R^+.

(A.3) For all h ∈ H and sufficiently small ε,

∫∫ m(t, z, β) e^{β^tr z} h(t) e^{−e^{β^tr z}(1−ε)H(t)} p(z) dz dt < ∞.  (2.4)

(A.4) P{T < C} > δ > 0.

(A.5) E_{β,λ} {‖m(T, Z, β)‖} < ∞, for any β ∈ Uβ0 and λ ∈ H.

Remark 2.0.1 Given assumption (A.5), assumption (A.3) is satisfied for all h ∈ H. This is because for any h ∈ H and ε ∈ (0, 1), since 1 − ε > 0, ∫(1 − ε)h(t) dt = (1 − ε) ∫ h(t) dt = ∞, and therefore hε(t) = (1 − ε)h(t) ∈ H. Then

E_{β,hε} ‖m(T, Z, β)‖ = ∫∫ ‖m(t, z, β)‖ e^{β^tr z} hε(t) e^{−e^{β^tr z} Hε(t)} p(z) dz dt
= (1 − ε) ∫∫ ‖m(t, z, β)‖ e^{β^tr z} h(t) e^{−e^{β^tr z}(1−ε)H(t)} p(z) dz dt.  (2.5)

Under assumption (A.5), the left-hand side of (2.5) is bounded, and therefore so is the right-hand side, which is what (A.3) states.

2.1 Non-Existence of Classic Estimating Equations of Cox Model

As proposed in [11], the Cox model assumes that the conditional hazard rate of the lifetime T given the covariate Z is proportional to an unknown baseline hazard rate function λ(t), i.e.,

λ_{T|Z}(t|z) = e^{β^tr z} λ(t),  (2.6)

where β ∈ R^p is the regression coefficient to be estimated, and Z is the covariate with density function pZ(z).
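Model (2.6) can be simulated by inverting the conditional cumulative hazard: since S_{T|Z}(t|z) = exp{−e^{β^tr z} Λ(t)}, the transform Λ(T) e^{β^tr Z} is a unit exponential. The following sketch uses an illustrative Weibull baseline Λ(t) = t^a; all numeric choices are assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta, a = 100_000, 0.7, 1.5          # a: Weibull baseline, Lambda(t) = t^a
Z = rng.uniform(-1.0, 1.0, size=n)

# Lambda(T) = E * exp(-beta * Z) with E ~ Exp(1), so T = Lambda^{-1}(E e^{-beta Z})
E = rng.exponential(size=n)
T = (E * np.exp(-beta * Z)) ** (1.0 / a)

# sanity check: Lambda(T) * exp(beta * Z) should again be unit exponential
U = T**a * np.exp(beta * Z)
```

The same inversion works for any baseline with an invertible cumulative hazard, which is what makes the nuisance parameter λ genuinely infinite-dimensional in the nonexistence argument.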
The Cox model (2.6) is semiparametric, with parameter

θ = (β, λ) ∈ Θ = Uβ0 × H,  (2.7)

where Uβ0 is an open and bounded subset of R^p that contains the true parameter value β0, and H is defined in (2.1). Unlike the Kaplan-Meier estimator, which primarily constructs an estimator for the nuisance parameter, the Cox model (2.6), through the incorporation of the regression-like factor e^{β^tr z}, gives a way of estimating the structural parameter under a semiparametric assumption. Cox introduced the notion of partial likelihood in [12], and β̂n, the maximum partial likelihood estimator of β0, is defined as the solution to

Σ_{i=1}^n ∆i { Zi − [Σ_{j=1}^n Zj e^{β^tr Zj} I{Vj ≥ Vi}] / [Σ_{j=1}^n e^{β^tr Zj} I{Vj ≥ Vi}] } = 0.  (2.8)

Andersen and Gill derived the large-sample theoretical properties of β̂n in [2]. It can be shown that β̂n constructed by solving equation (2.8) is a semiparametric efficient estimator; details of semiparametric efficiency can be found in [39]. However, equation (2.8) does not satisfy the definition of classic EE because the summands are not independently identically distributed (i.i.d.). In fact, in this section we show that under regularity conditions, a CEE does not exist for the Cox model. Under the Cox model assumption, the survival function of T given Z is

S_{T|Z}(t) = exp{ −e^{β^tr z} ∫_0^t λ(s) ds },

and the density function of T is

f(t) = ∫ e^{β^tr z} e^{−e^{β^tr z} Λ(t)} λ(t) pZ(z) dz,  (2.9)

where Λ(t) = ∫_0^t λ(s) ds for t ≥ 0. We would like to see whether there can exist a function m(t, z, β) : R^+ × 𝒵 × Uβ0 → R^p of lifetime, covariate and structural parameter alone which satisfies, for all θ ∈ Θ and all λ ∈ H,

E_{β,λ} {m(T, Z, β)} = ∫∫ m(t, z, β) e^{β^tr z} e^{−e^{β^tr z} Λ(t)} λ(t) pZ(z) dz dt ≡ 0.
(2.10)

By the dominated convergence theorem applied to difference quotients with respect to ε, for each h ∈ H, with g ranging freely over functions bounded by 1, we have the following theorem.

Theorem 2.1.1 For the Cox model given in (2.6), there does not exist a function m(t, z, β) of the data X = (T, Z) supported on R^+ × 𝒵 and the structural parameter β ∈ Uβ0 alone which simultaneously satisfies assumptions (A.1)-(A.5) and equations (2.2) and (2.3).

The proof of this theorem can be found in Section 2.3. In the next section, we discuss the non-existence of CEE for the AFT model.

2.2 Non-Existence of Classic Estimating Equations of Censored AFT

The AFT model assumes that conditionally given the covariate Z, the lifetime T is proportional to some baseline lifetime T0, i.e.,

T = e^{−β^tr z} T0,  (2.11)

where T0 is a lifetime with unknown hazard rate function λ(t), β ∈ R^p is the regression coefficient to be estimated, and Z is the covariate with density function pZ(z). Assume that Z is supported on a bounded subset 𝒵 ⊂ R^p; hence (2.11) forms a semiparametric model with parameter

θ = (β, λ) ∈ Θ = Uβ0 × H,  (2.12)

where Uβ0 is an open and bounded subset of R^p that contains the true parameter value β0, and

H ≡ { λ(t) : λ(t) ≥ 0 for all t, ∫_0^∞ λ(t) dt = ∞ }  (2.13)

is an infinite-dimensional space. The survival function of T given Z is given by

S_{T|Z}(t) = exp{ −Λ(t e^{β^tr z}) },

and the density function of T is given by

f(t) = ∫ e^{β^tr z − Λ(t e^{β^tr z})} λ(t e^{β^tr z}) pZ(z) dz.  (2.14)

There are two different constructions of the regression parameter β depending on whether β has an intercept term.

Case 1: β does not have an intercept term. In this case, λ is unrestricted. We do not know any existing CEE based on β and the data alone. There probably is no such estimating equation, even though we have not proved this.

Case 2: β has an intercept term.
Without loss of generality, assume that the expected value of T₀ is 1, that is,

E{T₀} = E{ e^{β^tr Z} T | Z } = ∫ t λ(t) e^{−Λ(t)} dt ≡ 1.  (2.15)

Therefore, for any p(z)dz-integrable functions a(z) : R^p → R^p and b(z) : R^p → R^p such that E{a(Z)} = E{b(Z)} ≠ 0,

m(T, Z, β) = T e^{β^tr Z} a(Z) − b(Z)  (2.16)

has mean zero and is an estimating function if there exists U_{β₀} such that ∇_β m(T, Z, β) is nonsingular for all β ∈ U_{β₀}. For example, a(z) can be a linear function of z, i.e., a(z) = Az, where A = E{b(Z)} · E{ T e^{β^tr Z} Z }^{−1} is a p × p matrix.

The discussion in the previous paragraphs shows that we reach different conclusions about the non-existence of CEE when the lifetime T is always observable. Next, we introduce the concept of censoring, a commonly encountered situation in practice. When lifetimes are subject to right censoring, instead of observing T and Z, we observe X = (V, ∆, Z), where

V = min(T, C),  ∆ = I{T ≤ C},  (2.17)

and C is the right censoring variable with hazard function Λ_C and hazard rate λ_C. Koul, Susarla and Van Ryzin proposed a classic estimating equation for β in [25] under the assumptions

(K1): C is independent of (T, Z);

(K2): The survival function of C is known and is denoted by S_C(c).

Actually, assumption (K2) is seldom reasonable as a modeling assumption, unless the function S_C is estimated from another source or censoring is purely "administrative", i.e., occurs when the study observation period ends and the pattern of times of entry into the survival study does not depend on covariates or survival time. When the distribution function of C is unknown, Koul, Susarla and Van Ryzin proposed to use the Kaplan-Meier estimator instead. However, the estimating function used to construct the estimator for the structural parameter then does not satisfy the definition of CEE, due to the presence of the Kaplan-Meier estimator, because the summands in it are no longer independent.
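To illustrate Case 2 numerically, the following sketch (not from the dissertation; all data and function choices are illustrative) solves the empirical version of the moment estimating equation (2.16) in the simplest uncensored setting, with p = 1 and a(z) = b(z) = 1, so the equation reduces to (1/n) ∑_i T_i e^{β Z_i} − 1 = 0. Data are simulated from the AFT model T = e^{−β₀ Z} T₀ with E{T₀} = 1.

```python
# Illustrative sketch (not from the dissertation): solving the Case 2
# moment estimating equation (2.16) with a(z) = b(z) = 1 and p = 1, so
# that (1/n) * sum_i T_i * exp(beta * Z_i) - 1 = 0.  Data are simulated
# from the uncensored AFT model T = exp(-beta0 * Z) * T0 with E{T0} = 1.
import math
import random

random.seed(0)
beta0, n = 1.0, 4000
Z = [random.uniform(0.5, 1.5) for _ in range(n)]
T = [math.exp(-beta0 * z) * random.expovariate(1.0) for z in Z]  # E{T0} = 1

def m_bar(beta):
    """Empirical mean of m(T, Z, beta) = T * exp(beta * Z) - 1."""
    return sum(t * math.exp(beta * z) for t, z in zip(T, Z)) / n - 1.0

# m_bar is strictly increasing in beta (all T_i, Z_i > 0), so the root
# can be bracketed and found by bisection.
lo, hi = -5.0, 5.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if m_bar(mid) < 0 else (lo, mid)
beta_hat = 0.5 * (lo + hi)
```

Since T₀ is independent of Z, E{T e^{β Z}} = E{e^{(β−β₀)Z}}, which equals 1 exactly at β = β₀ when Z > 0, so the root is consistent for β₀; with n = 4000 the estimate lands close to β₀ = 1.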
Similarly, neither the general "plug-in" method discussed by Hjort, McKeague and van Keilegom in [19], nor the method proposed by Zhou in Chapter 7 of [48], uses estimating functions that satisfy the definition of CEE given in Chapter 1. In fact, in this section we show that under regularity conditions, if the expectation of the estimating function is identically zero when the lifetime is arbitrarily right censored, for all possible model parameters, then a CEE does not exist. Let C be a right censoring variable that is independent of (T, Z), and let

m(t, z, β) = m₁(t, z, β) when ∆ = 1;  m₀(t, z, β) when ∆ = 0.  (2.18)

We show that when m₀(t, z, β) ≡ 0 for all t ∈ R⁺, z ∈ Z and β ∈ U_{β₀}, then a non-trivial CEE does not exist. Note that under most right-censoring CEE formulations, the equations through which an estimator for the structural parameter is defined are usually of the form ∑_{i=1}^n m_n(X_i, X, β), where X_i = (V_i, Z_i, ∆_i) and

m_n(X_i, X, β) = ∆_i · m*_n(X_i, X, β),  (2.19)

and the assumption m₀(t, z, β) ≡ 0 for all t ∈ R⁺, z ∈ Z is satisfied by (2.19). The estimating function m(T, Z, β) must have mean zero, which implies that for all hazard functions Λ_C(c) = ∫_0^c λ_C(s) ds with λ_C ∈ H, λ ∈ H and β ∈ U_{β₀},

0 ≡ ∫∫∫_{0<t<c} m₁(t, z, β) e^{β^tr z − Λ(t e^{β^tr z}) − Λ_C(c)} λ_C(c) λ(t e^{β^tr z}) p(z) dt dz dc
  + ∫∫∫_{0<c<t} m₀(c, z, β) e^{β^tr z − Λ(t e^{β^tr z}) − Λ_C(c)} λ_C(c) λ(t e^{β^tr z}) p(z) dc dz dt  (2.20)
 = ∫∫ m₁(t, z, β) λ(t e^{β^tr z}) e^{β^tr z − Λ(t e^{β^tr z}) − Λ_C(t)} p(z) dt dz
  + ∫∫ m₀(c, z, β) λ_C(c) e^{−Λ(c e^{β^tr z}) − Λ_C(c)} p(z) dc dz.

Now let us present the main theorem of this section.

Theorem 2.2.1 Under assumptions (A.1), (A.2), (A.4), and (A.5), for the AFT model given in (2.11) with arbitrarily right censored data (2.17),

(a) A CEE does not exist when assuming

m₀(c, z, β) ≡ 0 for all c ∈ R⁺, z ∈ Z and β ∈ U_{β₀}.  (2.21)

(b) A CEE does not exist when assuming

m₁(c, z, β) ≡ 0 for all c ∈ R⁺, z ∈ Z and β ∈ U_{β₀}.  (2.22)

The conclusion described in (a) is the main result we would like to present.
It can be shown by proving

β₀^tr E{ ∇_β m₁(T, Z, β₀) } = 0,

which violates the non-singularity assumption on the gradient described in (2.3).

2.3 Some Proofs

In this section, for simplicity, the ranges of integration for u and t are (0, ∞) unless otherwise specified. Lemma 2.3.1 gives an important identity that will be used to derive the non-existence of CEE under the Cox model assumption, and Lemma 2.3.2 shows how the non-existence conclusion can be drawn without imposing smoothness assumptions on m(·, z, β).

Lemma 2.3.1 Under assumptions (A.1)-(A.6), for a.e. t ∈ R⁺ and all β ∈ U_{β₀},

∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz = ∫_t^∞ [ ∫ m(u, z, β) e^{2β^tr z − e^{β^tr z} H(u)} h(u) p(z) dz ] du.  (2.23)

Proof To start, let us consider the following construction of a hazard function. Let h(t) ∈ H be a candidate baseline hazard function, and let H(t) = ∫_0^t h(s) ds. Suppose that the baseline hazard rate in (2.6) is of the form λ(t) = h(t) e^{ε g(t)}, where ε is a positive constant and g(t) ∈ G is a continuous and bounded function, i.e.,

G ≡ { g(t) : g(t) is continuous, and |g(t)| ≤ M_g < ∞ for all t ∈ R⁺ }.  (2.24)

The boundedness assumption on g guarantees that λ(t) integrates to ∞ and consequently is a hazard rate function. This is because if |g(t)| ≤ M for all t ∈ R⁺, then for any ε > 0, ε g(t) ≥ −εM, and therefore for any t ∈ R⁺ and h ∈ H,

∫_0^t h(x) e^{ε g(x)} dx ≥ e^{−εM} ∫_0^t h(x) dx.

Since ∫_0^∞ h(t) dt = ∞, we know that ∫_0^∞ h(t) e^{ε g(t)} dt = ∞. Let us consider

λ(t) = h(t) e^{ε g(t)}, where g ∈ G.  (2.25)

By (2.10), for all θ ∈ Θ and g ∈ G,

E_θ{m(T, Z, β)} = ∫∫ m(t, z, β) e^{β^tr z − e^{β^tr z} ∫_0^t h(s) e^{ε g(s)} ds} h(t) e^{ε g(t)} p(z) dt dz ≡ 0.  (2.26)

Consider the double integral in (2.26) as a function of ε; then for any fixed h ∈ H, g ∈ G, β ∈ U_{β₀} and ε ≥ 0,

r(ε, g, h) = ∫∫ m(t, z, β) e^{β^tr z − e^{β^tr z} ∫_0^t h(s) e^{ε g(s)} ds} h(t) e^{ε g(t)} p(z) dt dz = 0.  (2.27)

For fixed g ∈ G and h ∈ H, r(ε, g, h) = 0 for any ε ≥ 0. Therefore ∇₁ r(0, g, h) = 0.
By (2.26)-(2.27) applied with ε > 0 and ε = 0,

∇₁ r(0, g, h) = lim_{ε→0} ∫∫ m(t, z, β) e^{β^tr z} h(t) p(z) × (1/ε) { e^{−e^{β^tr z} ∫_0^t h(s) e^{ε g(s)} ds} e^{ε g(t)} − e^{−e^{β^tr z} H(t)} } dt dz = 0.  (2.28)

To apply the Dominated Convergence Theorem and pass the limit into the double integral of (2.28), we first rewrite the difference quotient in the second line as

e^{−e^{β^tr z} H(t)} { e^{ε g(t)} · (1/ε)[ e^{−e^{β^tr z} ∫_0^t h(s)(e^{ε g(s)} − 1) ds} − 1 ] + (1/ε)[ e^{ε g(t)} − 1 ] }.  (2.29)

Note that for any fixed g and t ∈ R⁺, (e^{ε g(t)} − 1)/ε = O(g(t)) is uniformly bounded. As for the first term in (2.29), it is equal to

e^{ε g(t)} · (1/ε)[ e^{−e^{β^tr z} ε ∫_0^t h(s)(g(s) + o(1)) ds} − 1 ].  (2.30)

Since g(·) is bounded, write ∫_0^t h(s)(g(s) + o(1)) ds = θ(t) H(t), where |θ(t)| ≤ M₀ ∈ R⁺ for all t ∈ R⁺. Therefore the integrand in (2.28) is bounded by

‖m(t, z, β)‖ h(t) e^{β^tr z − e^{β^tr z} H(t)} {1 + ε θ(t)} p(z),

and under assumption (A.3), ∫∫ ‖m(t, z, β)‖ h(t) e^{β^tr z − e^{β^tr z} H(t)} {1 + ε θ(t)} p(z) dz dt < ∞. Therefore, by the Dominated Convergence Theorem, the limit in (2.28) can be passed into the integral. Hence

∫∫ ∂/∂ε { m(t, z, β) e^{β^tr z} h(t) p(z) e^{−e^{β^tr z} ∫_0^t h(s) e^{ε g(s)} ds} e^{ε g(t)} } |_{ε=0} dt dz = 0,  (2.31)

for all θ ∈ Θ and g ∈ G, which implies that

∫∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} h(t) p(z) { g(t) − e^{β^tr z} ∫_0^t h(s) g(s) ds } dt dz ≡ 0,  (2.32)

for all θ ∈ Θ and g ∈ G. We can rewrite (2.32) by the Fubini Theorem and get, for all θ ∈ Θ and g ∈ G,

0 ≡ ∫ g(t) h(t) { ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz − ∫_t^∞ ∫ m(u, z, β) e^{2β^tr z − e^{β^tr z} H(u)} h(u) p(z) dz du } dt.  (2.33)

Equation (2.33) implies that for a.e. t ∈ R⁺ and all β ∈ U_{β₀},

∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz = ∫_t^∞ [ ∫ m(u, z, β) e^{2β^tr z − e^{β^tr z} H(u)} h(u) p(z) dz ] du,  (2.34)

which is the assertion of the Lemma. ∎

Lemma 2.3.2 Under assumptions (A.1)-(A.5),

∇_t ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz ≡ 0, for all s, t ∈ R⁺ and β ∈ U_{β₀}.  (2.35)

Proof For fixed β ∈ U_{β₀} and h ∈ H, denote

f(t) = ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz.
Then by (2.34), f(t) is differentiable for all t ∈ R⁺ and h ∈ H. Differentiating both sides of (2.34), for all β ∈ U_{β₀} and h ∈ H,

f′(t) = − ∫ m(t, z, β) e^{2β^tr z − e^{β^tr z} H(t)} h(t) p(z) dz.  (2.36)

By the definition of the left hand side of (2.36),

f′(t) = lim_{δ→0} (1/δ) { ∫ m(t+δ, z, β) e^{β^tr z − e^{β^tr z} H(t+δ)} p(z) dz − ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz }.  (2.37)

Rewrite the difference quotient in (2.37) as

(1/δ) ∫ m(t, z, β) { e^{β^tr z − e^{β^tr z} H(t+δ)} − e^{β^tr z − e^{β^tr z} H(t)} } p(z) dz
+ (1/δ) ∫ e^{β^tr z − e^{β^tr z} H(t+δ)} { m(t+δ, z, β) − m(t, z, β) } p(z) dz.  (2.38)

Next, we discuss the two lines of (2.38). The uniform boundedness of h leads to sup_t |H(t+δ) − H(t)| ≤ M_ε < ∞ for |δ| < ε < ∞. So by the Dominated Convergence Theorem, the first term of (2.38) converges to

∫ m(t, z, β) ∇_t { e^{β^tr z − e^{β^tr z} H(t)} } p(z) dz = − ∫ m(t, z, β) e^{2β^tr z − e^{β^tr z} H(t)} h(t) p(z) dz.  (2.39)

From (2.36), (2.38), and (2.39), we know that as δ → 0, the second line of (2.38) converges to zero for all β ∈ U_{β₀}, i.e.,

lim_{δ→0} (1/δ) ∫ { m(t+δ, z, β) − m(t, z, β) } e^{β^tr z − e^{β^tr z} H(t+δ)} p(z) dz ≡ 0.  (2.40)

Write the left hand side of (2.40) as the sum of

(1/δ) ∫ { m(t+δ, z, β) − m(t, z, β) } e^{β^tr z − e^{β^tr z} H(t)} p(z) dz  (2.40.a)

and

(1/δ) ∫ { m(t+δ, z, β) − m(t, z, β) } { e^{β^tr z − e^{β^tr z} H(t+δ)} − e^{β^tr z − e^{β^tr z} H(t)} } p(z) dz.  (2.40.b)

As δ → 0, (2.40.a) converges to ∇_t { ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz } |_{s=H(t)}, which is the left hand side of (2.35) when H(t) is regarded as a free variable s ∈ R⁺.
Therefore it suffices to prove that as δ → 0, (2.40.b) converges to 0 for all β ∈ U_{β₀} and h ∈ H*, where H* is a dense class of functions in L¹ defined as

H* = { h : h ∈ H; for each h there exists ε > 0 such that h(t) = 0 for all t ∈ (0, ε); h bounded above and below for all t ≥ ε; h′ bounded above for all t ≥ ε }.  (2.41)

Then consider t > δ and write (2.40.b) as

(1/δ) ∫ e^{β^tr z} p(z) { m(t+δ, z, β) e^{−e^{β^tr z} H(t+δ)} − m(t, z, β) e^{−e^{β^tr z} H(t)} } dz
− (1/δ) ∫ e^{β^tr z} p(z) m(t, z, β) { e^{−e^{β^tr z} H(t+δ)} − e^{−e^{β^tr z} H(t)} } dz  (2.42)
− (1/δ) ∫ e^{β^tr z} p(z) e^{−e^{β^tr z} H(t)} { m(t+δ, z, β) − m(t, z, β) } dz.

Then, using the identity (2.34) proved in Lemma 2.3.1 and the additivity property ∫_t^∞ ℓ(s) ds − ∫_{t+δ}^∞ ℓ(s) ds = ∫_t^{t+δ} ℓ(s) ds for any integrable function ℓ(·), we rewrite (2.42) as the sum of

C₁ = ∫_t^{t+δ} ∫ m(u, z, β) e^{2β^tr z} [ h(u−δ) e^{−e^{β^tr z} H(u−δ)} − h(u) e^{−e^{β^tr z} H(u)} ] p(z) dz du  (2.43)

and

C₂ = ∫_{t+δ}^∞ ∫ m(u, z, β) e^{2β^tr z} [ 2h(u) e^{−e^{β^tr z} H(u)} − h(u−δ) e^{−e^{β^tr z} H(u−δ)} − h(u+δ) e^{−e^{β^tr z} H(u+δ)} ] p(z) dz du.  (2.44)

First, by the Mean Value Theorem, the square-bracketed integrand in C₁ is O(δ), bounded by

M₂ δ [ |h′(u−θδ)| e^{−e^{β^tr z} H(u−θδ)} + h²(u−θδ) e^{−e^{β^tr z} H(u−θδ)} ],  (2.45)

where M₂ is a uniform upper bound for e^{β^tr z}, h ∈ H*, and θ ∈ (0, 1). Let k be a function and C₁* a constant such that for all δ ∈ (0, ε), C₁* k(u) is bounded below by

a constant M₁ ∈ R⁺ when u < ε;  sup_{θ∈[−1,1]} ( |h′(u−θδ)| + h²(u−θδ) ) when u > ε.  (2.46)

Let K(u) be such that K(u) ≤ H(u−ε) for all u ≥ ε. Then (2.45) is bounded by

M₂ δ C₁* k(u) e^{−e^{β^tr z} K(u)}, for all u > ε, |θ| ≤ 1, 0 < δ < ε.

So the fact that the outer integral in C₁ is taken over the shrinking interval (t, t+δ) allows us to conclude that C₁ is bounded by δ times the dz du-integrable function

‖m(u, z, β)‖ e^{2β^tr z} M₂ C₁* k(u) e^{−e^{β^tr z} K(u)} p(z),

hence C₁ = o(δ). Now we move on to C₂.
Using the Mean Value Theorem again, the square-bracketed integrand in C₂ is bounded by

δ² | ∂²/∂x² ( h(x) e^{−e^{β^tr z} H(x)} ) |_{x=u+θ*δ} |,  (2.47)

where h ∈ H* and |θ*| ≤ 1 but cannot be controlled further. Let k be a function and C₂* a constant such that C₂* k(u) is bounded below by

a constant M₂ ∈ R⁺ when u < ε;  sup_{θ∈[−1,1]} ( |h″(u+θδ)|, h(u+θδ)|h′(u+θδ)|, h³(u+θδ) ) when u > ε,  (2.48)

for all 0 < δ < ε. Let K(u) be such that K(u) ≤ H(u−ε) for all u ≥ ε. Then (2.47) is controlled by

δ² C₂* k(u) e^{−e^{β^tr z} K(u)}, for all u > ε, |θ*| ≤ 1, 0 < δ < ε.

Thus, even though the range of integration is now not small, C₂ can be shown (with h ∈ H*) to be O(δ²). ∎

Lemma 2.3.3 (Weierstrass) Suppose f is a continuous real-valued function defined on a real interval [a, b]. For every ε > 0, there exists a polynomial p_f(x) such that for all x in [a, b], |f(x) − p_f(x)| < ε, or equivalently, sup_{x∈[a,b]} |f(x) − p_f(x)| < ε.

2.3.1 Proof of Theorem 2.1.1

Multiply both sides of (2.35) by a(t) ∈ A, where

A ≡ { a(t) : R → R; a is compactly supported and continuously differentiable on R⁺ },  (2.49)

then integrate with respect to t, yielding for all a ∈ A, β ∈ U_{β₀} and s ∈ R⁺,

∫ a(t) ∂/∂t { ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz } dt ≡ 0.  (2.50)

Integrate (2.50) by parts, yielding for all a ∈ A, β ∈ U_{β₀} and s ∈ R⁺,

∫ a′(t) ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz dt ≡ 0.  (2.51)

Now multiply both sides of (2.34) by a′(t) and integrate with respect to t. Together with equation (2.51), we conclude that for all a ∈ A, β ∈ U_{β₀} and h ∈ H,

0 ≡ ∫ a′(t) { ∫_t^∞ ∫ m(u, z, β) e^{2β^tr z − e^{β^tr z} H(u)} h(u) p(z) dz du } dt.  (2.52)

Integrating (2.52) by parts, we then know that for all a ∈ A, β ∈ U_{β₀} and h ∈ H,

0 ≡ ∫_0^∞ a(t) ∫ m(t, z, β) e^{2β^tr z − e^{β^tr z} H(t)} h(t) p(z) dz dt.  (2.53)

Therefore, for all β ∈ U_{β₀} and s, t ∈ R⁺,

0 ≡ ∫ m(t, z, β) e^{2β^tr z − e^{β^tr z} s} p(z) dz,  (2.54)

where we replaced H(t) by s ∈ R⁺ since H is free to be any function in H.
Integrating both sides of (2.54) with respect to s from x to ∞ implies

0 ≡ ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} x} p(z) dz.  (2.55)

After integrating again with respect to x on (s, ∞), or (separately) by differentiating equation (2.55) under the integral sign arbitrarily many times, the same formula (2.55) holds with the factor e^{β^tr z} replaced by e^{kβ^tr z} for any non-negative integer k. Let b(·) be any element of B, the continuous functions on the real line, and apply the Weierstrass theorem stated in Lemma 2.3.3 to approximate b by polynomials uniformly on the compact set of possible β^tr z values, yielding

0 ≡ ∫ m(t, z, β) b(β^tr z) p(z) dz, for all t ∈ R⁺, β ∈ U_{β₀}, b ∈ B.  (2.56)

Now let us demonstrate the singularity of the matrix E_{β₀,λ₀}{∇_β m(t, z, β₀)}, which is a violation of (2.3) with β = β₀ and thus proves the non-existence of the CEE. Differentiate both sides of (2.55) with respect to β^tr, i.e., take the Jacobian of the function m(t, z, ·), yielding for all t, s ∈ R⁺ and β ∈ U_{β₀},

0 ≡ ∫ ∇_β m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz + ∫ m(t, z, β) c_s(β^tr z) z p(z) dz,  (2.57)

where c_s(β^tr z) z = ∇_β ( e^{β^tr z − e^{β^tr z} s} ). Multiply both sides of (2.57) by β^tr, use equation (2.56) with the choice c_s(β^tr z) β^tr z ∈ B, and then set s = H(t). Then for all t ∈ R⁺, h ∈ H and β ∈ U_{β₀},

0 ≡ β^tr ∫ ∇_β m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz.  (2.58)

Equation (2.58) holds for the specific choice β = β₀, which contradicts the non-singularity of the matrix ∫ ∇_β m(t, z, β₀) e^{β₀^tr z − e^{β₀^tr z} s} p(z) dz as described in (2.3). ∎

2.3.2 Proof of Theorem 2.2.1

Note that (2.20) is true for any hazard rate function λ_C; therefore it is satisfied when C puts its mass at a point c, i.e., for a constant c ∈ R⁺, Λ_C(s) = Λ_c(s), where

Λ_c(s) = 0 when s < c;  ∞ when s > c.  (2.59)
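The substitution of the point-mass hazard (2.59) into (2.20) uses two elementary facts, spelled out here since they carry the step from (2.59) to the next display:

```latex
% With \Lambda_C = \Lambda_c as in (2.59), the censoring survival
% function and the censoring measure degenerate:
S_C(s) = e^{-\Lambda_c(s)} = I\{s < c\},
\qquad
\lambda_C(c')\, e^{-\Lambda_c(c')}\, dc' = \delta_c(dc')
\quad \text{(a unit point mass at } c' = c\text{)}.
% Substituting into (2.20): in the first term the dc'-integral
% contributes the factor I\{t < c\} = e^{-\Lambda_c(t)}, and the
% second term is simply evaluated at c' = c.
```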
Then (2.20) implies that for any c ∈ R⁺, λ ∈ H and β,

0 ≡ ∫_0^c ∫ m₁(t, z, β) λ(t e^{β^tr z}) e^{β^tr z − Λ(t e^{β^tr z})} p(z) dt dz + ∫ m₀(c, z, β) e^{−Λ(c e^{β^tr z})} p(z) dz
 = ∫ ∫_0^{c e^{β^tr z}} m₁(s e^{−β^tr z}, z, β) λ(s) e^{−Λ(s)} p(z) ds dz + ∫ e^{−Λ(c e^{β^tr z})} m₀(c, z, β) p(z) dz.  (2.60)

Since (2.60) is satisfied by any λ ∈ H, let Λ(s) = Λ_t(s), where Λ_t(s) is defined as in (2.59). Then (2.60) implies

0 ≡ ∫ I{t < c e^{β^tr z}} m₁(t e^{−β^tr z}, z, β) p(z) dz + ∫ I{t > c e^{β^tr z}} m₀(c, z, β) p(z) dz, for all c, t ∈ R⁺ and β.  (2.61)

Next, we prove statement (a); then (b) can be shown with the same strategy.

Proof of (a) In this part, we assume m₀(c, z, β) ≡ 0 for all c ∈ R⁺, z ∈ Z and β ∈ U_{β₀}. Hence (2.61) becomes

0 ≡ ∫ I{t < c e^{β^tr z}} m₁(t e^{−β^tr z}, z, β) p(z) dz, for all t ∈ R⁺ and β.  (2.62)

Since (2.62) holds for all t, c ∈ R⁺, consider the class of functions

A ≡ { a(t) : a(t) is essentially bounded, t ∈ R⁺ }.

Then for any a(t) ∈ A, β, and c ∈ R⁺, integrating a(t) multiplied by (2.62) against Lebesgue measure on (0, ∞) gives, for all c ∈ R⁺, a ∈ A and β ∈ U_{β₀},

0 ≡ ∫ ∫_0^{c e^{β^tr z}} a(t) m₁(t e^{−β^tr z}, z, β) p(z) dt dz.  (2.63)

Differentiate (2.63) with respect to c, yielding for all a ∈ A, β ∈ U_{β₀} and a.e. c,

0 ≡ ∫ e^{β^tr z} a(c e^{β^tr z}) m₁(c, z, β) p(z) dz.  (2.64)

By a specific series of choices a(c e^{β^tr z}) = e^{kβ^tr z} I{c e^{β^tr z} ≥ τ} for fixed τ, c ∈ R⁺ and k ∈ Z⁺, we know that (2.64) implies, for all k ∈ Z⁺ and τ ∈ R⁺,

0 = ∫ e^{kβ^tr z} I{c e^{β^tr z} ≥ τ} m₁(c, z, β) p(z) dz.  (2.65)

Letting τ → 0, for all β ∈ U_{β₀}, k ∈ Z⁺ and c ∈ R⁺,

0 = ∫ e^{kβ^tr z} m₁(c, z, β) p(z) dz.  (2.66)

By the Weierstrass theorem described in Lemma 2.3.3, for b ∈ B, β ∈ U_{β₀} and c ∈ R⁺,

0 = ∫ b(β^tr z) m₁(c, z, β) p(z) dz,  (2.67)

where B is the family of differentiable functions with bounded support.
To show the non-existence of CEE by deriving the singularity of

E_{β₀,λ₀}{ ∇_β m₁(T, Z, β₀) },  (2.68)

let us differentiate both sides of (2.67) with respect to β; then for b ∈ B, β ∈ U_{β₀} and c ∈ R⁺,

β^tr ∫ ∇_β m₁(c, z, β) b(β^tr z) p(z) dz = − ∫ β^tr z b′(β^tr z) m₁(c, z, β) p(z) dz = 0,  (2.69)

where the right-hand side vanishes by (2.67), because β^tr z b′(β^tr z) is again a function of β^tr z. Choose b(β^tr z) = e^{β^tr z} h(e^{β^tr z}) e^{−H(e^{β^tr z} c)} and integrate against dc. Then (2.69) implies

β₀^tr E_{β₀,λ₀}{ ∇_β m₁(T, Z, β₀) } ≡ 0.  (2.70)

This contradiction of non-singularity completes the proof of (a).

Proof of (b) In this part, we assume that m₁(t, z, β) ≡ 0 for all t ∈ R⁺, z ∈ Z and β ∈ U_{β₀}. Hence (2.61) becomes

0 ≡ ∫ I{t > c e^{β^tr z}} m₀(c, z, β) p(z) dz, for all c ∈ R⁺ and β ∈ U_{β₀}.  (2.71)

In fact, we can show the contradiction of non-singularity described in (2.70) by following the same steps as in the proof of (a). ∎

Chapter 3: ϕ-Transformation Model

3.1 Introduction

In this chapter, we discuss the ϕ-transformation model, which serves as the most general worked-out example for the Extended Estimating Equations (EEE). Let T_i be the lifetime of the i-th individual for i = 1, ..., n, and consider the model

T_i = ϕ(ε_i, β₀, Z_i), i = 1, ..., n,  (3.1)

where Z_i = (Z_{i1}, ..., Z_{ip})^tr is a covariate, the ε_i are errors that are independently identically distributed (i.i.d.) with common differentiable distribution F(x) and hazard rate function λ(x), β₀ ∈ R^p is the unknown coefficient to be estimated, and ϕ(x, y, z) : R × R^p × R^p → R⁺ is a given function such that ϕ(·, y, z) is strictly increasing and continuously differentiable, and ϕ(x, ·, z) is differentiable. We also assume that the support of T and the support of ε do not depend on the parameter (β, F). In practice, we are sometimes not able to observe the complete lifetime T_i. Let us use C_i to denote the right censoring variable, assumed to be identically distributed with a common distribution function denoted by H. Assume that conditionally given Z_i, T_i and C_i are independent.
When the T_i are subject to right censoring, the data collected are random vectors

(V_i, ∆_i, Z_i), i = 1, ..., n,  (3.2)

where V_i = min(T_i, C_i), and ∆_i = I{T_i ≤ C_i} is equal to 1 when T_i does not exceed the right censoring variable C_i, and 0 otherwise. The ϕ-transformation model relates to many other important models in statistics, and we mention the following examples. First, when ϕ(x, y, z) depends on the structural parameter y only through y^tr z and is linear in y^tr z and x, i.e.,

T_i = β₀^tr Z_i + ε_i,  (3.3)

then the ϕ-transformation model becomes the usual linear model. Nothing needs to be changed when the lifetime T_i is replaced by h(T_i), where h(·) is a known monotone function. When h(T_i) = ln T_i, (3.3) becomes the Accelerated Failure Time (AFT) model; therefore the AFT model is an example of the ϕ-transformation model. A review of the linear model and the AFT can be found on page 8 of Chapter 1. Another important class of models related to the ϕ-transformation model is the nonlinear regression model with additive independent errors ([39], Chapter 5), namely,

T_i = µ(Z_i, β₀) + ε_i,  (3.4)

where µ(·, β₀) is given, T_i is continuous, and ε_i is independent of Z_i. This model assumes that there is a basic underlying distribution for the lifetime, but the location shifts according to the covariate Z. In (3.4), ϕ(ε, Z, β) = µ(Z, β) + ε, and one way to generalize it is to allow a shape change in T_i, i.e.,

h(T_i) = b₀(β₁^tr Z_i) + ε_i b₁(β₂^tr Z_i),

where h : R⁺ → R is a known monotone function, b₀ and b₁ are both known functions, and b₁ is strictly positive. This way, both the location and the shape change according to covariates. It is not necessary that the transformation function ϕ(x, y, z) have an analytical form. A possible choice is to let ϕ(x, y, z) be an integral of a known positive rate function: let b(s, β, z) : R × R^d × R^d → R⁺ be a given function; then

h(T_i) = ϕ(ε_i, Z_i, β₀) = ∫_{−∞}^{ε_i} b(s, β₀, Z_i) ds.
(3.5)

The time T_i is known as "operational time", and model (3.5) may have different interpretations in practice depending on the choice of b(s, β₀, z). For example, b(s, β₀, z) can describe the wearing-out rate of a device; this rate can depend on a structural parameter β₀ and a covariate Z_i that differs from device to device. This type of model is also discussed by Nikulin in [3]. This chapter is organized as follows. In Section 3.2, we establish an EEE for β₀ using martingale theory. In Section 3.3, we prove the consistency and asymptotic normality of the estimator of β₀. Technical lemmas not given in detail in Section 3.3 are postponed to Section 3.4.

3.2 Extended Estimating Equations and Martingales

To construct the EEE that yields an estimator of β₀ for the ϕ-transformation model (3.1), we follow Tsiatis [38] by building a martingale on the residual scale. Let N_i^T(v) be the counting process for the i-th individual's lifetime T_i, and let Y_i^T(v) be the at-risk indicator for lifetime T_i, i.e.,

N_i^T(v) = ∆_i · I{V_i ≤ v};  Y_i^T(v) = I{V_i ≥ v}.  (3.6)

Since ϕ(·, y, z) is strictly increasing, we can define its inverse function ϕ^{−1}(t, y, z) such that

ϕ(u, y, z) = t ⇔ ϕ^{−1}(t, y, z) = u.  (3.7)

Let ζ_i be the residual when T_i is censored, i.e., ζ_i = ϕ^{−1}(C_i, β₀, Z_i). Then we can write the counting process for the residual ε_i as follows:

N_i^ε(u) = I{min(ε_i, ζ_i) ≤ u, ε_i ≤ ζ_i}.  (3.8)

Since ϕ^{−1}(·, y, z) is also strictly increasing,

I{min(ε_i, ζ_i) ≤ u} = I{min(ϕ(ε_i, β₀, Z_i), ϕ(ζ_i, β₀, Z_i)) ≤ ϕ(u, β₀, Z_i)} = I{min(T_i, C_i) ≤ ϕ(u, β₀, Z_i)},  (3.9)

and

I{ε_i ≤ ζ_i} = I{ϕ(ε_i, β₀, Z_i) ≤ ϕ(ζ_i, β₀, Z_i)} = I{T_i ≤ C_i} = ∆_i.  (3.10)

From (3.6)-(3.10), we know that

N_i^ε(u) = I{min(T_i, C_i) ≤ ϕ(u, β₀, Z_i), ∆_i = 1} = N_i^T(ϕ(u, β₀, Z_i)).  (3.11)

As for the at-risk indicator Y_i^ε(u),

Y_i^ε(u) = I{min(ε_i, ζ_i) ≥ u} = I{min(ϕ(ε_i, β₀, Z_i), ϕ(ζ_i, β₀, Z_i)) ≥ ϕ(u, β₀, Z_i)} = I{min(T_i, C_i) ≥ ϕ(u, β₀, Z_i)} = Y_i^T(ϕ(u, β₀, Z_i)).  (3.12)
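The identities (3.11)-(3.12) can be checked numerically. The sketch below is illustrative only: the transformation ϕ(u, β, z) = exp(u + βz), its inverse, and all simulation parameters are hypothetical choices, not taken from the dissertation.

```python
# Numeric check (illustrative, not from the dissertation) of the
# residual-scale identities (3.11)-(3.12) for the hypothetical choice
# phi(u, beta, z) = exp(u + beta*z), whose inverse is log(t) - beta*z.
import math
import random

def phi(u, beta, z):      return math.exp(u + beta * z)
def phi_inv(t, beta, z):  return math.log(t) - beta * z

random.seed(1)
beta0 = 0.7
# Simulate (V_i, Delta_i, Z_i) from T_i = phi(eps_i, beta0, Z_i) with censoring,
# keeping the residuals eps_i and zeta_i = phi_inv(C_i, beta0, Z_i) on the side.
data = []
for _ in range(200):
    z = random.uniform(-1.0, 1.0)
    T = phi(random.gauss(0.0, 1.0), beta0, z)
    C = random.expovariate(0.5)
    data.append((min(T, C), 1 if T <= C else 0, z,
                 phi_inv(T, beta0, z), phi_inv(C, beta0, z)))

ok = True
for u in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    for V, Delta, z, eps, zeta in data:
        # Residual-scale processes built directly from (eps_i, zeta_i), as in (3.8):
        N_eps = 1 if (min(eps, zeta) <= u and eps <= zeta) else 0
        Y_eps = 1 if min(eps, zeta) >= u else 0
        # Lifetime-scale processes evaluated at phi(u, beta0, z), per (3.11)-(3.12):
        N_T = Delta * (1 if V <= phi(u, beta0, z) else 0)
        Y_T = 1 if V >= phi(u, beta0, z) else 0
        ok = ok and (N_eps == N_T) and (Y_eps == Y_T)
```

Because ϕ(·, β, z) is strictly increasing, each residual-scale indicator agrees with the corresponding lifetime-scale indicator at the transformed time point, which is exactly what (3.11) and (3.12) assert.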
Since the counting process and at-risk indicator for ε_i and T_i are related as in (3.11) and (3.12), from now on we will mainly use N_i^T(·) and Y_i^T(·) to construct estimators, and write N_i(·) = N_i^T(·), Y_i(·) = Y_i^T(·) for the sake of simplicity. Assume that the V_i are nondegenerate; then there exists a constant T* such that for some ξ > 0,

P{V_i ≥ T* + ξ} ≥ ψ > 0, for all i.  (3.13)

Let us introduce additional notation to be used throughout this chapter. For a function g(x₁, x₂, x₃), let ∇_i g(x₁, x₂, x₃) = ∂g(x₁, x₂, x₃)/∂x_i. Define

J(u, β₁, β₂, z) ≡ ϕ^{−1}(ϕ(u, β₁, z), β₂, z).  (3.14)

Note that by the chain rule, as shown in Lemma 3.4.1,

γ(u, β₀, z) = ∇₂ J(u, β, β₀, z)|_{β=β₀} = ∇₂ϕ(u, β₀, z) / ∇₁ϕ(u, β₀, z).  (3.15)

In addition, assume the following regularity conditions:

(A.1) The density function of ε, f(x) = dF(x)/dx, exists and is bounded by K₁ on (−∞, T* + ξ], where T*, ξ are as in (3.13). The hazard rate λ(x) is twice differentiable.

(A.2) The density function of C_i exists, and h(x) = −dH(x)/dx ≤ K₂ for all x ≤ T* + ξ.

(A.3) There exists θ(u, β₀, Z) with E|θ(u, β₀, Z)| < ∞ such that

|λ(J(u, β, β₀, Z_i)) − λ(u) − γ(u, β₀, Z_i)^tr (β − β₀) λ′(u)| ≤ ‖β − β₀‖² θ(u, β₀, Z_i),  (3.16)

for β ∈ U_{β₀}(n^{−1/2}) and u ∈ R, almost surely in Z_i, where U_{β₀}(n^{−1/2}) is an n^{−1/2}-neighbourhood of β₀.

(A.4) There exists a constant c such that P{‖Z_i‖ < c} = 1.

(A.5) Let µ_{β₀}(u, β) = E{Z₁ I{V₁ > ϕ(u, β, Z₁)}} / E{I{V₁ > ϕ(u, β, Z₁)}}. Then

sup_{β ∈ U_{β₀}(n^{−1/2}), u ≤ T*+ξ} ‖Z̄(u, β) − µ_{β₀}(u, β)‖ → 0 in probability, as n → ∞,

where

Z̄(u, β) = [∑_{j=1}^n Z_j Y_j(ϕ(u, β, Z_j))] / [∑_{j=1}^n Y_j(ϕ(u, β, Z_j))].  (3.17)

(A.6) Let A_{β₀}(u, β) ≡ E{ I(V₁ > ϕ(u, β, Z₁)) (Z₁ − Z̄(u, β)) γ(u, β₀, Z₁)^tr }. Then

sup_{β ∈ U_{β₀}(n^{−1/2}), u ≤ T*+ξ} | (1/n) ∑_{i=1}^n Y_i(ϕ(u, β, Z_i)) { Z_i − Z̄(u, β) } γ(u, β₀, Z_i)^tr − A_{β₀}(u, β) |

approaches zero in probability, and for all β ∈ U_{β₀}(n^{−1/2}),

∫_{−∞}^{T*} λ′(u) A_{β₀}(u, β) du is nonsingular.

(A.7) For β ∈ U_{β₀}(n^{−1/2}), E{Λ(J(T*, β₀, β, Z))} = M < ∞.
(A.8) ϕ(·, y, z), ϕ(x, ·, z) and ϕ^{−1}(·, y, z) are all Lipschitz continuous; ϕ(·, y, z) is continuously differentiable, and ∇₂ϕ(x, ·, z) is Lipschitz continuous.

(A.9) ϕ(·, y, z) is strictly increasing.

(A.10) E‖ ∫_{−∞}^{T*} γ(u, β₀, Z_i) λ′(u) du ‖ < ∞.

3.2.1 Remarks and Sufficient Conditions for the Assumptions

First, let us discuss assumption (A.3). Note that the left hand side of (3.16) comes from the Taylor expansion of λ(J(u, ·, β₀, Z_i)) at β = β₀, whose remainder is (β − β₀)^tr ∂²λ(J(u, β, β₀, Z_i))/∂β² |_{β=β*} (β − β₀) for some intermediate point β*. Also recall that J(u, β₁, β₂, z) = ϕ^{−1}(ϕ(u, β₁, z), β₂, z). Therefore one way to guarantee (3.16) in assumption (A.3) is to assume that

(i) ∂²ϕ(u, β, z)/∂β² and ∂²ϕ^{−1}(u, β, z)/∂β² exist;

(ii) λ′(u) is Lipschitz continuous.

As for the boundedness of E|θ(u, β₀, Z_i)|, it is guaranteed if the derivatives in (i) are continuous in z. In assumptions (A.5) and (A.6), the pointwise convergence for fixed β and u can be obtained using the Law of Large Numbers (LLN). We can show that this convergence is uniform in β ∈ U_{β₀}(n^{−1/2}) and u ≤ T* + ξ using empirical process theory. For example, we can show that

F ≡ { f(v, z; u, β) : I{v ≥ ϕ(u, β, z)} (z − Z̄(u, β)) }

is a Glivenko-Cantelli class, which can be established, under the assumption that the function ϕ(x, y, z) is Lipschitz continuous in x and y, using Examples 19.11 and 19.20 in [41]. The boundedness of the expected values mentioned in (A.7) and (A.10) is guaranteed by the smoothness of Λ(·) and J(u, β₀, ·, z) and the bounded support of Z. Under assumption (A.8), J(u, β₁, ·, z) is Lipschitz. Since J(u, β₀, β₀, z) = u, we know that |J(u, β₀, β, z)| ≤ |u| + c‖β − β₀‖, hence (A.7) is satisfied. (A.10) is guaranteed if we assume that γ(u, β₀, z) is bounded for any u and z. Define

S_n(β) = ∑_{i=1}^n ∫_{−∞}^{T*} { Z_i − Z̄(u, β) } dN_i(ϕ(u, β, Z_i)),  (3.18)

where Z̄(u, β) is defined in (3.17). From now on, let P denote the probability measure under the true nuisance and structural parameters.
We now construct the martingale in the following proposition.

Proposition 3.2.1 M_i{ϕ(u, β, Z_i)} is a martingale with respect to the filtration

F_n(u, β) = σ( Z_i, I{V_i ≤ ϕ(s, β, Z_i)}, ∆_i I{V_i ≤ ϕ(s, β, Z_i)}, i = 1, ..., n, s ≤ u ),  (3.19)

under P, where

M_i(ϕ(u, β, Z_i)) = N_i(ϕ(u, β, Z_i)) − ∫_{−∞}^{J(u,β,β₀,Z_i)} λ(x) Y_i(ϕ(x, β₀, Z_i)) (ρ(x, β₀, β, Z_i))^{−1} dx,  (3.20)

and

ρ(u, β, β₀, Z_i) = ∇₁ J(u, β, β₀, Z_i) = ∇₁ϕ(u, β, Z_i) / ∇₁ϕ(J(u, β, β₀, Z_i), β₀, Z_i).  (3.21)

Proposition 3.2.1 is a direct consequence of the compensated martingale associated with the counting process N_i(ϕ(u, β, Z_i)); the details of the calculation can be found in Section 3.4. As a special case of Proposition 3.2.1, when β = β₀,

dM_i(ϕ(u, β₀, Z_i)) = dN_i(ϕ(u, β₀, Z_i)) − λ(u) Y_i(ϕ(u, β₀, Z_i)) du  (3.22)

is a martingale differential with respect to the measure P and the filtration F_n(u, β₀) defined in (3.19). Since S_n(β₀) = ∑_{i=1}^n ∫_{−∞}^{T*} { Z_i − Z̄(u, β₀) } dM_i(ϕ(u, β₀, Z_i)), it follows that

E_{β₀}{S_n(β₀)} = 0.  (3.23)

The martingale Central Limit Theorem (CLT) implies that n^{−1/2} S_n(β₀) is asymptotically normal with mean zero and variance σ²(β₀), where

σ²(β₀) = ∫_0^{T*} (Z₁ − Z̄(u, β₀))^{⊗2} λ(u) P{V₁ ≥ ϕ(u, β₀, Z₁)} du.  (3.24)

Now, (3.18) is in the form of the extended estimating equations defined in the previous chapter. We can rewrite (3.18) as

S_n(β) = ∑_{i=1}^n ∆_i { Z_i − [∑_{j=1}^n Z_j I{ϕ^{−1}(V_j, β, Z_j) ≥ ϕ^{−1}(V_i, β, Z_i)}] / [∑_{j=1}^n I{ϕ^{−1}(V_j, β, Z_j) ≥ ϕ^{−1}(V_i, β, Z_i)}] }.  (3.25)

With the choices X_i = (V_i, Z_i, ∆_i),

Q(X_i, β) = ∆_i,  C(X_i) = Z_i,  k(X_i, X_j, β) = I{ ϕ^{−1}(V_j, β, Z_j) ≥ ϕ^{−1}(V_i, β, Z_i) },

equation (3.18) is exactly in the form of ∑_{i=1}^n m_n(X_i, X, β) mentioned in (1.10), where

m_n(X_i, X, β) = Q(X_i, β) { C(X_i) − [∑_{j=1}^n C(X_j) k(X_i, X_j, β)] / [∑_{j=1}^n k(X_i, X_j, β)] },

with mean zero under the true parameter value, as shown in (3.23).

Corollary 3.2.1 Under the ϕ-transformation model assumption, S_n(β) = 0 is an extended estimating equation, where S_n(β) is defined in (3.25).
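The form (3.25) is directly computable from data. The following minimal sketch (illustrative, not from the dissertation) evaluates S_n(β) for p = 1 under the hypothetical transformation ϕ(u, β, z) = exp(u + βz), so ϕ^{−1}(t, β, z) = log t − βz; the small data set is made up for demonstration.

```python
# A minimal numeric sketch (illustrative, not from the dissertation) of the
# EEE criterion (3.25) for p = 1, under the hypothetical transformation
# phi(u, beta, z) = exp(u + beta*z), so phi_inv(t, beta, z) = log(t) - beta*z.
import math

def S_n(beta, V, Delta, Z):
    """Evaluate (3.25): sum over uncensored i of Z_i minus the average of
    Z_j over the residual-scale risk set {j : r_j >= r_i}."""
    r = [math.log(v) - beta * z for v, z in zip(V, Z)]  # residuals phi_inv(V_i, beta, Z_i)
    total = 0.0
    for i in range(len(V)):
        if Delta[i] == 0:
            continue
        risk = [j for j in range(len(V)) if r[j] >= r[i]]
        total += Z[i] - sum(Z[j] for j in risk) / len(risk)
    return total

# Tiny illustrative data set.
V = [0.8, 1.5, 2.3, 0.6, 3.1]
Delta = [1, 0, 1, 1, 1]
Z = [0.2, -0.4, 1.0, 0.7, -0.1]

# S_n is a step function of beta (note the indicators inside (3.25)), so
# instead of root finding we minimize S_n(beta)^2 over a grid, in the
# spirit of the minimization definition used in the next section.
grid = [i / 100.0 for i in range(-300, 301)]
beta_hat = min(grid, key=lambda b: S_n(b, V, Delta, Z) ** 2)
```

Note that when all covariates Z_i coincide, every term Z_i − Z̄ vanishes and S_n(β) ≡ 0, which is a quick sanity check on the implementation.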
3.3 Estimator for the Structural Parameter

In the previous section, we constructed an extended estimating equation S_n(β) in equation (3.18). The goal of this section is to establish the √n-consistency of β̂_n, the estimator for β₀ obtained via the extended estimating equation S_n(β). In Theorem 3.3.3, we prove that √n(β̂_n − β₀) is asymptotically normal with mean zero. Note that S_n(β) is a step function, which raises two problems. First, there may not exist a root of the equation S_n(β) = 0. Second, the usual Taylor expansion method does not apply to S_n(β). Consequently, we adopt the definition of β̂_n given by Jurečková in Section 4 of [24]. Namely, for S_n(β) = (S_{n,1}(β), ..., S_{n,p}(β))^tr,

β̂_n = arg min_{β ∈ U_{β₀}} ∑_{j=1}^p {S_{n,j}(β)}²,  (3.26)

where U_{β₀} is a neighbourhood of β₀ in which there is a unique solution to the minimization problem (3.26). In the case when the minimization problem (3.26) has more than one solution, the estimator β̂_n is defined as the one with the smaller lexicographic norm, i.e., if both vectors b̂_i = (b̂_{i,1}, ..., b̂_{i,p}) satisfy (3.26) for i = 1, 2, and b̂_{1,j} = b̂_{2,j} for j = 1, ..., k with k < p, and if b̂_{1,k+1} < b̂_{2,k+1}, then β̂_n = b̂₁. Next, we show that S_n(β) is asymptotically linear in a neighbourhood of β₀. Let

g(β) = ∫_{−∞}^{T*} λ′(u) A_{β₀}(u, β) du,  (3.27)

where A_{β₀}(u, β) is the function defined in assumption (A.6) of the previous section. Define a linear function of β as follows:

S̃_n(β) = S_n(β₀) + n g(β₀)(β − β₀).  (3.28)

Let β*_n be the root of S̃_n(β) = 0. Consider β ∈ U_{β₀}(n^{−1/2}), an n^{−1/2}-neighbourhood of β₀. If we can show that S_n(β) is "asymptotically equivalent" to S̃_n(β), i.e., the l^∞ norm of S_n(β) − S̃_n(β) converges to zero in probability, then β̂_n is also "asymptotically equivalent" to β*_n; namely, they are both asymptotically normal with mean zero and identical asymptotic variance.
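The definition (3.26) with its lexicographic tie-breaking rule can be illustrated with a small grid search. The sketch below is purely illustrative: `objective` stands in for β ↦ ∑_j {S_{n,j}(β)}², and the grid, tolerance, and example objectives are hypothetical.

```python
# Illustration (not the dissertation's algorithm) of definition (3.26):
# minimize the summed squared EEE coordinates over a candidate grid, and
# break exact ties lexicographically.  `objective` is a hypothetical
# stand-in for beta -> sum_j S_{n,j}(beta)^2.

def argmin_lex(candidates, objective, tol=1e-12):
    """Return the candidate tuple (playing the role of beta) minimizing
    `objective`; among near-ties, return the lexicographically smallest,
    matching the tie-breaking rule under (3.26)."""
    best = min(objective(c) for c in candidates)
    ties = [c for c in candidates if objective(c) <= best + tol]
    return min(ties)  # Python tuples already compare lexicographically

# Example with p = 2: an objective that is flat in the second coordinate
# forces a tie along a whole line of minimizers.
grid = [(x / 10.0, y / 10.0) for x in range(-10, 11) for y in range(-10, 11)]
obj = lambda b: (b[0] - 0.5) ** 2     # any b with b[0] = 0.5 is a minimizer
beta_hat = argmin_lex(grid, obj)      # lexicographically smallest minimizer
```

Here every grid point with first coordinate 0.5 attains the minimum, and the rule selects the one with the smallest second coordinate, exactly as the lexicographic convention prescribes.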
From (3.28), we know that if g(β₀) is nonsingular, then for β ∈ U_{β₀}(n^{−1/2}),

√n (β*_n − β₀) = −n^{−1/2} {g(β₀)}^{−1} S_n(β₀)

is asymptotically normal with mean zero and variance {g(β₀)}^{−1} σ²(β₀) {g(β₀)}^{−1}. If we can show that √n(β̂_n − β*_n) → 0 in probability, then this would imply that √n(β̂_n − β₀) has the same asymptotic distribution as √n(β*_n − β₀). As argued by Jurečková in [23] and [24], it suffices to show that

sup_{β ∈ U_{β₀}(n^{−1/2})} n^{−1/2} |S_n(β) − S̃_n(β)| → 0 in probability.  (3.29)

We will show (3.29) in two steps: in Theorem 3.3.1 we show the pointwise convergence in (3.29), and then we obtain the uniformity in β in Theorem 3.3.2. Let us start with the first step, i.e., the pointwise convergence, by writing S_n(β) as the sum of

S_{n1}(β) = ∑_{i=1}^n ∫_{−∞}^{T*} { Z_i − Z̄(u, β) } dM_i{ϕ(u, β, Z_i)}  (3.30)

and

S_{n2}(β) = ∑_{i=1}^n ∫_{−∞}^{T*} { λ(J(u, β, β₀, Z_i)) ρ(u, β, β₀, Z_i) − λ(u) } Y_i{ϕ(u, β, Z_i)} { Z_i − Z̄(u, β) } du.  (3.31)

We show that n^{−1/2} |S_{n1}(β) − S_n(β₀)| → 0 in probability for fixed β ∈ U_{β₀}(n^{−1/2}), using Lenglart's inequality and the predictable variation process in Lemmas 3.4.2 and 3.4.3, and that n^{−1/2} |S_{n2}(β) − n g(β₀)(β − β₀)| → 0 in probability in Lemmas 3.4.4 and 3.4.5, using the Lipschitz assumption in (A.8). Careful statements and proofs of Lemmas 3.4.2-3.4.5 can be found in Section 3.4. We are now ready to state the pointwise asymptotic linearity of S_n(β) for any fixed β ∈ U_{β₀}(n^{−1/2}).

Theorem 3.3.1 Under assumptions (A.1), (A.3), and (A.5)-(A.8), for any fixed β_n that belongs to U_{β₀}(n^{−1/2}),

n^{−1/2} |S_n(β_n) − S̃_n(β_n)| → 0 in probability.  (3.32)

Proof: Note that we can write S_n(β) = S_{n1}(β) + S_{n2}(β), and that S̃_n(β) = S_n(β₀) + n g(β₀)(β − β₀). Therefore the proof is complete since

n^{−1/2} |S_{n1}(β_n) − S_n(β₀)| → 0 in probability,  (3.33)

n^{−1/2} |S_{n2}(β_n) − n g(β₀)(β_n − β₀)| = n^{−1/2} · n · ‖β_n − β₀‖ o_p(1) → 0 in probability,  (3.34)

where (3.33) and (3.34) follow from Lemmas 3.4.3 and 3.4.5, respectively. In Theorem 3.3.1 we have shown the pointwise linearity of S_n(β).
The next theorem guarantees that this linearity is uniform for $\beta$ in a small neighbourhood of $\beta_0$.

Theorem 3.3.2 Under assumptions (A.1)-(A.8), for any $\varepsilon > 0$ there exists $\delta > 0$ such that
$$\lim_{n \to \infty} P\left\{ \sup_{0 \le \|\beta^* - \beta_n\| \le \delta n^{-1/2}} n^{-1/2} \|S_n(\beta^*) - S_n(\beta_n)\| \ge \varepsilon \right\} = 0, \qquad (3.35)$$
for any $|d| < C$.

The proof of the theorem uses the same technique as Tsiatis in [38]: for a sufficiently fine partition of the interval $[-C, C]$, the function $S_n(\beta)$ does not fluctuate too much within the sub-intervals. Details of the proof are presented in Section 3.4. We can now conclude the $\sqrt{n}$-consistency of $\hat\beta_n$ announced at the beginning of this section.

Theorem 3.3.3 Let $\hat\beta_n$ be the solution to
$$\hat\beta_n = \arg\min_{\beta \in U_{\beta_0}} \sum_{j=1}^{p} \{S_{n,j}(\beta)\}^2, \qquad (3.36)$$
where $U_{\beta_0}$ is a neighbourhood of $\beta_0$ on which $\hat\beta_n$ is unique. Then under (A.1)-(A.8),
$$\sqrt{n}(\hat\beta_n - \beta_0) \xrightarrow{D} N\big(0,\ \{g(\beta_0)\}^{-1} \sigma^2(\beta_0) \{g(\beta_0)\}^{-1}\big).$$
Proof: Let $\beta^*_n$ be the solution to $\tilde S_n(\beta) = 0$, where
$$\tilde S_n(\beta) = S_n(\beta_0) + n g(\beta_0)(\beta - \beta_0). \qquad (3.37)$$
If $\sigma^2(\beta_0)$ denotes the asymptotic variance of $n^{-1/2} S_n(\beta_0)$ mentioned below (3.23), then
$$\sqrt{n}(\beta^*_n - \beta_0) = n^{-1/2} \{g(\beta_0)\}^{-1} S_n(\beta_0) \qquad (3.38)$$
is asymptotically normal with mean zero and variance $\{g(\beta_0)\}^{-1} \sigma^2(\beta_0) \{g(\beta_0)\}^{-1}$. On the other hand, by Theorem 3.3.2,
$$\sup_{\beta \in U_{\beta_0}(n^{-1/2})} n^{-1/2} |S_n(\beta) - \tilde S_n(\beta)| \xrightarrow{P} 0, \qquad (3.39)$$
which, as proved by Jurečková in [23], implies $\sqrt{n}(\hat\beta_n - \beta^*_n) \xrightarrow{P} 0$. As a result, $\sqrt{n}(\hat\beta_n - \beta_0)$ follows the same asymptotic distribution as $\sqrt{n}(\beta^*_n - \beta_0)$, and the proof is complete. $\square$

3.4 Some Proofs

Lemma 3.4.1 Let $J(u, \beta_1, \beta_2, z) = \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z)$. Then:

(J1) $\nabla_1 J(u, \beta_1, \beta_2, z) = \rho(u, \beta_1, \beta_2, z)$ and $\nabla_2 J(u, \beta_1, \beta_2, z) = \gamma(u, \beta_1, \beta_2, z)$, where
$$\rho(u, \beta_1, \beta_2, z) = \frac{\nabla_1 \phi(u, \beta_1, z)}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)}, \qquad \gamma(u, \beta_1, \beta_2, z) = \frac{\nabla_2 \phi(u, \beta_1, z)}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)}. \qquad (3.40)$$

(J2) Let $x = J(u, \beta_1, \beta_2, z)$; then $u = J(x, \beta_2, \beta_1, z)$, and
$$\rho(u, \beta_1, \beta_2, z) = \frac{1}{\rho(x, \beta_2, \beta_1, z)}. \qquad (3.41)$$

(J3) $J(u, \cdot, \beta_2, z)$ is Lipschitz.
(J4) There exists a constant $c$ such that $|\rho(u, \beta_1, \beta_2, z) - 1| \le c \|\beta_1 - \beta_2\|$.

Proof: (J1) By the chain rule,
$$\nabla_1 J(u, \beta_1, \beta_2, z) = \nabla_1 \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z) \cdot \nabla_1 \phi(u, \beta_1, z), \qquad \nabla_2 J(u, \beta_1, \beta_2, z) = \nabla_1 \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z) \cdot \nabla_2 \phi(u, \beta_1, z). \qquad (3.42)$$
Let $w = \phi(u, \beta_1, z)$. Since $\phi(\phi^{-1}(w, \beta_2, z), \beta_2, z) = w$, differentiating both sides with respect to $w$ yields $\nabla_1 \phi(\phi^{-1}(w, \beta_2, z), \beta_2, z) \cdot \nabla_1 \phi^{-1}(w, \beta_2, z) = 1$. Therefore
$$\nabla_1 \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z) = \nabla_1 \phi^{-1}(w, \beta_2, z) = \frac{1}{\nabla_1 \phi(\phi^{-1}(w, \beta_2, z), \beta_2, z)} = \frac{1}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)}. \qquad (3.43)$$
Then (3.40) is obtained by plugging (3.43) into (3.42).

(J2) If $x = J(u, \beta_1, \beta_2, z) = \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z)$, then $u = \phi^{-1}(\phi(x, \beta_2, z), \beta_1, z) = J(x, \beta_2, \beta_1, z)$. As for (3.41), it is true since
$$\rho(u, \beta_1, \beta_2, z) = \frac{\nabla_1 \phi(u, \beta_1, z)}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)} = \frac{\nabla_1 \phi(J(x, \beta_2, \beta_1, z), \beta_1, z)}{\nabla_1 \phi(x, \beta_2, z)} = \frac{1}{\rho(x, \beta_2, \beta_1, z)}.$$

(J3) Let $b_1$ and $b_2$ be two distinct points in $U_{\beta_0}$. Under assumption (A.8), $\phi^{-1}(\cdot, \beta_2, z)$ and $\phi(u, \cdot, z)$ are both Lipschitz, so there exist constants $c_1$ and $c_2$ such that
$$|J(u, b_1, \beta_2, z) - J(u, b_2, \beta_2, z)| = |\phi^{-1}(\phi(u, b_1, z), \beta_2, z) - \phi^{-1}(\phi(u, b_2, z), \beta_2, z)| \le c_1 |\phi(u, b_1, z) - \phi(u, b_2, z)| \le c_1 c_2 \|b_1 - b_2\|.$$

(J4) By the definition of $\rho(u, \beta_1, \beta_2, z)$ in (3.21),
$$|\rho(u, \beta_1, \beta_2, z) - 1| = \left| \frac{\nabla_1 \phi(u, \beta_1, z)}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)} - 1 \right| = \frac{|\nabla_1 \phi(u, \beta_1, z) - \nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)|}{|\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)|}. \qquad (3.44)$$
Since $\phi(\cdot, x, y)$ is assumed strictly increasing, the denominator of (3.44) is strictly greater than a positive constant $m$. Let $D$ denote the numerator of (3.44); then (3.44) is bounded by $D/m$. Since both $\nabla_1 \phi(\cdot, y, z)$ and $\nabla_2 \phi(x, \cdot, z)$ are Lipschitz continuous under assumption (A.8),
$$D \le |\nabla_1 \phi(u, \beta_1, z) - \nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_1, z)| + |\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_1, z) - \nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)| \le c_1 \|u - J(u, \beta_1, \beta_2, z)\| + c_2 \|\beta_1 - \beta_2\|. \qquad (3.45)$$
Note that we can write $u = J(u, \beta_2, \beta_2, z)$; therefore, by (J3), the first term in (3.45) is bounded by $c_1 c_3 \|\beta_1 - \beta_2\|$.
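Properties (J1) and (J2) can be sanity-checked numerically for one concrete transformation. The sketch below assumes $\phi(u, \beta, z) = u\, e^{\beta z}$ (an illustrative choice satisfying the monotonicity and smoothness requirements of (A.8), not the dissertation's general $\phi$), for which $J(u, \beta_1, \beta_2, z) = u\, e^{(\beta_1 - \beta_2) z}$ and $\rho(u, \beta_1, \beta_2, z) = e^{(\beta_1 - \beta_2) z}$ in closed form.

```python
import math

# Hedged numeric check of Lemma 3.4.1 for the assumed example
# phi(u, beta, z) = u * exp(beta * z).
def phi(u, beta, z):      return u * math.exp(beta * z)
def phi_inv(w, beta, z):  return w * math.exp(-beta * z)
def J(u, b1, b2, z):      return phi_inv(phi(u, b1, z), b2, z)

def rho(u, b1, b2, z):
    # ratio of partial derivatives in the first argument, eq. (3.40)
    return math.exp(b1 * z) / math.exp(b2 * z)

u, b1, b2, z = 0.7, 0.4, 0.1, 1.3
x = J(u, b1, b2, z)

# (J2): J inverts itself, and rho(u,b1,b2,z) * rho(x,b2,b1,z) = 1
inv_ok = abs(J(x, b2, b1, z) - u) < 1e-12
rho_ok = abs(rho(u, b1, b2, z) * rho(x, b2, b1, z) - 1.0) < 1e-12

# (J1): finite-difference derivative of J in u matches rho
h = 1e-6
fd = (J(u + h, b1, b2, z) - J(u - h, b1, b2, z)) / (2 * h)
j1_ok = abs(fd - rho(u, b1, b2, z)) < 1e-6
```

For this $\phi$, $\rho$ is constant in $u$, so (J4) reduces to the Lipschitz continuity of $e^{(\beta_1 - \beta_2) z}$ in $\beta_1 - \beta_2$ on a bounded set.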
Then the conclusion holds with the choice $c = (c_1 c_3 + c_2)/m$. $\square$

3.4.1 Proof of Proposition 3.2.1

Proof: Recall that we would like to prove that $dM^T_i(\phi(u, \beta, Z_i))$ is a martingale with respect to the filtration $\mathcal{F}_n(u, \beta)$ in (3.19), where
$$dM^T_i(\phi(u, \beta, Z_i)) = dN^T_i(\phi(u, \beta, Z_i)) - \lambda(J(u, \beta, \beta_0, Z_i))\, Y^T_i(\phi(u, \beta, Z_i))\, \rho(u, \beta, \beta_0, Z_i)\, du, \qquad (3.46)$$
$N^T_i(u)$ and $Y^T_i(u)$ are defined in (3.6), $\lambda(u)$ is the hazard rate function of $\varepsilon$, and $\rho(u, \beta, \beta_0, Z_i)$ is defined in Lemma 3.4.1. Change the variable in (3.46) by setting
$$u = J(v, \beta_0, \beta, Z_i) = \phi^{-1}(\phi(v, \beta_0, Z_i), \beta, Z_i); \qquad (3.47)$$
then $\phi(u, \beta, Z_i) = \phi(v, \beta_0, Z_i)$, and
$$N^T_i(\phi(u, \beta, Z_i)) = N^T_i(\phi(v, \beta_0, Z_i)) = N^\varepsilon_i(v), \qquad (3.48)$$
where the second equality in (3.48) holds since $N^\varepsilon_i(v) = N^T_i(\phi(v, \beta_0, Z_i))$ as shown in (3.11). By the compensated counting process martingale established in [21], $M^\varepsilon_i(v)$ is a martingale, where
$$dM^\varepsilon_i(v) = dN^\varepsilon_i(v) - \lambda(v)\, Y^\varepsilon_i(v)\, dv. \qquad (3.49)$$
By (3.47), $v = J(u, \beta, \beta_0, Z_i) = \phi^{-1}(\phi(u, \beta, Z_i), \beta_0, Z_i)$, which implies that
$$dv = \nabla_1 J(u, \beta, \beta_0, Z_i)\, du = \rho(u, \beta, \beta_0, Z_i)\, du, \qquad (3.50)$$
where the last equality in (3.50) is guaranteed by (J1) of Lemma 3.4.1. The conclusion then follows from (3.48), (3.49) and (3.50). $\square$

3.4.2 Proof of Lemmas 3.4.2-3.4.5

In Lemmas 3.4.2 and 3.4.3 we show that, for any fixed $\beta$ in an $n^{-1/2}$-neighbourhood of $\beta_0$, the $L^2$ distance between (3.30) and $S_n(\beta_0)$ converges to 0 in probability; Lemmas 3.4.4 and 3.4.5 show the same conclusion for (3.31) and the term $n g(\beta_0)(\beta - \beta_0)$.

Lemma 3.4.2 Let $\beta_n$ be a sequence of nonrandom vectors converging to $\beta_0$. Then under assumptions (A.5) and (A.7),
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left[ \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \left\{ Z_i - \bar Z(u, \beta_n) \right\} - \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \left\{ Z_i - \mu_{\beta_0}(u, \beta_n) \right\} \right] \qquad (3.51)$$
converges to 0 in probability.

Proof: Let us first consider the scalar case $\beta \in \mathbb{R}$.
The expression in (3.51) is equal to $R(T^*)$, where by Proposition 3.2.1,
$$R(u) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \int_{-\infty}^{u} dM_i(\phi(x, \beta_n, Z_i)) \left\{ \bar Z(x, \beta_n) - \mu_{\beta_0}(x, \beta_n) \right\}$$
is a martingale under $P$ with respect to
$$\mathcal{F}_n(u, \beta_n) = \sigma\left\{ Z_i,\ I[V_i \le \phi(u, \beta_n, Z_i)],\ \Delta_i I[V_i \le \phi(u, \beta_n, Z_i)],\ i = 1, \ldots, n \right\}.$$
Therefore, using the vector form of Lenglart's inequality (see Appendix I, I.2 of [2]),
$$P\{|R(T^*)| > \varepsilon\} \le \frac{\delta}{\varepsilon^2} + P\left\{ \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{T^*} (\bar Z(u, \beta_n) - \mu_{\beta_0}(u, \beta_n))^2\, \lambda(J(u, \beta_n, \beta_0, Z_i))\, \rho(u, \beta_n, \beta_0, Z_i)\, Y_i(\phi(u, \beta_n, Z_i))\, du > \delta \right\}. \qquad (3.52)$$
By assumption (A.5), we can find $N(\varepsilon, K)$ such that for any $n > N(\varepsilon, K)$,
$$P\left\{ \sup_{u \le T^*} |\bar Z(u, \beta_n) - \mu_{\beta_0}(u, \beta_n)| > K \right\} < \varepsilon,$$
hence with probability exceeding $1 - \varepsilon$, each integral in (3.52) is bounded by
$$K^2 \int_{-\infty}^{T^*} \lambda(J(u, \beta_n, \beta_0, Z_i))\, \rho(u, \beta_n, \beta_0, Z_i)\, Y_i(\phi(u, \beta_n, Z_i))\, du. \qquad (3.53)$$
Let $x = J(u, \beta_n, \beta_0, Z_i)$; then by properties (J1) and (J2) of Lemma 3.4.1, $u = J(x, \beta_0, \beta_n, Z_i)$, $du = \rho(x, \beta_0, \beta_n, Z_i)\, dx$, and
$$\rho(u, \beta_n, \beta_0, Z_i) = \frac{1}{\rho(x, \beta_0, \beta_n, Z_i)}.$$
Therefore (3.53) is bounded by
$$K^2 \int_{-\infty}^{J(T^*, \beta_n, \beta_0, Z_i)} \lambda(x)\, dx = K^2 \Lambda(J(T^*, \beta_n, \beta_0, Z_i)).$$
Consequently, the average over $n$ of the integral in (3.52) is bounded by
$$K^2 n^{-1} \sum_{i=1}^{n} \Lambda(J(T^*, \beta_n, \beta_0, Z_i))$$
with probability greater than $1 - \varepsilon$. By assumption (A.7), if we choose $K \le (\delta/M)^{1/2}$ and $\delta = \varepsilon^3$, then the probability in (3.52) is smaller than $\varepsilon$ for $n > N(\varepsilon, K)$.

Now consider the vector-valued parameter case, $\beta \in \mathbb{R}^p$. Let $t = (t_1, \ldots, t_p)^{tr}$ be a unit vector, and write
$$R(T^*) = \sum_{i=1}^{n} \eta_i, \qquad \eta_i = (\eta_{i1}, \ldots, \eta_{ip})^{tr}. \qquad (3.54)$$
Then
$$|t^{tr} R(T^*)| = \Big| \sum_{i=1}^{n} t^{tr} \eta_i \Big| = \Big| \sum_{j=1}^{p} t_j \sum_{i=1}^{n} \eta_{ij} \Big| \le \sum_{j=1}^{p} |t_j| \Big| \sum_{i=1}^{n} \eta_{ij} \Big|. \qquad (3.55)$$
Since $\|t\| = 1$ implies $|t_j| \le 1$ for $j = 1, \ldots, p$, therefore
$$|t^{tr} R(T^*)| \le \sum_{j=1}^{p} \Big| \sum_{i=1}^{n} \eta_{ij} \Big|. \qquad (3.56)$$
Therefore
$$P\left\{ \sup_{t \in \mathbb{R}^p, \|t\| = 1} |t^{tr} R(T^*)| \ge \varepsilon \right\} \le P\left\{ \sum_{j=1}^{p} \Big| \sum_{i=1}^{n} \eta_{ij} \Big| \ge \varepsilon \right\} \le p \max_j P\left\{ \Big| \sum_{i=1}^{n} \eta_{ij} \Big| \ge \frac{\varepsilon}{p} \right\}. \qquad (3.57)$$
The probability in (3.57) converges to zero by applying the univariate Lenglart inequality to $P\{|\sum_{i=1}^{n} \eta_{ij}| \ge \varepsilon/p\}$. Therefore the conclusion holds for $\beta \in \mathbb{R}^p$. $\square$

Lemma 3.4.3 Let $\beta_n$ be a sequence of nonrandom vectors converging to $\beta_0$. Then under assumptions (A.4), (A.5), (A.7) and (A.8),
$$\frac{1}{\sqrt{n}} \left\{ \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \{Z_i - \bar Z(u, \beta_n)\} - S_n(\beta_0) \right\} \xrightarrow{P} 0. \qquad (3.58)$$
Proof: The left-hand side of (3.58) can be written as the sum of the following three terms:
$$\frac{1}{\sqrt{n}} \left[ \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \{Z_i - \bar Z(u, \beta_n)\} - \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \{Z_i - \mu_{\beta_0}(u, \beta_n)\} \right], \qquad (3.59)$$
$$\frac{1}{\sqrt{n}} \left[ \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \{Z_i - \mu_{\beta_0}(u, \beta_n)\} - \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_0, Z_i)) \{Z_i - \mu_{\beta_0}(u, \beta_0)\} \right], \qquad (3.60)$$
and
$$\frac{1}{\sqrt{n}} \left[ \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_0, Z_i)) \{Z_i - \mu_{\beta_0}(u, \beta_0)\} - S_n(\beta_0) \right]. \qquad (3.61)$$
By Lemma 3.4.2, (3.59) and (3.61) converge to zero in probability. We focus on the asymptotic behavior of (3.60). Let $u = J(x, \beta_0, \beta_n, Z_i)$; then the first integral in equation (3.60) is
$$\int_{-\infty}^{J(T^*, \beta_n, \beta_0, Z_i)} dM_i(\phi(x, \beta_0, Z_i)) \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}. \qquad (3.62)$$
Then (3.60) is equal to the sum of $A$, $B$ and $C$, where
$$A = -\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(x, \beta_0, Z_i)) \{\mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n) - \mu_{\beta_0}(x, \beta_0)\}, \qquad (3.63)$$
$$B = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \int_{T^*}^{J(T^*, \beta_n, \beta_0, Z_i)} dM_i(\phi(x, \beta_0, Z_i))\, I\{T^* > J(T^*, \beta_0, \beta_n, Z_i)\} \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}, \qquad (3.64)$$
and
$$C = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \int_{J(T^*, \beta_n, \beta_0, Z_i)}^{T^*} dM_i(\phi(x, \beta_0, Z_i))\, I\{T^* < J(T^*, \beta_0, \beta_n, Z_i)\} \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}. \qquad (3.65)$$
Since $A$, $B$ and $C$, integrated up to $u$, are all $\mathcal{F}_n(u, \beta_0)$ martingales, conditionally on the $Z_i$,
$$Var(A) = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \{\mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n) - \mu_{\beta_0}(x, \beta_0)\}^{\otimes 2}\, \lambda(x)\, P\{V_i \ge \phi(x, \beta_0, Z_i)\}\, dx,$$
$$Var(B) = \frac{1}{n} \sum_{i \in \kappa_1} \int_{T^*}^{J(T^*, \beta_n, \beta_0, Z_i)} \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}^{\otimes 2}\, \lambda(x)\, P\{V_i \ge \phi(x, \beta_0, Z_i)\}\, dx,$$
and
$$Var(C) = \frac{1}{n} \sum_{i \in \kappa_2} \int_{J(T^*, \beta_n, \beta_0, Z_i)}^{T^*} \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}^{\otimes 2}\, \lambda(x)\, P\{V_i \ge \phi(x, \beta_0, Z_i)\}\, dx,$$
where the variances are computed using the predictable variation process $\langle dM_i, dM_i \rangle = \lambda(u)\, Y_i(\phi(u, \beta_0, Z_i))$. By the definition of $\mu_{\beta_0}(x, \beta)$ in assumption (A.5) and the boundedness of $Z_i$ assumed in (A.4), $\mu_{\beta_0}(x, \beta)$ is bounded for all $x$ and $\beta$, which implies
$$\|Var(A)\| \le \frac{4}{n} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda(x) S(x)\, dx \le 4.$$
Therefore, by the continuity of $\mu_{\beta_0}(u, \beta)$ and $\phi(x, y)$ together with the dominated convergence theorem, $\|Var(A)\|$ converges to 0, so (3.63) converges to 0 in probability. As for (3.64), since $V_i = \min(T_i, C_i)$,
$$P\{V_i \ge \phi(x, \beta_0, Z_i)\} \le P\{T_i \ge \phi(x, \beta_0, Z_i)\} = P\{\phi^{-1}(T_i, \beta_0, Z_i) \ge x\} = P\{\varepsilon_i \ge x\} = S(x),$$
and since $\lambda(x) S(x) = f(x)$, we know that
$$Var(B) \le \frac{1}{n} \sum_{i \in \kappa_1} \int_{T^*}^{J(T^*, \beta_n, \beta_0, Z_i)} \lambda(x) S(x)\, dx. \qquad (3.66)$$
Since $\lambda(x) S(x) = f(x)$ is bounded, (3.66) implies that
$$Var(B) \le 4 K_1 n^{-1} \sum_{i=1}^{n} \{J(T^*, \beta_n, \beta_0, Z_i) - T^*\}. \qquad (3.67)$$
Note that $T^* = J(T^*, \beta_0, \beta_0, Z_i)$. Since $J(u, \cdot, \beta_2, z)$ is Lipschitz as shown in (J3) of Lemma 3.4.1, there exists a constant $c$ such that each summand on the right-hand side of (3.67) is bounded by $c \|\beta_n - \beta_0\|$, which implies that $Var(B)$ goes to 0 in probability. Using the same approach, we conclude that $Var(C)$ also approaches 0 in probability. Hence the proof is complete. $\square$
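The compensated counting-process martingale underlying these variance calculations can be illustrated by simulation. The sketch below assumes a unit-exponential residual with hazard $\lambda(u) \equiv 1$ (a tractable special case, not the dissertation's general hazard), for which $M(t) = I(T \le t) - \Lambda(\min(T, t)) = I(T \le t) - \min(T, t)$ has mean zero at every $t$.

```python
import numpy as np

# Hedged illustration: for a unit-exponential lifetime T (hazard = 1),
# the compensated process M(t) = I(T <= t) - min(T, t) is a martingale,
# so its marginal mean is zero; the LLN makes the empirical mean small.
rng = np.random.default_rng(1)
T = rng.exponential(size=200_000)

for t in (0.5, 1.0, 2.0):
    M = (T <= t).astype(float) - np.minimum(T, t)
    assert abs(M.mean()) < 0.01   # E[M(t)] = 0 at every fixed t
```

Analytically, $E\,I(T \le t) = 1 - e^{-t}$ and $E\,\min(T, t) = \int_0^t e^{-u}\,du = 1 - e^{-t}$, so the two pieces cancel in expectation.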
Lemma 3.4.4 Let $S_{n2}$ be defined as in (3.31), and let
$$S_{n3} = \sum_{i=1}^{n} \int_{-\infty}^{T^*} \{\lambda(J(u, \beta_n, \beta_0, Z_i)) - \lambda(u)\}\, Y_i(\phi(u, \beta_n, Z_i)) \{Z_i - \bar Z(u, \beta_n)\}\, du. \qquad (3.68)$$
Then under assumptions (A.1) and (A.8),
$$n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)| \xrightarrow{P} 0. \qquad (3.69)$$
Proof: By (3.31) and (3.68),
$$n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)| \le n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda(J(u, \beta_n, \beta_0, Z_i))\, |\rho(u, \beta_n, \beta_0, Z_i) - 1|\, Y_i(\phi(u, \beta_n, Z_i))\, \|Z_i - \bar Z(u, \beta_n)\|\, du$$
$$\le 2 n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda(J(u, \beta_n, \beta_0, Z_i))\, |\rho(u, \beta_n, \beta_0, Z_i) - 1|\, Y_i(\phi(u, \beta_n, Z_i))\, du. \qquad (3.70)$$
By (J4), $|\rho(u, \beta_n, \beta_0, Z_i) - 1| \le c \|\beta_n - \beta_0\|$; therefore, by (3.70),
$$n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)| \le 2 c \|\beta_n - \beta_0\|\, n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda(J(u, \beta_n, \beta_0, Z_i))\, Y_i(\phi(u, \beta_n, Z_i))\, du. \qquad (3.71)$$
Since
$$E\{Y_i(\phi(u, \beta_n, Z_i))\} \le P\{T_i \ge \phi(u, \beta_n, Z_i)\} = P\{\varepsilon \ge J(u, \beta_n, \beta_0, Z_i)\} = S(J(u, \beta_n, \beta_0, Z_i)), \qquad (3.72)$$
we know that
$$n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)| \le 2 c \|\beta_n - \beta_0\|\, n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} f(J(u, \beta_n, \beta_0, Z_i))\, du = 2 c \|\beta_n - \beta_0\|\, n^{-1} \sum_{i=1}^{n} F(J(T^*, \beta_n, \beta_0, Z_i)). \qquad (3.73)$$
By assumption (A.1), $F(J(T^*, \beta_n, \beta_0, Z_i))$ is bounded by $\{c_1 c_2 \|\beta_n - \beta_0\| + 1\}$; therefore, by (3.73), $n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)|$ approaches 0 in probability. $\square$

Lemma 3.4.5 Let $g(\beta_0) = \int_{-\infty}^{T^*} \lambda'(u) A_{\beta_0}(u, \beta)\, du$, and let $S_{n2}(\beta)$ be as in (3.31). Then under assumptions (A.1), (A.3), (A.6) and (A.8),
$$n^{-1} S_{n2}(\beta) = g(\beta_0)(\beta - \beta_0) + o_P(\|\beta_n - \beta_0\|). \qquad (3.74)$$
Proof: By Lemma 3.4.4, it suffices to show that
$$n^{-1} S_{n3} = g(\beta_0)(\beta - \beta_0) + o_P(\|\beta - \beta_0\|). \qquad (3.75)$$
Note that the left-hand side of (3.75) is the sum of
$$n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda'(u)\, Y_i(\phi(u, \beta_0, Z_i)) \{Z_i - \bar Z(u, \beta_n)\}\, \gamma(u, \beta_0, Z_i)^{tr}\, du\, (\beta_n - \beta_0) \qquad (3.76)$$
and
$$n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} Y_i(\phi(u, \beta_0, Z_i)) \{Z_i - \bar Z(u, \beta_n)\} \{\lambda(J(u, \beta_n, \beta_0, Z_i)) - \lambda(u) - \lambda'(u)\, \gamma(u, \beta_0, Z_i)^{tr} (\beta_n - \beta_0)\}\, du, \qquad (3.77)$$
where $\gamma(u, \beta_0, Z_i)$ is defined in (3.15). By assumption (A.6), (3.76) converges to
$$\int_{-\infty}^{T^*} \lambda'(u) A_{\beta_0}(u, \beta_0)\, du\, (\beta_n - \beta_0) = g(\beta_0)(\beta_n - \beta_0). \qquad (3.78)$$
As for (3.77), it is bounded by
$$2 n^{-1} \|\beta_n - \beta_0\|^2 \sum_{i=1}^{n} \int_{-\infty}^{T^*} |\theta(u, \beta_n, Z_i)|\, du. \qquad (3.79)$$
Since the average over $n$ in (3.79) is bounded in probability by assumption (A.3), the proof is completed by (3.78) and (3.79). $\square$

3.4.3 Proof of Theorem 3.3.2

First, consider a univariate $\beta \in \mathbb{R}$. We form a mesh with spacing approaching zero from $-C$ to $C$ using a finite number of points $d_0, \ldots, d_m$. By Theorem 3.3.1, for $\beta_{n,i} = \beta_0 + d_i/\sqrt{n}$,
$$\max_{i \le m} n^{-1/2} \|S_n(\beta_{n,i}) - \tilde S_n(\beta_{n,i})\| \xrightarrow{P} 0. \qquad (3.80)$$
In order to prove (3.35), we must show that $n^{-1/2} S_n(\beta)$ does not fluctuate too much from $\beta_{n,i}$ to $\beta_{n,i+1}$ for $i = 1, \ldots, m$, for any choice of partition points $d_0, \ldots, d_m$. More specifically, for any $\varepsilon > 0$ there exists a positive $\delta$ such that, for $\beta_n = \beta_0 + \delta n^{-1/2}$,
$$\lim_{n \to \infty} P\left\{ \sup_{\beta_n \le \beta^* \le \beta_n + \delta n^{-1/2}} n^{-1/2} \|S_n(\beta^*) - S_n(\beta_n)\| \ge \varepsilon \right\} = 0, \qquad (3.81)$$
for any $|d| < C$. As for the case $\beta \in \mathbb{R}^p$, we consider a $p$-dimensional mesh by allowing $\beta$ to change one coordinate at a time. More specifically, for a fixed $\beta^*$, define $\check\beta_{n,j} = (\beta^{*(1)}, \ldots, \beta^{*(j)}, \beta_n^{(j+1)}, \ldots, \beta_n^{(p)})^{tr}$; then
$$S_n(\beta^*) - S_n(\beta_n) = \sum_{j=0}^{p-1} \left\{ S_n(\check\beta_{n,j+1}) - S_n(\check\beta_{n,j}) \right\}. \qquad (3.82)$$
Therefore
$$\|S_n(\beta^*) - S_n(\beta_n)\| \le p \cdot \max_j \|S_n(\check\beta_{n,j+1}) - S_n(\check\beta_{n,j})\|. \qquad (3.83)$$
On the right-hand side of (3.83), the change occurs only in the $(j+1)$th coordinate, with the other coordinates fixed. Consequently, the uniform convergence for $\beta \in \mathbb{R}^p$ follows once (3.81) is shown for $\beta \in \mathbb{R}$. For the rest of this subsection, assume without loss of generality that $\beta \in \mathbb{R}$. Recall that $V_i = \min(T_i, C_i)$, and define the residuals
$$r_i \equiv r(V_i, \beta, \beta_0, Z_i) = \phi^{-1}(V_i, \beta, Z_i) = J(\varepsilon_i, \beta_0, \beta, Z_i).$$
We can complete the proof of (3.81) by putting a probabilistic bound on the maximum change of $S_n(\beta^*)$ as $\beta^*$ varies from $\beta_n$ to $\beta_n + \delta n^{-1/2}$.
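The rank form of the score, recalled in (3.84) below, can be evaluated directly from data. The following hedged sketch assumes a scalar covariate and the location-shift choice $\phi(u, \beta, z) = u + \beta z$ (illustrative only; the dissertation's $\phi$ is general), under which the at-risk condition $V_j \ge \phi(r_i, Z_j, \beta)$ reduces to a comparison of residuals, and the step-valued score changes sign near $\beta_0$.

```python
import numpy as np

def rank_score(beta, V, Z, delta):
    """Rank form of S_n(beta) as in (3.84), for the assumed example
    phi(u, beta, z) = u + beta*z (so V_j >= phi(r_i, Z_j, beta)
    is equivalent to r_j >= r_i, since phi is increasing in u)."""
    r = V - beta * Z                       # residuals phi^{-1}(V_i, beta, Z_i)
    s = 0.0
    for i in range(len(V)):
        at_risk = r >= r[i]
        s += delta[i] * (Z[i] - Z[at_risk].mean())
    return float(s)

rng = np.random.default_rng(2)
n = 300
Z = rng.normal(size=n)
T = 1.5 * Z + rng.normal(size=n)           # assumed true beta0 = 1.5
C = 1.5 * Z + rng.normal(loc=3.0, size=n)  # random right censoring
V = np.minimum(T, C)
delta = (T <= C).astype(int)

# The score is a step function of beta that changes sign near beta0.
assert rank_score(0.8, V, Z, delta) * rank_score(2.2, V, Z, delta) < 0
```

Because the score depends on the data only through the ranks of the residuals, it is constant between the finitely many values of $\beta$ at which two residuals interchange, which is exactly the structure exploited by Lemma 3.4.6.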
Recall that
$$S_n(\beta) = \sum_{i=1}^{n} \Delta_i \left\{ Z_i - \frac{\sum_{j=1}^{n} Z_j\, I\{V_j \ge \phi(\varepsilon_i, Z_j, \beta)\}}{\sum_{j=1}^{n} I\{V_j \ge \phi(\varepsilon_i, Z_j, \beta)\}} \right\}, \qquad (3.84)$$
so $S_n(\beta^*)$ is a function of the ranks of the residuals $r_i$; hence a change in $S_n(\beta^*)$ occurs whenever the change of $\beta^*$ from $\beta_n$ to $\beta_n + \delta n^{-1/2}$ leads to a change in the ranks of the $r_i$, $i = 1, \ldots, n$. Therefore the maximum change of $S_n(\beta^*)$ can be bounded by computing $L_1 \times L_2$, where
$$L_1 = \#\{\text{pairs of interchanged ranks}\}, \qquad L_2 = \text{the maximum change of } S_n(\beta^*) \text{ for each such interchange}. \qquad (3.85)$$
In Lemma 3.4.6 we investigate the two factors of (3.85); then, using Lemma 3.4.6, we prove the uniform linearity stated in Theorem 3.3.2.

Lemma 3.4.6 Let $T^*$ be a value such that $P\{V_i \ge T^* + \xi\} \ge \psi > 0$. Define
$$B_n \equiv \left\{ X_\infty : \frac{1}{n} \sum_{i=1}^{n} I\{V_i \ge T^* + \xi\} \ge \frac{\psi}{2} \right\}, \qquad (3.86)$$
where $X_\infty = \{X_i\}_{i=1}^n$. Under assumptions (A.4) and (A.8), for any $\varepsilon > 0$ there exists $N_\varepsilon$ such that for any $n > N_\varepsilon$:

(i) $P\{B_n\} > 1 - \varepsilon$;

(ii) for $n > N_\varepsilon$ and $X_\infty \in B_n$, $P\{n L_2 \le 6c/\psi\} \ge 1 - \varepsilon$;

(iii) $L_1 = \sum_{i=1}^{n} \sum_{j \ne i} I(A_{ij})$, where $A_{ij}$ is the event $|V_i - V_j| \le c \cdot \delta n^{-1/2}$.

Proof: Conclusion (i) is guaranteed by the Law of Large Numbers. Next, consider (ii), which shows how to bound $L_2$ in (3.85). Whenever the change of $\beta^*$ from $\beta_n$ to $\beta_n + \delta n^{-1/2}$ causes an interchange in the ranks of the residuals, the interchange must happen between two adjacent order statistics of the $r_i$, $i = 1, \ldots, n$. Let $\{r_{(i)}\}_{i=1}^n$ be the order statistics of $\{r_i\}_{i=1}^n$, and denote the corresponding covariates and failure indicators by $Z_{(i)}(\beta^*)$ and $\Delta_{(i)}(\beta^*)$. Then $S_n(\beta^*)$ can be written as
$$\sum_{i=1}^{n} \Delta_{(i)}(\beta^*) \left\{ Z_{(i)}(\beta^*) - \bar Z_{(i)}(\beta^*) \right\}, \quad \text{where } \bar Z_{(i)}(\beta^*) = \frac{\sum_{k=i}^{n} Z_{(k)}(\beta^*)}{n - i + 1}. \qquad (3.87)$$
Now assume that the change of $\beta^*$ from $\beta_n$ to $\beta_n + \delta n^{-1/2}$ causes an interchange in ranks between two adjacent order statistics $r_{(j)}$ and $r_{(j+1)}$. The new value $S_n^+(\beta^*)$ is
$$\sum_{i=1}^{j-1} \Delta_{(i)}(\beta^*) \{Z_{(i)}(\beta^*) - \bar Z_{(i)}(\beta^*)\} + \Delta_{(j+1)}(\beta^*) \{Z_{(j+1)}(\beta^*) - \bar Z_{(j)}(\beta^*)\}$$
$$+\ \Delta_{(j)}(\beta^*) \left\{ Z_{(j)}(\beta^*) - \frac{\bar Z_{(j+2)}(\beta^*)(n - j - 1) + Z_{(j)}(\beta^*)}{n - j} \right\} + \sum_{i=j+2}^{n} \Delta_{(i)}(\beta^*) \{Z_{(i)}(\beta^*) - \bar Z_{(i)}(\beta^*)\}. \qquad (3.88)$$
Hence the difference of $S_n(\beta^*)$ before and after the interchange of $r_{(j)}$ and $r_{(j+1)}$ is (3.87) minus (3.88), which equals
$$\left\{ \Delta_{(j+1)}(\beta^*) - \Delta_{(j)}(\beta^*) \right\} \left\{ \frac{\bar Z_{(j+2)}(\beta^*)(n - j - 1)}{n - j} - \bar Z_{(j)}(\beta^*) \right\} + \frac{\Delta_{(j+1)}(\beta^*) Z_{(j+1)}(\beta^*)}{n - j} - \frac{\Delta_{(j)}(\beta^*) Z_{(j)}(\beta^*)}{n - j}, \qquad (3.89)$$
where we use the fact that
$$\bar Z_{(j+1)}(\beta^*) = \frac{\sum_{k=j+1}^{n} Z_{(k)}(\beta^*)}{n - j} = \frac{\bar Z_{(j+2)}(\beta^*)(n - j - 1) + Z_{(j+1)}(\beta^*)}{n - j}.$$
Then (3.89) is equal to
$$\begin{cases} \dfrac{Z_{(j+1)} - Z_{(j)}}{n - j} & \text{if } \Delta_{(j)} = \Delta_{(j+1)} = 1, \\[6pt] -\dfrac{Z_{(j)}}{(n - j)(n - j + 1)} + \dfrac{Z_{(j+1)}}{n - j + 1} + \dfrac{\bar Z_{(j+2)}(n - j - 1)}{(n - j)(n - j + 1)} & \text{if } \Delta_{(j)} = 1 \text{ and } \Delta_{(j+1)} = 0, \\[6pt] \dfrac{\bar Z_{(j+1)} - Z_{(j)}}{n - j + 1} & \text{if } \Delta_{(j)} = 0 \text{ and } \Delta_{(j+1)} = 1. \end{cases} \qquad (3.90)$$
In each of the three cases in (3.90), since the $\|Z_i\|$ are bounded by a constant $c$ with probability 1 under assumption (A.4), the change in $S_n(\beta^*)$ is bounded by a fixed multiple of $c/(n - j)$, where $n - j$ is the number of $r_i$'s at risk at the point where the interchange occurs. Consider $X_\infty \in B_n$, on which
$$\frac{1}{n} \sum_{i=1}^{n} I\{V_i \ge T^* + \xi\} \ge \frac{\psi}{2}.$$
Since $S_n(\beta)$ is computed from the $r_i$'s that are less than $\phi^{-1}(T^*, \beta_n, Z_i)$, for $X_\infty \in B_n$ the number of $r_i$'s at risk exceeds $n\psi/2$ whenever an interchange occurs, i.e., $n - j \ge n\psi/2$. Consequently, the change in $S_n(\beta)$ is bounded by $(6c/\psi) n^{-1}$.

Next we find $L_1$ in (3.85), i.e., the number of interchanges as $\beta^*$ varies from $\beta_n$ to $\beta_n + \delta n^{-1/2}$. An interchange between the pair $(i, j)$ occurs at the value $\beta_{ij}$ for which $r_i = r_j$.
Therefore, by assumptions (A.4) and (A.8), an interchange occurring for values of $\beta^*$ between $\beta_n$ and $\beta_n + \delta n^{-1/2}$ implies
$$|V_i - V_j| = \left| \phi(\varepsilon_i, \beta_n, Z_i) - \phi\Big(\varepsilon_i, \beta_n + \frac{\delta}{\sqrt{n}}, Z_j\Big) \right| \le C_2 \cdot \frac{\delta}{\sqrt{n}}. \qquad (3.91)$$
The total number of interchanges equals
$$L_1 = \sum_{i=1}^{n} \sum_{j \ne i} I(A_{ij}),$$
where $A_{ij}$ denotes the event in (3.91). $\square$

Proof of Theorem 3.3.2: In Lemma 3.4.6, we have shown that the maximum change of $S_n(\beta)$ after each interchange in ranks is bounded by $(6c/\psi) n^{-1}$, and that the number of interchanges as $\beta^*$ varies from $\beta_n$ to $\beta_n + \delta n^{-1/2}$ is $L_1$. If we can show that
$$\lim_{n \to \infty} P\{n^{-3/2} L_1 \ge \varepsilon\} = 0 \qquad (3.92)$$
for some properly chosen $\delta > 0$, then the proof is complete, i.e.,
$$\lim_{n \to \infty} P\left\{ \sup_{0 \le \|\beta^* - \beta_n\| \le \delta n^{-1/2}} n^{-1/2} \|S_n(\beta^*) - S_n(\beta_n)\| \ge \varepsilon \right\} = 0.$$
For $1 \le i < j \le n$, let $W_{ij} = I(A_{ij}) + I(A_{ji})$,
$$U_i = \sum_{j \ne i} \{E(W_{ij} \mid V_i) - E(W_{ij})\}, \qquad U_{ij} = W_{ij} - E(W_{ij} \mid V_i) - E(W_{ij} \mid V_j) + E(W_{ij}).$$
Then
$$L_1 - E\{L_1\} = \sum_{i=1}^{n} U_i + \sum_{i < j} U_{ij}.$$

$\ldots > 0$; for positive $\gamma$ that is close to 1, $E\{\bar k(X_i, \beta_0)^{-\gamma}\} < \infty$.

(A.5) For each $x$ and $\boldsymbol{x}$, $\nabla_\beta m_n(x, \boldsymbol{x}, \beta)$ exists for $\beta$ in $U_{\beta_0}$ and is continuous at $\beta_0$.

(A.6) $E\{\nabla_\beta m_n(x, \boldsymbol{x}, \beta_0)\}$ is of full rank $p$.

(A.7) $E[m_n(X, \boldsymbol{X}, \beta_0)\, m_n^{tr}(X, \boldsymbol{X}, \beta_0)]$ is positive definite.

(A.8) There exists a constant $M$ such that the $j$th component $\eta_j$ of $\eta = \eta(x, \boldsymbol{X}, \beta_0)$ in (4.10) satisfies $|\eta_j| \le M$, and
$$E\left\{ e^{|\eta_j|/M} - 1 - \frac{|\eta_j|}{M} \right\} M^2 \le \frac{1}{2}\, var(\eta_j).$$

(A.9) There exists $M$ such that for any $\beta$, $\|k(x, y, \beta)\| < M$.

(A.10) Let $\alpha = (x, \beta) \in \mathbb{R}^d \times U_{\beta_0}$ with the Euclidean norm, and write $k(x, y, \beta) = k_\alpha(y)$. There exists a measurable function $b(y) : \mathbb{R}^d \mapsto \mathbb{R}$ such that for any $\alpha_1 \ne \alpha_2$, $|k_{\alpha_1}(y) - k_{\alpha_2}(y)| \le b(y) \|\alpha_1 - \alpha_2\|$, and $E\{|b(X_1)|\} < \infty$.

(A.11) Let $\gamma > 0$ be close to 1; then $E\{\bar k(X_1, \beta_0)^{-\gamma}\} < \infty$.
4.2 Lemmas Parallel to Owen, and Qin and Lawless

In this section, we provide lemmas parallel to Lemmas 11.2 through 11.4 in Chapter 11 of [32], which Owen used to establish the Nonparametric Maximum Empirical Likelihood Estimator (NPMELE) and the Wilks-type theorem when the dimension $r$ of the estimating function equals the dimension $p$ of the structural parameter. Qin and Lawless also applied these lemmas when generalizing Owen's work to the case $r > p$. We will show that, under regularity conditions, even though the estimating functions of Extended Estimating Equations (EEE), denoted $m_n(X_i, \boldsymbol{X}, \beta)$, are no longer i.i.d., parallel versions of these lemmas continue to hold. The way to overcome the loss of independence in EEE is to draw on tools from empirical process theory. Compare the following two expressions:
$$m_n(X_i, \boldsymbol{X}, \beta) = Q(X_i, \beta) \left\{ C(X_i) - \frac{\sum_{j=1}^{n} C(X_j)\, k(X_i, X_j, \beta)}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} \right\} \qquad (4.12)$$
versus
$$V(X_i, \beta) = Q(X_i, \beta) \left\{ C(X_i) - \frac{E\{C(Y)\, k(X_i, Y, \beta) \mid X_i\}}{E\{k(X_i, Y, \beta) \mid X_i\}} \right\}. \qquad (4.13)$$
Applying the LLN to the numerator and denominator in equation (4.12), then applying Slutsky's lemma, we see that (4.12) and (4.13) are close when $n$ is large. In order to exploit the similarity of (4.12) and (4.13) uniformly over the index $i$, we need to show that the higher-order summations in the former converge to the corresponding terms in the latter in probability, uniformly in $X_i$ and $\beta \in U_{\beta_0}$. More precisely:

Lemma 4.2.1 Suppose $k(x, y, \beta) : \mathbb{R}^d \times \mathbb{R}^d \times U_{\beta_0} \mapsto \mathbb{R}^+$ and $C(x) : \mathbb{R}^d \mapsto \mathbb{R}^p$ satisfy assumptions (A.1), (A.2), (A.9) and (A.10), where $U_{\beta_0}$ is an open and bounded set in $\Theta$ that contains the true parameter value $\beta_0$.
Then
$$\sup_{(x, \beta) \in \mathcal{X} \times U_{\beta_0}} \left| \frac{1}{n} \sum_{j=1}^{n} k(x, X_j, \beta) - E\{k(x, X, \beta)\} \right| \xrightarrow{P} 0 \qquad (4.14)$$
and
$$\sup_{(x, \beta) \in \mathcal{X} \times U_{\beta_0}} \left| \frac{1}{n} \sum_{j=1}^{n} C(X_j)\, k(x, X_j, \beta) - E\{C(X)\, k(x, X, \beta)\} \right| \xrightarrow{P} 0. \qquad (4.15)$$
The proof of the lemma involves constructing Glivenko-Cantelli classes and applying the parametric classes discussed by van der Vaart in Chapter 19 of [41]. The following proposition is parallel to Lemma 11.2 in Chapter 11 of [32].

Proposition 4.2.1 Let $Z^*_n(\boldsymbol{X}, \beta) = \max_{1 \le i \le n} \|m_n(X_i, \boldsymbol{X}, \beta)\|$. Under assumptions (A.3), (A.4) and (A.6), for any fixed $\beta \in \Theta$,
$$Z^*_n(\boldsymbol{X}, \beta) = o(n^{1/2}). \qquad (4.16)$$
To prove Proposition 4.2.1, we decompose $m_n(X_i, \boldsymbol{X}, \beta)$ into an i.i.d. part and a non-i.i.d. quotient part; the conclusion follows by applying Lemma 11.2 of [32] to the two parts separately. The following two lemmas play roles equivalent to the Law of the Iterated Logarithm (LIL) in [32]. They will be applied to prove the EEE version of Lemma 11.4 in [32].

Lemma 4.2.2 Let $0 < \delta < 1/2$. Then under assumptions (A.2), (A.4), (A.5), (A.8), (A.9) and (A.10),
$$n^{-1/2} \sum_{i=1}^{n} m_n(X_i, \boldsymbol{X}, \beta_0) = O_P(n^\delta). \qquad (4.17)$$

Lemma 4.2.3 Let $0 < \delta < 1/2$. Then under assumptions (A.2), (A.4), (A.5), (A.8), (A.9) and (A.10),
$$\frac{1}{n} \sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta_0)\|^2 = O(n^\delta). \qquad (4.18)$$
The proofs of Lemmas 4.2.2 and 4.2.3 involve two steps. First, $m_n(X_i, \boldsymbol{X}, \beta_0)$ is split into the sum of an i.i.d. part and a higher-order summation quotient part. The magnitude of the i.i.d. term is given by the LIL, and the quotient is bounded by Bernstein's inequality, listed as a proposition in Section 4.3. We also prove both lemmas first for univariate $\beta$, then generalize the conclusion to the multivariate case. With the conclusion of Lemma 4.2.3, we present Proposition 4.2.2, which is parallel to Lemma 11.3 in Chapter 11 of [32].

Proposition 4.2.2 Under the assumptions of Lemma 4.2.3,
$$n^{-1} \sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta_0)\|^3 = o(n^{1/2}). \qquad (4.19)$$
The proof of Proposition 4.2.2 is a direct application of Proposition 4.2.1 and Lemma 4.2.3.
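The uniform convergence asserted in Lemma 4.2.1 can be visualized numerically. The sketch below uses the Lipschitz-in-parameter kernel $k_\alpha(y) = e^{-\beta (y - x)^2}$ with $Y \sim N(0, 1)$ (an illustrative class, not the dissertation's $k$), for which $E\, k_\alpha(Y) = (1 + 2\beta)^{-1/2} \exp\{-\beta x^2 / (1 + 2\beta)\}$ in closed form, and checks that the supremum over a grid of $\alpha = (x, \beta)$ of the empirical-versus-true gap shrinks as $n$ grows.

```python
import numpy as np

# Hedged illustration of the Glivenko-Cantelli behaviour in Lemma 4.2.1,
# for the assumed parametric class k_alpha(y) = exp(-beta*(y-x)^2).
rng = np.random.default_rng(3)

def sup_gap(n):
    Y = rng.normal(size=n)
    gap = 0.0
    for x in np.linspace(-1.0, 1.0, 21):
        for b in np.linspace(0.5, 1.5, 11):
            emp = np.exp(-b * (Y - x) ** 2).mean()
            tru = np.exp(-b * x * x / (1 + 2 * b)) / np.sqrt(1 + 2 * b)
            gap = max(gap, abs(emp - tru))
    return gap

# the supremum of the empirical error over the parameter grid shrinks with n
assert sup_gap(50_000) < sup_gap(500)
```

The class is uniformly bounded and Lipschitz in $(x, \beta)$ on the bounded grid, matching the structure of assumption (A.10).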
4.3 Some Proofs

Proof of Lemma 4.2.1: The convergence of (4.14) and (4.15) for fixed $(x, \beta)$ is guaranteed by the LLN, so the main concern is to prove uniformity in the parameter $(x, \beta)$. Let $\alpha = (x, \beta)$, $k_\alpha(y) = k(x, y, \beta)$ and $\mathcal{K} = \{k_\alpha(y), \alpha \in \mathcal{X} \times U_{\beta_0}\}$, where $\mathcal{X}$ and $U_{\beta_0}$ are bounded subsets of $\mathbb{R}^d$ and $\mathbb{R}^p$ as in Definition 1.2.1. Under assumption (A.10), $\mathcal{K}$ forms a parametric class in the sense of Example 19.7 on page 271 of [41], which refers to a class of functions Lipschitz in a finite-dimensional parameter on a bounded region, while the function domain may be unbounded. Therefore $\mathcal{K}$ is a Donsker class and, as in Theorem A.0.1, also a Glivenko-Cantelli class. Hence the convergence in (4.14) is uniform in $(x, \beta)$. As for (4.15), note that for $\mathcal{K}_C = \{C(y) k_\alpha(y), \alpha \in \mathcal{X} \times U_{\beta_0}\}$ and any $\alpha_1, \alpha_2 \in \mathcal{X} \times U_{\beta_0}$,
$$\|C(y) k_{\alpha_1}(y) - C(y) k_{\alpha_2}(y)\| \le \|C(y)\| \cdot \|k_{\alpha_1}(y) - k_{\alpha_2}(y)\|. \qquad (4.20)$$
Under assumption (A.10), the right-hand side of (4.20) is bounded by $\|C(y)\| \cdot b(y) \|\alpha_1 - \alpha_2\|$, and since $\|C(y)\|$ is bounded under assumption (A.2),
$$E\big|\, b(Y) \cdot \|C(Y)\| \,\big| < \infty. \qquad (4.21)$$
Therefore $\mathcal{K}_C$ also forms a parametric class in the sense of Theorem A.0.2 of Appendix A, and by the same reasoning used for (4.14), the convergence in (4.15) is also uniform in $(x, \beta) \in \mathcal{X} \times U_{\beta_0}$. $\square$

Proof of Proposition 4.2.1: Let $Z^*_n(\boldsymbol{X}, \beta) = \max_{1 \le i \le n} \|m_n(X_i, \boldsymbol{X}, \beta)\|$. Since
$$m_n(X, \boldsymbol{X}, \beta) = Q(X, \beta) \left\{ C(X) - \frac{\sum_{j=1}^{n} C(X_j) k(X, X_j, \beta)}{\sum_{j=1}^{n} k(X, X_j, \beta)} \right\},$$
for a fixed $\beta$, $Z^*_n(\boldsymbol{X}, \beta)$ is bounded by the sum of $A$ and $B$, where
$$A = \max_i \|Q(X_i, \beta) C(X_i)\|, \qquad B = \max_i \left\| Q(X_i, \beta)\, \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta)}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} \right\|.$$
By Lemma 11.2 in [32], under assumption (A.6), $A = o(n^{1/2})$. As for $B$, it is bounded by
$$\max_i |Q(X_i, \beta)| \cdot \max_i \left\| \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta)}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} \right\|.$$
Since the $k(\cdot, \cdot, \beta)$'s are assumed nonnegative in assumption (A.4),
$$\max_i \left\| \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta)}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} \right\| \le \max_i \frac{\sum_{j=1}^{n} k(X_i, X_j, \beta) \cdot \max_j \|C(X_j)\|}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} = \max_j \|C(X_j)\|.$$
Then, under the boundedness assumption in (A.2),
$$B \le \max_i |Q(X_i, \beta)| \max_i \|C(X_i)\| = o(n^{1/2}).$$
Therefore for any fixed $\beta \in U_{\beta_0}$, $Z^*_n(\boldsymbol{X}, \beta) = o(n^{1/2})$. $\square$

Before giving the proof of Lemma 4.2.2, we state Bernstein's inequality. This is a well-known theorem and can be found in references such as [42].

Proposition 4.3.1 (Bernstein's inequality) Let $X_1, \ldots, X_n$ be independent variables with zero mean such that $E|X_i|^m \le m!\, M^{m-2} v_i / 2$ for every $m \ge 2$, all $i$, and some constants $M$ and $v_i$. Then
$$P(|X_1 + \cdots + X_n| > x) \le 2 \exp\left\{ -\frac{1}{2} \cdot \frac{x^2}{v + Mx} \right\},$$
for $v \ge v_1 + \cdots + v_n$.

Proof of Lemma 4.2.2: Let us start with the univariate case $\beta_0 \in \Theta \subset \mathbb{R}$ and $C(x) : \mathbb{R}^d \mapsto \mathbb{R}$. We can split $\sqrt{n}\, \bar m_n(\boldsymbol{X}, \beta_0)$ as the difference of the two terms
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Q(X_i, \beta_0) \left\{ C(X_i) - \frac{k_c(X_i, \beta_0)}{\bar k(X_i, \beta_0)} \right\} \qquad (4.22)$$
and
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Q(X_i, \beta_0) \left\{ \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0)}{\sum_{j=1}^{n} k(X_i, X_j, \beta_0)} - \frac{k_c(X_i, \beta_0)}{\bar k(X_i, \beta_0)} \right\}. \qquad (4.23)$$
Note that (4.22) is a sum of i.i.d. terms; therefore, under assumption (A.5), it is $O(\sqrt{\ln\ln n})$ by the LIL, and asymptotically normal by the CLT. Next we evaluate the order of magnitude of (4.23). To begin with, rewrite (4.23) as
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)}{\bar k(X_i, \beta_0) \sum_{j=1}^{n} k(X_i, X_j, \beta_0)} \left\{ \bar k(X_i, \beta_0) \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \sum_{j=1}^{n} k(X_i, X_j, \beta_0) \right\}. \qquad (4.24)$$
Then we can split (4.24) into the difference of the following two terms:
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)}{\frac{1}{\sqrt{n}} \sum_{j=1}^{n} k(X_i, X_j, \beta_0)} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \big( C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big) \tag{4.24.a}$$
and
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)\, k_c(X_i, \beta_0)}{\bar k(X_i, \beta_0) \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{n} k(X_i, X_j, \beta_0)} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \big( k(X_i, X_j, \beta_0) - \bar k(X_i, \beta_0) \big). \tag{4.24.b}$$
Let us consider the following term in the numerator of (4.24.a):
$$\frac{1}{\sqrt{n}} \sum_{j=1}^{n} \big\{ C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big\} = \frac{1}{\sqrt{n}} \sum_{j : j \ne i} \big\{ C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big\} + \frac{1}{\sqrt{n}} \big\{ C(X_i) k(X_i, X_i, \beta_0) - k_c(X_i, \beta_0) \big\}. \tag{4.24.a.1}$$
Note that by Proposition 4.2.1, for any $i = 1, \ldots, n$, the norm of the second term in equation (4.24.a.1) is bounded by
$$\frac{1}{\sqrt{n}} \max_i \|C(X_i) k(X_i, X_i, \beta_0) - k_c(X_i, \beta_0)\| = \frac{1}{\sqrt{n}} \cdot o(n^{1/2}) = o(1).$$
Therefore, for $i = 1, \ldots, n$, we can rewrite (4.24.a.1) as
$$\frac{1}{\sqrt{n}} \sum_{j : j \ne i} \big\{ C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big\} + o(1). \tag{4.24.a.2}$$
Then, substituting (4.24.a.2) for (4.24.a.1), we know that (4.24.a) is equal to
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)}{\frac{1}{\sqrt{n}} \sum_{j=1}^{n} k(X_i, X_j, \beta_0)} \left\{ \frac{1}{\sqrt{n}} \sum_{j : j \ne i} \big( C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big) + o(1) \right\}. \tag{4.24.a.3}$$
By (4.8),
$$E\{k_c(X_i, \beta_0)\} = E\{C(X_j) k(X_i, X_j, \beta_0)\},$$
so the terms inside the summation of (4.24.a.2) have expectation zero. Moreover, conditioned on $X_i$, (4.24.a.2) is a sum of i.i.d. terms with zero mean; under assumption (A.8) we can apply Bernstein's inequality in Proposition 4.3.1 with the choice $x = k_n = c n^{(\delta+1)/2}$. Since $\delta$ is between 0 and 1/2,
$$\frac{k_n^2}{n\sigma + M k_n} = \frac{n^{1+\delta}}{n\sigma + M n^{(1+\delta)/2}} \sim \frac{n^\delta}{\sigma} \quad \text{as } n \to \infty,$$
where $\sigma = \sup_{x \in \mathcal{X}} var\{C(X_1) k(x, X_1, \beta_0) - \bar k_c(x, \beta_0)\}$, and $\sim$ means that the ratio of the two expressions converges to 1. Hence by Proposition 4.3.1, for $i = 1, \ldots, n$,
$$P\left\{ \Big| \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \Big| \ge k_n \,\Big|\, X_i \right\} \le 2 e^{-c n^\delta}.$$
Putting together all such sets for different $i$, we get
$$P\left\{ \max_i \Big| \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \Big| \ge k_n \right\} \le 2 n e^{-c n^\delta}.$$
Therefore
$$\sum_{n=1}^{\infty} P\left\{ \max_i \frac{1}{\sqrt{n}} \Big| \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \Big| \ge \frac{k_n}{\sqrt{n}} \right\} \le \sum_{n=1}^{\infty} 2 n e^{-c n^\delta} < \infty. \qquad (4.25)$$
By the Borel-Cantelli lemma, the inequality above implies that for all $n$ sufficiently large,
$$\max_i \frac{1}{\sqrt{n}} \Big| \sum_{j \ne i} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \Big| \le \frac{k_n}{\sqrt{n}} = c n^\delta, \quad a.s. \qquad (4.26)$$
Now, let us discuss the denominator in (4.24.a), namely
$$\frac{1}{\sqrt{n}} \sum_{j=1}^{n} k(X_i, X_j, \beta_0). \qquad (4.27)$$
In the following we show that, under the assumptions made so far together with assumption (A.11), (4.27) is $O(n^\delta)$. Split (4.27) into
$$\frac{1}{\sqrt{n}} \sum_{j : i \in J} k(X_i, X_j, \beta_0) + \frac{1}{\sqrt{n}} \sum_{j : i \notin J} k(X_i, X_j, \beta_0), \qquad (4.28)$$
where $J \equiv \{\bar k(X_i, \beta_0) < c' n^{(\delta-1)/2}\}$.
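Bernstein's inequality (Proposition 4.3.1), used repeatedly above, can be checked numerically. The sketch below assumes centered Uniform$(-1, 1)$ summands, for which the moment condition holds with $M = 1$ and $v_i = 1/3$, and verifies that the empirical tail probability of $|X_1 + \cdots + X_n|$ never exceeds the Bernstein bound.

```python
import numpy as np

# Hedged numeric check of Bernstein's inequality (Proposition 4.3.1).
# For X_i ~ Uniform(-1,1): E|X|^m = 1/(m+1) <= m! * 1^(m-2) * (1/3) / 2,
# so the moment condition holds with M = 1 and v_i = 1/3.
rng = np.random.default_rng(4)
n, reps = 200, 20_000
S = rng.uniform(-1, 1, size=(reps, n)).sum(axis=1)
v, M = n / 3.0, 1.0

for x in (10.0, 15.0, 20.0):
    bound = 2 * np.exp(-0.5 * x * x / (v + M * x))
    emp = (np.abs(S) > x).mean()       # Monte Carlo tail probability
    assert emp <= bound
```

The bound is loose for moderate deviations and sharpens in the exponent for large $x$, which is exactly the regime exploited in the choice of $k_n$ above.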
With the same argument used to deduce (4.25), by Bernstein's inequality, for $\delta^* = \delta/3$,
$$\max_i \Big| \frac{1}{\sqrt{n}} \sum_{j : j \ne i} k(X_i, X_j, \beta) - \bar k(X_i, \beta_0) \Big| \le c n^{\delta^*/2}, \quad a.s., \qquad (4.29)$$
which implies that we can rewrite (4.28) as
$$\frac{1}{\sqrt{n}} \sum_{J} k(X_i, X_j, \beta_0) + I\{\bar k(X_i, \beta_0) \ge c' n^{(\delta-1)/2}\} \cdot \{\bar k(X_i, \beta_0) + O(n^{(\delta^*-1)/2})\}. \qquad (4.30)$$
Next, we discuss the order of magnitude of the two terms in (4.30). Since $k(\cdot, \cdot, \cdot)$ is assumed bounded by $M$ in assumption (A.9), and the first term in (4.30) is symmetric with respect to $i$, the first term in (4.30) is bounded in expectation by
$$\frac{2M}{\sqrt{n}}\, E\{\#\{i \in J\}\} = 2M \sqrt{n}\, P\{\bar k(X_i, \beta_0) < c' n^{(\delta-1)/2}\} = 2M \sqrt{n}\, P\{\bar k(X_i, \beta_0)^{-\gamma} > c'^{-\gamma} n^{\gamma(1-\delta)/2}\} \le 2M c'^{\gamma} E\{\bar k(X_i, \beta_0)^{-\gamma}\}\, n^{\frac{1}{2} - \frac{\gamma(1-\delta)}{2}} \le n^\delta, \qquad (4.31)$$
where $\gamma$ is the positive constant close to 1 defined in (A.11) and the inequality is attained by Chebyshev's inequality. So in (4.30), the first term is $O_P(n^\delta)$. Combining (4.26)-(4.31), we see that (4.24.a.3) is bounded in probability by
$$\frac{2}{n} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)}{O_P(n^\delta) + I\{\bar k(X_i, \beta_0) \ge c' n^{(\delta-1)/2}\}\, \bar k(X_i, \beta_0) + O(n^{(\delta^*-1)/2})} \cdot n^\delta. \qquad (4.32)$$
With the choice $\delta^* = \delta/3$, by the LIL for i.i.d. summands, the order of the average in (4.32) is $O(\sqrt{\ln\ln n / n})$. So far we have shown that (4.24.a.3) is $O(n^\delta)$; since (4.24.a.3) equals (4.24.a), the latter is also $O(n^\delta)$. Using the same strategy, we can also prove that (4.24.b) is $O(n^\delta)$. Therefore $\sqrt{n}\, \bar m_n(\boldsymbol{X}, \beta_0) = O(n^\delta)$.

Next, we generalize our conclusion to vector-valued $\beta_0$ and $C(x)$. It suffices to show that (4.26) holds for $\beta_0 \in \mathbb{R}^p$ and $C(x) : \mathbb{R}^d \mapsto \mathbb{R}^p$. Define the $p$-dimensional vectors $\eta_j = (\eta_{j1}, \ldots, \eta_{jp})^{tr}$, where
$$\eta_j = C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0), \quad j = 1, \ldots, n,\ i \ne j.$$
Then by (4.26), for each component $\eta_{jk}$ of $\eta_j$,
$$P\left\{ \limsup_{n \to \infty} \frac{1}{\sqrt{n}} \sum_{j : j \ne i} \eta_{jk} \ge \frac{k_n}{\sqrt{n}} \right\} = 0, \quad k = 1, \ldots, p. \qquad (4.33)$$
Let $t = (t_1, \ldots, t_p)^{tr}$ be a unit vector in $\mathbb{R}^p$; then
$$\Big| \sum_{j : j \ne i} t^{tr} \eta_j \Big| = \Big| \sum_{k=1}^{p} t_k \sum_{j : j \ne i} \eta_{jk} \Big| \le \sum_{k=1}^{p} |t_k| \Big| \sum_{j : j \ne i} \eta_{jk} \Big|, \qquad (4.34)$$
where we attain the inequality using the triangle inequality. Since $\|t\| = 1$, $|t_k| \le 1$ for $k = 1, \ldots, p$, so together with (4.34),
$$\Big| \sum_{j : j \ne i} t^{tr} \eta_j \Big| \le \sum_{k=1}^{p} \Big| \sum_{j : j \ne i} \eta_{jk} \Big|. \qquad (4.35)$$
Note that (4.35) holds for any unit vector in $\mathbb{R}^p$, which indicates that
$$P\left\{ \sup_{t \in \mathbb{R}^p, \|t\| = 1} \Big| \sum_{j : j \ne i} t^{tr} \eta_j \Big| \ge k_n \right\} \le P\left\{ \sum_{k=1}^{p} \Big| \sum_{j : j \ne i} \eta_{jk} \Big| \ge k_n \right\} \le p \max_k P\left\{ \Big| \sum_{j : j \ne i} \eta_{jk} \Big| \ge \frac{k_n}{p} \right\}. \qquad (4.36)$$
Combining (4.33) and (4.36), we know that
$$P\left\{ \limsup_{n \to \infty} \sup_{t \in \mathbb{R}^p, \|t\| = 1} \frac{1}{\sqrt{n}} \Big| \sum_{j : j \ne i} t^{tr} \eta_j \Big| \ge \frac{k_n}{\sqrt{n}} \right\} = 0. \qquad (4.37)$$
$\square$

Proof of Lemma 4.2.3: Note that we can split the left-hand side of (4.18) as the sum of the following three expressions:
$$\frac{1}{n} \sum_{i=1}^{n} Q(X_i, \beta_0)^2\, C(X_i)^{tr} C(X_i) = O\left( \sqrt{\frac{\ln\ln n}{n}} \right), \tag{4.18.a}$$
$$-\frac{2}{n} \sum_{i=1}^{n} Q(X_i, \beta_0)^2\, C(X_i)^{tr}\, \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0)}{\sum_{j=1}^{n} k(X_i, X_j, \beta_0)} = O(n^\delta), \tag{4.18.b}$$
and
$$\frac{1}{n} \sum_{i=1}^{n} Q(X_i, \beta_0)^2\, \frac{\left[ \frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) \right]^{tr} \left[ \frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) \right]}{\left[ \frac{1}{n} \sum_{j=1}^{n} k(X_i, X_j, \beta_0) \right]^2}, \tag{4.18.c}$$
where we get the orders of (4.18.a) and (4.18.b) by applying the LIL and the same strategy used in the proof of (4.26), respectively. Next, we investigate the order of (4.18.c). By the proof of Lemma 4.2.2, an almost sure upper bound for (4.18.c) is
$$\frac{4}{n} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)^2}{\bar k(X_i, \beta_0)^2} \left[ \frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) \right]^{tr} \left[ \frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) \right]. \qquad (4.38)$$
Similar to the proof of Lemma 4.2.2, let us first assume that $C(X)$ and $\beta_0$ are scalar-valued, then generalize the conclusion to the vector-valued case. Applying the same method by which equation (4.24.a.2) was obtained, we can show that for $i = 1, \ldots, n$,
$$\frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) = \frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) + o(1).$$
Next, we show that for $k_n = O(n^{\frac{\delta}{2} + \frac{5}{4}})$,
$$\frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \le \frac{k_n}{n}, \quad a.s., \ \text{for } i = 1, \ldots, n.$$

Case 1: If $C(X) k(x, X, \beta_0)$ is centered at 0 under $\beta_0$, let $k_n = O(n^{\frac{\delta}{2} + \frac{5}{4}})$ be the constant mentioned in Bernstein's inequality. Since $\delta$ is between 0 and 1/2,
$$O\left( \frac{x^2}{n\sigma + Mx} \right) = O\left( \frac{n^{\delta + \frac{5}{2}}}{n\sigma + M n^{\frac{\delta}{2} + \frac{5}{4}}} \right) = O\left( n^{\frac{\delta}{2} + \frac{5}{4}} \right).$$
Hence by Proposition 4.3.1, for $i = 1, \ldots, n$,
$$P\left\{ \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \ge k_n \,\Big|\, X_i \right\} \le e^{-k_n^2/(n\sigma + M k_n)} = e^{-c n^{\frac{\delta}{2} + \frac{5}{4}}}.$$
Putting together all such sets for $i = 1, \ldots, n$, we get
$$P\left\{ \max_i \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \ge k_n \right\} \le n\, e^{-c n^{\frac{\delta}{2} + \frac{5}{4}}}.$$
Therefore
$$\sum_{n=1}^{\infty} P\left\{ \max_i \frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \ge \frac{k_n}{n} \right\} \le \sum_{n=1}^{\infty} n\, e^{-c n^{\frac{\delta}{2} + \frac{5}{4}}} < \infty.$$
By the Borel-Cantelli lemma, the inequality above indicates that
$$P\left\{ \limsup_{n \to \infty} \frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \ge \frac{k_n}{n} \right\} = 0, \qquad (4.39)$$
which shows that
$$\frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) = O\left( \frac{k_n}{n} \right) = O\left( n^{\frac{\delta}{2} + \frac{1}{4}} \right),$$
and that
$$\frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \le \frac{k_n}{n}, \quad a.s. \qquad (4.40)$$

Case 2: If $C(X) k(x, X, \beta_0)$ is centered at $E\{C(X) k(x, X, \beta_0)\} = \mu(x, \beta_0) \ne 0$, then by (4.39),
$$\frac{1}{n} \sum_{j : j \ne i} \{C(X_j) k(X_i, X_j, \beta_0) - \mu(X_i, \beta_0)\} = O\left( \frac{k_n}{n} \right) = O\left( n^{\frac{\delta}{2} + \frac{1}{4}} \right), \qquad (4.41)$$
and
$$\frac{1}{n} \sum_{j : j \ne i} \{C(X_j) k(X_i, X_j, \beta_0) - \mu(X_i, \beta_0)\} \le \frac{k_n}{n}, \quad a.s. \qquad (4.42)$$
From (4.41) and (4.42), we know that
$$\frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) = \mu(X_i, \beta_0) + O\left( n^{\frac{\delta}{2} + \frac{1}{4}} \right) = O\left( n^{\frac{\delta}{2} + \frac{1}{4}} \right) \qquad (4.43)$$
and
$$\frac{1}{n} \sum_{j \ne i} C(X_j) k(X_i, X_j, \beta_0) \le \frac{k_n + n \mu(X_i, \beta_0)}{n} = O\left( \frac{k_n}{n} \right), \quad a.s. \qquad (4.44)$$
Plugging (4.44) into (4.38) yields an almost sure upper bound for (4.18.c):
$$\frac{4}{n} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)^2}{\bar k(X_i, \beta_0)^2} \cdot \frac{k_n}{n} \cdot \frac{k_n}{n} = O\left( \sqrt{\frac{\ln\ln n}{n}} \right) \cdot O\left( \frac{n^{\frac{\delta}{2} + \frac{5}{4}}}{n} \right) \cdot O\left( \frac{n^{\frac{\delta}{2} + \frac{5}{4}}}{n} \right) = O(n^\delta). \qquad (4.45)$$
Therefore $\sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta)\|^2 / n$ is also $O(n^\delta)$. This conclusion can be extended to $\beta_0 \in \mathbb{R}^p$ using the same strategy as in the proof of Lemma 4.2.2. $\square$

Proof of Proposition 4.2.2: Since $Z^*_n(\boldsymbol{X}, \beta_0) = \max_{1 \le i \le n} \|m_n(X_i, \boldsymbol{X}, \beta_0)\|$, write
$$\frac{1}{n} \sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta)\|^3 \le Z^*_n(\boldsymbol{X}, \beta_0) \cdot \frac{1}{n} \sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta_0)\|^2. \qquad (4.46)$$
By Proposition 4.2.1, $Z^*_n(\boldsymbol{X}, \beta_0) = o(n^{1/2})$. By Lemma 4.2.3, the second factor on the right-hand side of (4.46) is $O(n^\delta)$ with $\delta$ between 0 and 1/2.
Therefore

    (1/n) Σ_{i=1}^n ‖m_n(X_i, X, β0)‖^3 = o(n^{1/2}).  □

Chapter 5: Technical Results II: Zero in the Convex Hull Theorems

In this chapter, we continue to discuss the technical results for Classic Estimating Equations (CEE) and Extended Estimating Equations (EEE), with the same notation and assumptions as in Chapter 4, pages 65–68. We define a neighbourhood of β0 such that for all β in this neighbourhood, the Empirical Likelihood (EL) method has a unique maximizer with probability approaching 1.

Under the CEE setting, consider random samples X_1, …, X_n from the distribution family P_θ, where θ = (β, λ) ∈ Θ × H, Θ ⊂ R^p and H is infinite-dimensional. Using the Lagrange multiplier method, it can be shown that for fixed β, if there exists a unique solution to

    max_p Π_{i=1}^n p_i,  where p = (p_1, …, p_n),
    subject to Σ_{i=1}^n p_i = 1, p_i ∈ (0, 1), Σ_{i=1}^n p_i m(X_i, β) = 0,  (5.1)

then the solution is given by

    p̂_i(β, X_i, X) = (1/n) · 1/(1 + t^tr m(X_i, β)),  where t = t(β, X) solves Σ_{i=1}^n m(X_i, β)/(1 + t^tr m(X_i, β)) = 0.  (5.2)

A typical interpretation of p̂_i(β, X_i, X) is that the distribution of X_1 is approximated by the modified empirical measure Σ_{i=1}^n p_i δ_{X_i}. This is why the condition

    E_{θ0}{ m(X_1, β0) } = 0

is rendered through the approximating distribution of X_1 as Σ_{i=1}^n p_i m(X_i, β0) = 0, connecting the p_i's and β.

Owen in [32], and Qin and Lawless in [33], claimed that for a fixed β, a sufficient condition for problem (5.2) to have a locally unique solution is "zero in the convex hull", i.e., for fixed β ∈ U_{β0}, 0 ∈ Conv(β) ⊂ R^p, where

    Conv(β) ≡ { Σ_{i=1}^n p_i m(X_i, β) : Σ_{i=1}^n p_i = 1, p_i ∈ (0, 1) },  for β ∈ U_{β0}.  (5.3)

However, they did not state explicitly how the set U_{β0} in (5.3) is constructed, nor whether (5.3) is a deterministic fact or an asymptotic result. In the following section, we answer these two questions. Then we generalize the conclusions to discontinuous estimating functions m(x, ·).
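The "zero in the convex hull" condition can be probed numerically through the separating-hyperplane characterization: 0 lies in the interior of the convex hull of m(X_1, β), …, m(X_n, β) only if no unit vector t satisfies max_i t^tr m(X_i, β) < 0. The sketch below (not part of the dissertation's development) samples random directions as a necessary-condition check; a negative return value is a certificate that 0 is outside the hull, while a positive value over many sampled directions is only heuristic evidence of containment. The toy centered sample stands in for the m(X_i, β).

```python
import numpy as np

def zero_in_hull_check(m, n_dirs=2000, seed=None):
    """Randomized separating-hyperplane check: returns
    min over sampled unit directions t of max_i t'm_i.
    A negative value certifies that a separating hyperplane exists
    (so 0 is NOT in the convex hull of the rows of m)."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m, dtype=float)                 # shape (n, p)
    t = rng.normal(size=(n_dirs, m.shape[1]))
    t /= np.linalg.norm(t, axis=1, keepdims=True)  # random unit directions
    return float((t @ m.T).max(axis=1).min())

rng = np.random.default_rng(0)
# centered sample: E m(X, beta0) = 0, so 0 should be well inside the hull
centered = rng.normal(size=(200, 2))
# shifted sample: mean far from 0, so a separating hyperplane exists
shifted = centered + 5.0

assert zero_in_hull_check(centered, seed=1) > 0   # no separator found
assert zero_in_hull_check(shifted, seed=1) < 0    # separator found near -mean/|mean|
```

The check is one-sided by design: certifying interior membership exactly would require a linear program over the weights p_i, but the directional form above mirrors the separating-hyperplane argument used later in the proofs.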
In Section 5.2, we prove the existence and uniqueness of solutions to (5.1) under the EEE setting with an estimating function m_n(x, x, β) that is continuous with respect to β. For the discontinuous case, we prove the conclusion for the ϕ-transformation model, which to our knowledge is the broadest class of semiparametric models satisfying the EEE definitions.

5.1 Classic Estimating Equation

In this section, under the CEE setting, we establish the uniqueness of solutions to the maximization problem (5.1), initially for continuous estimating functions, then for discontinuous ones.

5.1.1 Continuous Criterion Function

We prove the existence and uniqueness of solutions to (5.1) in the following steps. First, we demonstrate that (5.3) is true when β = β0 and that p̂_i(β0, X_i, X) is the calculus maximizer of (5.1). Then, using the continuity of m(x, ·), we apply Rolle's theorem to the gradient of the Lagrangian of the negative logarithm of the Profile Empirical Likelihood (pEL) function to prove the uniqueness of the solution to the second equation in (5.2). Finally, combining the results of the previous two steps with a continuation method, we prove that there exists a neighbourhood of β0 on which (5.1) has a calculus maximum with probability approaching 1.

Let us start with the first step described in the previous paragraph, namely showing that 0 ∈ Conv(β0) with probability approaching 1.

Lemma 5.1.1 For i.i.d. random variables X_1, …, X_n, assume that m(x, ·) is continuous, and

    E{ m(X_1, β0) } = 0;  (5.4)
    E{ m(X_1, β0)^{⊗2} } = Σ is positive definite;  (5.5)
    E ‖m(X_1, β0)‖^3 < ∞.  (5.6)

For any constant K > 0, let X = {X_i}_{i=1}^∞, and let t0 be a unit vector in R^p.
(a) Let λ0 be the smallest eigenvalue of Σ, and let I ∈ R^{p×p} be the identity matrix. Define

    A^{(1)}_{K,n} = { X : inf_{‖t0‖=1} (1/√n) Σ_{i=1}^n t0^tr m(X_i, β0) ≥ −K },  (5.7)
    A^{(2)}_{ε,n} = { X : ‖ (1/n) Σ_{i=1}^n m(X_i, β0)^{⊗2} − Σ ‖_2 ≤ ε },  (5.8)
    A^{(3)}_n = { X : (1/n) Σ_{i=1}^n ‖m(X_i, β0)‖^3 ≤ 2 · E‖m(X_1, β0)‖^3 },  (5.9)
    A^{(4)}_n = { X : (1/n) Σ_{i=1}^n m(X_i, β0)^{⊗2} ≥ (1/2) λ0 I }.  (5.10)

Then for any ε ∈ (0, λ0), there exist K_ε and N_ε such that

(i) for any n ≥ N_ε, P{ A^{(1)}_{K_ε,n} } ≥ 1 − ε;

(ii) the following limits are all identical to 1:

    lim_{n1→∞} P{ ∩_{n≥n1} A^{(2)}_{ε,n} },  lim_{n1→∞} P{ ∩_{n≥n1} A^{(3)}_n },  lim_{n1→∞} P{ ∩_{n≥n1} A^{(4)}_n }.

(b) Let Conv_n(β0) be the convex hull of the m(X_i, β0). For any ε > 0, there exist K_ε and N_ε such that P{A*_ε} ≥ 1 − ε, where

    A*_ε = ∩_{n≥N_ε} { A^{(1)}_{K_ε,n} ∩ A^{(2)}_{ε,n} ∩ A^{(3)}_n ∩ A^{(4)}_n },  (5.11)

and for n > N_ε and X ∈ A*_ε, P{ 0 ∈ int(Conv_n(β0)) | X ∈ A*_ε } ≥ 1 − ε, where in general int(·) denotes the interior of a set.

Now, let us go back to the EL problem, which is essentially solving

    max_p Π_{i=1}^n p_i,  where p = (p_1, …, p_n),
    subject to p_i > 0, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i m(X_i, β) = 0,  (5.12)

for any fixed β. Given β, by the concavity of Σ_{i=1}^n ln(p_i) in p, a unique maximum exists provided that 0 is in the interior of Conv_n(β), the convex hull of m(X_1, β), …, m(X_n, β). By Lemma 5.1.1, for any ε there exists N_ε such that for any n > N_ε and X ∈ A*_ε, P{ 0 ∈ int(Conv_n(β0)) | X ∈ A*_ε } > 1 − ε. Thus for n > N_ε, with β = β0, there exists a unique solution p̂_0 to the maximization problem in (5.12) with probability greater than 1 − ε.

To conclude that the unique maximizer over the p's for fixed β = β0 is a calculus maximizer, we need to show that there is a ball of dimension n − 1 − p for the p's within which to take derivatives, where p is the dimension of m(X_1, β0). Let 1 be a 1 × n vector with all entries equal to 1, let 0 be a p × 1 vector with all entries equal to 0, let M_0 = (m(X_1, β0), …, m(X_n, β0)) be a p × n matrix, and let M = (1^tr, M_0^tr)^tr be a (p + 1) × n matrix.
Lemma 5.1.1 guarantees the existence and uniqueness of the solution to maximization problem (5.12). Use p̂_0 to denote that solution; then p̂_0 belongs to

    P ≡ { p : M p = (1, 0^tr)^tr, all p_i > 0, Σ_{i=1}^n p_i = 1 }.  (5.13)

For any v in the null space of M and α ∈ R, define

    p = (1/(1 + α v · 1)) · (p̂_0 + α v);

we know that M p = (1, 0^tr)^tr, hence for small α such that the entries of p are all positive, p belongs to P. Furthermore, v is in the null space of M, and rank(M) = p + 1, so dim(P) = n − p − 1. Observing that X ∈ A^{(2)}_{ε,n} for ε < λ0, there is a relatively open set P of dimension n − p − 1 within which we can take derivatives.

Next, we give the form of the unique solution to (5.12) when β = β0 and n > N_ε, using Lagrange multipliers λ ∈ R and t ∈ R^p. Define G(p, λ, t) as

    G = Σ_{i=1}^n ln p_i − λ ( Σ_{i=1}^n p_i − 1 ) − n t^tr Σ_{i=1}^n p_i m(X_i, β0).

To maximize the concave function G, differentiate G with respect to p_i, i = 1, …, n, then set the derivatives to zero:

    ∂G/∂p_i = 1/p_i − λ − n t^tr m(X_i, β0) = 0,  i = 1, …, n.  (5.14)

Multiply the equations above by p_i, then add them together:

    n − λ Σ_{i=1}^n p_i − n t^tr Σ_{i=1}^n p_i m(X_i, β0) = 0.  (5.15)

By the constraints in (5.12), (5.15) implies that n − λ = 0, so λ = n. Therefore, from equation (5.14),

    p̂_i = 1 / ( n { 1 + t0^tr m(X_i, β0) } ).  (5.16)

Since Σ_{i=1}^n p̂_i m(X_i, β0) = 0, the vector t must solve the following equation:

    (1/n) Σ_{i=1}^n m(X_i, β0) / (1 + t^tr m(X_i, β0)) = 0.  (5.17)

Important Event  In Lemma 5.1.1, we proved that for any ε, there exists N_ε such that P{A*_ε} ≥ 1 − ε, where

    A*_ε = ∩_{n≥N_ε} { A^{(1)}_{K_ε,n} ∩ A^{(2)}_{ε,n} ∩ A^{(3)}_n ∩ A^{(4)}_n }.

Furthermore, let Σ(β, β0) = E_{β0}{ m(X_1, β)^{⊗2} }. Since Σ(β, β0) is continuous, and

    Σ = E_{β0}{ m(X_1, β0)^{⊗2} } = Σ(β0, β0)

is positive definite, there exists U_{β0}, a neighbourhood of β0, such that for all β in U_{β0} the matrix Σ(β, β0) is positive definite, i.e.,

    U_{β0} ≡ { β : Σ(β, β0) is positive definite }.  (5.18)

Then, consider the following class of functions:

    G ≡ { m(x, β) : β ∈ U_{β0} }.

In Chapter 4, we assumed

(A.11) Let α = (x, β) ∈ R^d × Θ with the Euclidean norm, and k(x, y, β) = k_α(y). There exists a measurable function b(y): R^d → R such that for any α1 ≠ α2,

    |k_{α1}(y) − k_{α2}(y)| ≤ b(y) ‖α1 − α2‖.

Since m(x, β) is Lipschitz continuous with respect to β, by Example 19.7 on page 271 of [41], we know that G is a Donsker class, which implies that

    sup_{β∈U_{β0}} ‖ (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} − Σ(β, β0) ‖_2 → 0,  a.s.,

therefore P{ A^{(5)}_{ε,n} } ≥ 1 − ε, where

    A^{(5)}_{ε,n} = { X : sup_{β∈U_{β0}} ‖ (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} − Σ ‖_2 ≤ ε/2 }.

Let λ0 be the smallest eigenvalue of Σ, and 0 < ε < 2λ0/3. For Lemma 5.1.2 through Theorem 5.1.1, we restrict our discussion to X ∈ A+_ε, where A+_ε = A*_ε ∩ A^{(5)}_{ε,n} and P{A+_ε} > 1 − 2ε.

The following lemma states the uniqueness of t = t(β) for β ∈ U_{β0} and X ∈ A+_ε. The proof proceeds by contradiction using Rolle's theorem.

Lemma 5.1.2 Under the assumptions of Lemma 5.1.1 and (A.11), for X ∈ A+_ε and any fixed β in the set U_{β0} defined in (5.18), if there exists a solution t to the equation g(t, X, β) = 0, then the solution must be unique a.s., where

    g(t, X, β) = (1/n) Σ_{i=1}^n m(X_i, β) / (1 + t^tr m(X_i, β)).

Note that the unique solution of (5.12), denoted by p̂_i in (5.16), must lie in (0, 1); therefore 1 + t^tr m(X_i, β) > 1/n for all i. Define

    D_β(X) = { t : for all i, 1 + t^tr m(X_i, β) > 1/n }.  (5.19)

Our goal is to show that for any fixed β ∈ U_{β0}, there is a unique solution t(β) to the maximization problem (5.12).

Theorem 5.1.1 Under the assumptions of Lemma 5.1.1, for any β ∈ U_{β0} and any X ∈ A+_ε, there exists a unique solution t(β) ∈ D_β(X) to the maximization problem (5.12).

5.1.2 Discontinuous Criterion Function

When the criterion function m(x, β) is no longer continuous with respect to β, many of the methods we used in the continuous case no longer apply, for example, Rolle's theorem in Lemma 5.1.2 and the continuation method in Theorem 5.1.1.
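Equations (5.16)–(5.17) translate directly into a small numerical routine: a damped Newton iteration for the Lagrange vector t, kept inside the region D_β of (5.19), followed by the weights p̂_i. The sketch below is illustrative only, assuming a generic array m of estimating-function values rather than any specific model.

```python
import numpy as np

def el_lagrange_t(m, max_iter=50, tol=1e-10):
    """Solve (1/n) sum_i m_i / (1 + t'm_i) = 0 for the Lagrange vector t
    by damped Newton, backtracking to stay in D_beta = {t : 1 + t'm_i > 1/n},
    then return (t, p_hat) with p_hat_i = 1 / (n (1 + t'm_i))."""
    m = np.asarray(m, float)
    n, p = m.shape
    t = np.zeros(p)                                # t = 0 is always feasible
    for _ in range(max_iter):
        w = 1.0 + m @ t                            # 1 + t'm_i
        g = (m / w[:, None]).mean(axis=0)          # the criterion g(t, X, beta)
        if np.linalg.norm(g) < tol:
            break
        # Jacobian of g: -(1/n) sum_i m_i m_i' / (1 + t'm_i)^2, negative definite
        H = -(m[:, :, None] * m[:, None, :] / (w**2)[:, None, None]).mean(axis=0)
        step = np.linalg.solve(H, g)
        t_new, a = t - step, 1.0
        while np.min(1.0 + m @ t_new) <= 1.0 / n:  # backtrack into D_beta
            a /= 2.0
            t_new = t - a * step
        t = t_new
    p_hat = 1.0 / (n * (1.0 + m @ t))
    return t, p_hat

rng = np.random.default_rng(0)
m = rng.normal(size=(100, 2)) + 0.1                # slightly off-center toy sample
t, p_hat = el_lagrange_t(m)
assert abs(p_hat.sum() - 1.0) < 1e-6               # weights form a probability vector
assert np.linalg.norm(p_hat @ m) < 1e-6            # EL constraint sum_i p_i m_i = 0
```

That the weights automatically sum to 1 at the root mirrors the identity used to eliminate λ in (5.15): at a solution of (5.17), t^tr g(t) = 0 forces Σ_i 1/(1 + t^tr m_i) = n.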
Therefore we seek different ways to obtain the existence and uniqueness of solutions to maximization problem (5.1) for β in some neighbourhood of β0. We prove a more general version of the "zero in the convex hull" theorem of the previous section; namely, we demonstrate that the statement is true not only for the convex hull Conv(β0), but also for Conv(β) with probability approaching 1, when β belongs to some neighbourhood of β0 to be specified, i.e.,

    0 ∈ Conv(β),  for β ∈ U*_{β0}.  (5.20)

If we can prove (5.20), then the maximization problem (5.1) is guaranteed to have a unique solution for any β ∈ U*_{β0}.

To reach this goal, instead of assigning each m(X_i, β) the constant probability mass 1/n as in Lemma 5.1.1, we construct a random probability vector as follows. Let W_i, i = 1, …, n, be i.i.d. random variables that follow an exponential distribution with E(W_1) = 1, and assume that {W_i}_{i=1}^n is independent of {X_i}_{i=1}^n. Let

    q(x, β) = dP_β(x)/dP_{β0}(x)  (5.21)

be the Radon-Nikodym derivative of P_β with respect to P_{β0}, and define

    W*_i = W_i q(X_i, β),  i = 1, …, n.  (5.22)

Then

    E{W*_1} = ∫∫ w (dP_β(x)/dP_{β0}(x)) dP_{β0}(x) dP_W(w) = ∫ { ∫ dP_β(x) } w dP_W(w) = ∫ w dP_W(w) = 1,

and

    E{W*_1 | X_1} = E{W_1 q(X_1, β) | X_1} = q(X_1, β).  (5.23)

Define V = (V_1, …, V_n) to be a vector in the simplex ∆_n defined in (5.72), where

    V_i = W*_i / Σ_{i=1}^n W*_i.  (5.24)

We would like to show that 0 ∈ int(Conv_n(β)) with the weights in (5.24). Assume that

(A.12) Σ2(β, β0) and Σ3(β, β0) are two continuous functions given by

    Σ2(β, β0) = E_{β0}{ q^2(X_1, β) m(X_1, β)^{⊗2} },  (5.25)
    Σ3(β, β0) = E_{β0}{ q^3(X_1, β) ‖m(X_1, β)‖^3 },  (5.26)

with Σ2(β0, β0) = E_{β0}{ m(X_1, β0)^{⊗2} } and Σ3(β0, β0) = E_{β0}{ ‖m(X_1, β0)‖^3 }.

Since Σ2(β, β0) and Σ3(β, β0) are both continuous with respect to β, we define

    U*_{β0} ≡ { β : Σ2(β, β0) is positive definite, and Σ3(β, β0) is bounded }.  (5.27)

Now we are ready to establish the "zero in the convex hull" theorem for discontinuous estimating functions.
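The construction (5.21)–(5.24) is straightforward to simulate. The sketch below assumes, purely for illustration, the location family P_β = N(β, 1) with β0 = 0, so that the Radon-Nikodym derivative is q(x, β) = exp(βx − β²/2); the tilted weights W*_i = W_i q(X_i, β) then have mean 1, and the normalized vector V lies in the simplex ∆_n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 5000, 0.3
X = rng.normal(size=n)                     # X_i ~ P_beta0 = N(0, 1)
W = rng.exponential(scale=1.0, size=n)     # W_i ~ Exp(1), E W_1 = 1, W independent of X
q = np.exp(beta * X - beta**2 / 2.0)       # q(X_i, beta) for the assumed N(beta,1) family
W_star = W * q                             # W*_i = W_i q(X_i, beta), so E W*_1 = 1
V = W_star / W_star.sum()                  # random probability vector (5.24)

assert abs(V.sum() - 1.0) < 1e-9           # V lies in the simplex
assert (V > 0).all()
assert abs(W_star.mean() - 1.0) < 0.1      # Monte Carlo check of E W*_1 = 1
```

Because E{W*_1 | X_1} = q(X_1, β), averaging m(X_i, β) with these random weights effectively re-centers the sample under P_β, which is exactly why the convex-hull argument carries over to a whole neighbourhood of β0.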
The following lemma is a generalization of Lemma 5.1.1, which essentially states that 0 ∈ Conv(β) for all β ∈ U*_{β0}.

Lemma 5.1.3 Assume that (5.4)–(5.6) in Lemma 5.1.1 and (A.12) are satisfied. For β ∈ U*_{β0} defined in (5.27) and any constant K > 0, let X = {X_i}_{i=1}^∞, and let t0 be a unit vector in R^p.

(a) Let λ2(β, β0) be the smallest eigenvalue of Σ2(β, β0). Define

    B^{(1)}_{K,n} = { X : inf_{‖t0‖=1} (1/√n) Σ_{i=1}^n t0^tr q(X_i, β) m(X_i, β) ≥ −K },  (5.28)
    B^{(2)}_{ε,n} = { X : ‖ (1/n) Σ_{i=1}^n (q(X_i, β) m(X_i, β))^{⊗2} − Σ2(β, β0) ‖_2 ≤ ε },  (5.29)
    B^{(3)}_n = { X : (1/n) Σ_{i=1}^n ‖q(X_i, β) m(X_i, β)‖^3 ≤ 2 · Σ3(β, β0) },  (5.30)
    B^{(4)}_n = { X : (1/n) Σ_{i=1}^n (q(X_i, β) m(X_i, β))^{⊗2} ≥ (1/2) λ2(β, β0) I }.  (5.31)

Then for any ε, there exist K_ε and N_ε such that

(i) for any n ≥ N_ε, P{ B^{(1)}_{K_ε,n} } ≥ 1 − ε;

(ii) the following limits are all identical to 1:

    lim_{n1→∞} P{ ∩_{n≥n1} B^{(2)}_{ε,n} } = lim_{n1→∞} P{ ∩_{n≥n1} B^{(3)}_n } = lim_{n1→∞} P{ ∩_{n≥n1} B^{(4)}_n }.

(b) Let Conv_n(β) be the convex hull of the m(X_i, β). For any ε > 0, there exist K_ε and N_ε such that P{B*_ε} ≥ 1 − ε, where

    B*_ε = ∩_{n≥N_ε} { B^{(1)}_{K_ε,n} ∩ B^{(2)}_{ε,n} ∩ B^{(3)}_n ∩ B^{(4)}_n },  (5.32)

and for n > N_ε and X ∈ B*_ε, P{ 0 ∈ int(Conv_n(β)) | X ∈ B*_ε } ≥ 1 − ε.

5.2 Extended Estimating Equation

In this section, we extend the conclusions of the previous section to the case where the constraint of the EL maximization is an EEE. In Section 5.2.1, we prove the uniqueness of solutions to the EL maximization in (5.1) for m_n(x, x, β) that is continuous with respect to β. In Section 5.2.2, we discuss the case when m_n(x, x, β) is no longer continuous with respect to β.

5.2.1 Continuous Criterion Function

We can make the same conclusion for m_n(X_i, X, β0) if we can show that (a) in Lemma 5.1.1 holds. Note that the conclusions for (5.7) and (5.9) are guaranteed by the asymptotic normality of n^{−1/2} Σ_{i=1}^n m_n(X_i, X, β0) and the order of Σ_{i=1}^n ‖m_n(X_i, X, β0)‖^3, respectively, which have been proved in Lemma 4.2.2 and Proposition 4.2.2.
As for the conclusions regarding (5.8) and (5.10), it suffices to show that

    (1/n) Σ_{i=1}^n m_n(X_i, X, β0)^{⊗2} →_P Σ.  (5.33)

We discuss the conditions for (5.33) in the following proposition, then state and prove a result for continuous m_n(x, x, β) that is parallel to Lemma 5.1.1.

Proposition 5.2.1 Let Y_i, i = 1, 2, be random variables i.i.d. as X_1 and independent of X = (X_1, …, X_n). Under assumptions (A.2), (A.4), (A.9), and (A.10),

    (1/n) Σ_{i=1}^n m_n(X_i, X, β0)^{⊗2} →_P Σ,  (5.34)

where

    Σ = E { Q(Y_1, β0)^2 [ C(Y_1) − E{C(Y_2) k(Y_1, Y_2, β0) | Y_1} / E{k(Y_1, Y_2, β0) | Y_1} ]^{⊗2} }.  (5.35)

Next, we present a result parallel to Lemma 5.1.1 for extended estimating equations.

Lemma 5.2.1 For i.i.d. random variables X_n = (X_1, …, X_n), assume that the estimating function m_n(X_i, X, β0) satisfies (A.2), (A.9) and (A.10). For any constant K > 0, let X∞ = {X_i}_{i=1}^∞, and let t0 be a unit vector in R^p.

(a) Let λ0 be the smallest eigenvalue of Σ, and let I ∈ R^{p×p} be the identity matrix. Define

    C^{(1)}_{K,n} = { X∞ : inf_{‖t0‖=1} (1/√n) Σ_{i=1}^n t0^tr m_n(X_i, X, β0) ≥ −K },  (5.36)
    C^{(2)}_{ε,n} = { X∞ : ‖ (1/n) Σ_{i=1}^n m_n(X_i, X, β0)^{⊗2} − Σ ‖_2 ≤ ε },  (5.37)
    C^{(3)}_n = { X∞ : (1/n) Σ_{i=1}^n ‖m_n(X_i, X, β0)‖^3 ≤ C },  (5.38)
    C^{(4)}_n = { X∞ : (1/n) Σ_{i=1}^n m_n(X_i, X, β0)^{⊗2} ≥ (1/2) λ0 I },  (5.39)

where C in (5.38) is a constant in R greater than E‖m_n(X_1, X, β0)‖^3. Then for any ε, there exist K_ε and N_ε such that

(i) for any n ≥ N_ε, P{ C^{(1)}_{K_ε,n} } ≥ 1 − ε;

(ii) the following limits are all identical to 1:

    lim_{n1→∞} P{ ∩_{n≥n1} C^{(2)}_{ε,n} } = lim_{n1→∞} P{ ∩_{n≥n1} C^{(3)}_n } = lim_{n1→∞} P{ ∩_{n≥n1} C^{(4)}_n }.

(b) Let Conv_n(β0) be the convex hull of the m_n(X_i, X, β0). For any ε > 0, there exist K_ε and N_ε such that P{C*_ε} ≥ 1 − ε, where

    C*_ε = ∩_{n≥N_ε} { C^{(1)}_{K_ε,n} ∩ C^{(2)}_{ε,n} ∩ C^{(3)}_n ∩ C^{(4)}_n },  (5.40)

and for n > N_ε and X∞ ∈ C*_ε, P{ 0 ∈ int(Conv_n(β0)) } ≥ 1 − ε.

Now, let us go back to the EL problem, which for any fixed β is essentially the following maximization problem:

    max_p Π_{i=1}^n p_i,  where p = (p_1, …, p_n),
    subject to p_i ≥ 0, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i m_n(X_i, X, β) = 0.  (5.41)

Note that for a given β, a unique maximum exists provided that 0 is in the interior of Conv(β), the convex hull of m_n(X_1, X, β), …, m_n(X_n, X, β). Since by Lemma 5.2.1, for any ε, there exists N_ε such that for any n > N_ε and X∞ ∈ C*_ε, P{ 0 ∈ int(Conv(β0)) } > 1 − ε, we know that for n > N_ε there exists a unique solution, denoted by p̂_0, to the maximization problem in (5.41) with β = β0, with probability greater than 1 − ε.

To know that the unique maximizer is a calculus maximizer, we need to know that there is a ball of dimension n − 1 − p for the p's within which to take derivatives, where p is the dimension of m_n(X_1, X, β). Let 1 be a 1 × n vector with all entries equal to 1, let 0 be a p × 1 vector with all entries equal to 0, let M_0 = (m_n(X_1, X, β), …, m_n(X_n, X, β)) be a p × n matrix, and let M = (1^tr, M_0^tr)^tr be a (p + 1) × n matrix. Suppose there exists one solution to the maximization problem (5.41), denoted by p̂_0. Then p̂_0 belongs to

    P ≡ { p : M p = (1, 0^tr)^tr, all p_i > 0, Σ_{i=1}^n p_i = 1 }.  (5.42)

For any v in the null space of M and α ∈ R, define

    p = (1/(1 + α v · 1)) · (p̂_0 + α v);

we know that M p = (1, 0^tr)^tr, hence for small α such that the entries of p are all positive, p belongs to P. Furthermore, v is in the null space of M, and rank(M) = p + 1, so dim(P) = n − p − 1. Therefore there is a relatively open set P of dimension n − p − 1 within which we can take derivatives.

Next, we give the form of the unique solution to (5.41) when β = β0 and n > N_ε, using Lagrange multipliers λ ∈ R and t ∈ R^p. Define G(p, λ, t) as

    G = Σ_{i=1}^n ln p_i − λ ( Σ_{i=1}^n p_i − 1 ) − n t^tr Σ_{i=1}^n p_i m_n(X_i, X, β0).

To maximize the concave function G, differentiate G with respect to p_i, i = 1, …, n, then set the derivatives to zero:

    ∂G/∂p_i = 1/p_i − λ − n t^tr m_n(X_i, X, β0) = 0,  i = 1, …, n.  (5.43)

Multiply the equations above by p_i, then add them together:

    n − λ Σ_{i=1}^n p_i − n t^tr Σ_{i=1}^n p_i m_n(X_i, X, β0) = 0.  (5.44)

By the constraints in (5.41), (5.44) indicates that n − λ = 0, so λ = n. Therefore, from equation (5.43),

    p̂_i = 1 / ( n { 1 + t0^tr m_n(X_i, X, β0) } ).  (5.45)

Since Σ_{i=1}^n p̂_i m_n(X_i, X, β0) = 0, the vector t must solve the following equation:

    (1/n) Σ_{i=1}^n m_n(X_i, X, β0) / (1 + t^tr m_n(X_i, X, β0)) = 0.  (5.46)

Important Event  In Lemma 5.2.1, we proved that for any ε, there exists N_ε such that P{C*_ε} ≥ 1 − ε, where

    C*_ε = ∩_{n≥N_ε} { C^{(1)}_{K_ε,n} ∩ C^{(2)}_{ε,n} ∩ C^{(3)}_n ∩ C^{(4)}_n }.

Furthermore, let

    Σ(β, β0) = E { Q(Y_1, β)^2 [ C(Y_1) − E{C(Y_2) k(Y_1, Y_2, β) | Y_1} / E{k(Y_1, Y_2, β) | Y_1} ]^{⊗2} }.  (5.47)

Since Σ(β, β0) is continuous, and by (5.35),

    Σ = E { Q(Y_1, β0)^2 [ C(Y_1) − E{C(Y_2) k(Y_1, Y_2, β0) | Y_1} / E{k(Y_1, Y_2, β0) | Y_1} ]^{⊗2} } = Σ(β0, β0)

is positive definite, there exists U_{β0}, a neighbourhood of β0, such that for all β in U_{β0} the matrix Σ(β, β0) is positive definite, i.e.,

    U_{β0} ≡ { β : Σ(β, β0) is positive definite }.  (5.48)

Then, consider the following class of functions:

    G ≡ { m_n(X_i, X, β) : β ∈ U_{β0} }.

By Example 19.7 on page 271 of [41] and Lemma 4.2.1, we know that the family G is a Donsker class, which implies that

    sup_{β∈U_{β0}} ‖ (1/n) Σ_{i=1}^n m_n(X_i, X, β)^{⊗2} − Σ(β, β0) ‖_2 → 0,  a.s.,

therefore P{ C^{(5)}_{ε,n} } ≥ 1 − ε, where

    C^{(5)}_{ε,n} = { X : sup_{β∈U_{β0}} ‖ (1/n) Σ_{i=1}^n m_n(X_i, X, β)^{⊗2} − Σ ‖_2 ≤ ε/2 }.

Let λ0 be the smallest eigenvalue of Σ, and 0 < ε < λ0. For Lemma 5.2.2 through Theorem 5.2.1, we restrict our discussion to X∞ ∈ C+_ε, where

    C+_ε = C*_ε ∩ C^{(5)}_{ε,n}  (5.49)

and P{C+_ε} > 1 − 2ε.
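For concreteness, the EL constraint Σ_i p_i m_n(X_i, X, β) = 0 in (5.41) uses summands of the ratio-of-sums form m_n(X_i, X, β) = Q(X_i, β){C(X_i) − Σ_j C(X_j)k(X_i, X_j, β)/Σ_j k(X_i, X_j, β)}. A generic sketch of this construction follows, with placeholder choices of Q, C and k (an at-risk style indicator) standing in for the model-specific choices such as (5.61).

```python
import numpy as np

def eee_criterion(X, beta, Q, C, k):
    """Evaluate the EEE summands
        m_n(X_i, X, beta) = Q(X_i, beta) * ( C(X_i)
            - sum_j C(X_j) k(X_i, X_j, beta) / sum_j k(X_i, X_j, beta) )
    for all i; returns an (n, p) array.  Q, C, k are user-supplied."""
    n = len(X)
    K = np.array([[k(X[i], X[j], beta) for j in range(n)] for i in range(n)])
    Cx = np.array([C(x) for x in X], float)          # shape (n, p)
    num = K @ Cx                                     # sum_j C(X_j) k(X_i, X_j, beta)
    den = K.sum(axis=1, keepdims=True)               # sum_j k(X_i, X_j, beta)
    Qx = np.array([Q(x, beta) for x in X], float)
    return Qx[:, None] * (Cx - num / den)

# toy illustration: indicator kernel (so den >= 1 since j = i is included)
rng = np.random.default_rng(0)
X = rng.normal(size=50)
m = eee_criterion(X, 0.0,
                  Q=lambda x, b: 1.0,
                  C=lambda x: np.array([x]),
                  k=lambda x, y, b: float(y >= x))   # at-risk style indicator
assert m.shape == (50, 1)
assert np.isfinite(m).all()
```

The resulting array can be fed directly to the Lagrange-multiplier routine for (5.46); the only structural difference from the CEE case is that each summand depends on the whole sample through the ratio of sums.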
Lemma 5.2.2 Under the assumptions of Lemma 4.2.1 and Lemma 5.1.1, for X∞ ∈ C+_ε and any β in the set U_{β0} defined in (5.48), if there exists a solution t to the equation g(t, X, β) = 0, then the solution must be unique, where

    g(t, X, β) = (1/n) Σ_{i=1}^n m_n(X_i, X, β) / (1 + t^tr m_n(X_i, X, β)).  (5.50)

Note that the unique solution of (5.41), denoted by p̂_i in (5.45), must lie in (0, 1); therefore 1 + t^tr m_n(X_i, X, β0) > 1/n for all i. Define

    D_β(X) = { t : for all i, 1 + t^tr m_n(X_i, X, β) > 1/n }.  (5.51)

Our goal is to show that for any fixed β ∈ U_{β0}, there is a unique solution t(β) to the maximization problem (5.41).

Theorem 5.2.1 Under the assumptions of Lemma 5.2.1, for any β ∈ U_{β0} and X ∈ C+_ε, there exists a unique solution t(β) to the maximization problem (5.41).

5.2.2 Discontinuous Criterion Function

Similar to the strategy used in the proof under the CEE setting, when the criterion function m_n(X_i, X, β) is no longer continuous with respect to β, we use random variables instead of the fixed constant 1/n as the probability mass assigned to each m_n(X_i, X, β), i = 1, …, n. More specifically, let P_{β0}(x) be the cumulative distribution function of X_1, and let q(x, β) be the Radon-Nikodym derivative of P_β(x) with respect to P_{β0}(x), i.e.,

    q(x, β) = dP_β(x)/dP_{β0}(x).  (5.52)

Let Y_1 ∼ P_{β0} be independent of X∞ = {X_i}_{i=1}^∞, and define

    V(X_i, β) = Q(X_i, β) { C(X_i) − E{C(Y_1) k(X_i, Y_1, β) | X_i} / E{k(X_i, Y_1, β) | X_i} }.  (5.53)

We claim that E{V(X_i, β0)} = 0 because

    0 = E{m_n(X_i, X, β0)} = lim_{n→∞} E{m_n(X_i, X, β0)} = E{ lim_{n→∞} m_n(X_i, X, β0) } = E{ E{ lim_{n→∞} m_n(X_i, X, β0) | X_1 } } = E{V(X_1, β0)},  (5.54)

therefore

    E_{β0}{ q(X_1, β) V(X_1, β) } = ∫ V(x, β) · (dP_β(x)/dP_{β0}(x)) dP_{β0}(x) = 0.

Assume that

    E_{β0}{ q^2(X_1, β) } < ∞,  for all β ∈ U_{β0};  (5.55)

then by the CLT, for any t0 ∈ R^p with ‖t0‖ = 1 and Σ*(β, β0) = E_{β0}{ q^2(X_1, β) V(X_1, β)^{⊗2} },

    (1/√n) Σ_{i=1}^n q(X_i, β) t0^tr V(X_i, β) →_D N(0, t0^tr Σ*(β, β0) t0).  (5.56)

At this point, we would like to point out two facts.
First, by (5.54), V(X_i, β) is eligible as a criterion function of a CEE, and based on the discussion in the previous sections, there exists U_{β0} such that 0 ∈ Conv(β) for all β ∈ U_{β0}. Another heuristic observation is that m_n(X_i, X, β) and V(X_i, β) are "close" in some sense, because they differ only in the quotient term, and by the LLN and Slutsky's lemma, the quotient term in m_n(X_i, X, β) converges in probability to that in V(X_i, β). However, in order to pass the desirable feature of V(X_i, β) to m_n(X_i, X, β), for any fixed β ∈ U_{β0} we need ‖m_n(x, X, β) − V(x, β)‖ to be bounded uniformly in x at the n^{−1/2} rate, i.e.,

    sup_{x∈X} ‖m_n(x, X, β) − V(x, β)‖ = O_P(n^{−1/2}).  (5.57)

By Example 19.11 in [41] by van der Vaart, if we know that

(i) the function k(x, ·, ·) has bounded variation;

(ii) for any ε > 0 and δ_ε > 0, letting X_ε ≡ { x : E{k(x, X, β)} > δ_ε }, for any constant c_ε > 0 there exists N_ε such that for any n > N_ε,

    P{ ‖ (1/√n) Σ_{i: X_i ∈ X_ε^c} m_n(X_i, X, β) ‖ ≥ c_ε } < ε,  (5.58)

then (5.57) is guaranteed.

Since, up to now, the broadest class of semiparametric models for which the structural parameter can be defined via an EEE is the ϕ-transformation model discussed in Chapter 3, for the rest of this chapter we restrict our attention to m_n(X_i, X, β) under the ϕ-transformation model assumptions.

Recall that under the ϕ-transformation model assumption, k(x, ·, ·) is an indicator function, therefore (i) above is satisfied. Let us now consider a martingale assumption that guarantees (ii). Assume that X_i = (T_i, C_i, Z_i), where the lifetime T_i and the right-censoring variable C_i are independent conditional on the covariate Z_i. Recall that under the ϕ-transformation model assumption,

    T_i = ϕ(ε_i, β0^tr Z_i),  (5.59)

where ϕ(x, ·) is strictly increasing, and ε_i is the residual with distribution function F(x) and hazard rate λ(x). Let ζ_i be the residual when the lifetime is censored, i.e., C_i = ϕ(ζ_i, β0^tr Z_i).
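For the location-shift special case ϕ(u, a) = u + a, the residual-scale estimating function S_n(β) of (5.62) below reduces to a familiar log-rank-type AFT statistic computed on the residuals e_i = V_i − β^tr Z_i. A simulation sketch follows; the Gumbel errors, the censoring mechanism, and the default C(z) = z are all illustrative assumptions, not choices made in the text.

```python
import numpy as np

def aft_logrank_Sn(V, delta, Z, beta, C=None):
    """Location-shift case phi(u, a) = u + a: with residuals e_i = V_i - beta'Z_i,
    the at-risk kernel is k(X_i, X_j, beta) = I{e_j >= e_i}, and
        S_n(beta) = sum_i delta_i * ( C(Z_i) - Cbar(e_i, beta) ),
    where Cbar averages C over the residual-scale risk set."""
    if C is None:
        C = lambda z: z                       # illustrative default weight
    e = V - Z @ beta                          # residuals of the observed times
    Cz = np.array([C(z) for z in Z], float)
    S = np.zeros(Cz.shape[1])
    for i in range(len(V)):
        if delta[i]:                          # only uncensored times contribute terms
            at_risk = e >= e[i]               # residual-scale at-risk indicator
            S += Cz[i] - Cz[at_risk].mean(axis=0)
    return S

rng = np.random.default_rng(0)
n, beta0 = 400, np.array([0.5])
Z = rng.normal(size=(n, 1))
T = Z @ beta0 + rng.gumbel(size=n)            # lifetimes under the assumed model
Cen = Z @ beta0 + rng.gumbel(size=n) + 0.5    # censoring, independent of T given Z
V, delta = np.minimum(T, Cen), (T <= Cen)

S_true = aft_logrank_Sn(V, delta, Z, beta0)
S_off = aft_logrank_Sn(V, delta, Z, np.array([2.0]))
assert abs(S_true[0]) < abs(S_off[0])         # roughly centered at the true beta0
```

The comparison at the end reflects the martingale property E_{β0}{S_n(β0)} = 0: evaluated at the true parameter the statistic fluctuates near zero, while a misspecified β induces a systematic drift of order n.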
Using the classic method in [21], construct the compensated counting-process martingale

    M_i{ϕ(u, β0^tr Z_i)} = N_i{ϕ(u, β0^tr Z_i)} − ∫_{−∞}^u λ(x) Y_i{ϕ(x, β0^tr Z_i)} dx,  (5.60)

which is a martingale with respect to the measure P_{β0} and the filtration

    F_n(u) = σ( X_i, N_i(ϕ(s, β0^tr Z_i)), Y_i(ϕ(s, β0^tr Z_i)) : s ≤ u, i = 1, …, n ),

where V_i = min(T_i, C_i) is the observed time, ∆_i = I{T_i ≤ C_i} is the non-censoring indicator, N_i(x) = I{T_i ≤ x, ∆_i = 1}, and Y_i(x) = I{V_i ≥ x} is the at-risk indicator. With

    Q(X_i, β) = ∆_i,  C(X_i) = C(Z_i),  k(X_i, X_j, β) = Y_j(ϕ(ε_i, β0^tr Z_j)),  (5.61)

we can write

    S_n(β0) = Σ_{i=1}^n m_n(X_i, X, β0)
            = Σ_{i=1}^n Q(X_i, β) { C(X_i) − Σ_{j=1}^n C(X_j) k(X_i, X_j, β) / Σ_{j=1}^n k(X_i, X_j, β) }  (5.62)
            = Σ_{i=1}^n ∆_i { C(Z_i) − Σ_{j=1}^n C(Z_j) Y_j{ϕ(ε_i, β0^tr Z_j)} / Σ_{j=1}^n Y_j{ϕ(ε_i, β0^tr Z_j)} }.

By (5.60), together with the model assumption (5.59), on the residual scale,

    S_n(β0) = Σ_{i=1}^n ∫ dN_i{ϕ(u, β0^tr Z_i)} { C(Z_i) − C̄(u, β0) }
            = Σ_{i=1}^n ∫ dM_i{ϕ(u, β0^tr Z_i)} { C(Z_i) − C̄(u, β0) },  (5.63)

where

    C̄(u, β0) = Σ_{j=1}^n C(Z_j) Y_j{ϕ(u, β0^tr Z_j)} / Σ_{j=1}^n Y_j{ϕ(u, β0^tr Z_j)}.

Note that E_{β0}{S_n(β0)} = 0 due to the martingale property. Recall that the predictable variation process satisfies 〈dM_i, dM_i〉 = λ(u) Y_i{ϕ(u, β0^tr Z_i)} du, as mentioned in [1]; hence

    Var{S_n(β0)} = E { Σ_{i=1}^n ∫ { C(Z_i) − C̄(u, β0) }^2 λ(u) P{ V_i ≥ ϕ(u, β0^tr Z_i) | Z_i } du }.

Under assumption (A.2), ‖C(x)‖ < b for all x ∈ X, therefore

    Var{ n^{−1/2} S_n(β0) } ≤ (4b^2/n) Σ_{i=1}^n ∫ λ(u) S(u) du ≤ 4b^2.

Now that we have established (5.57), for any ε > 0 there exist C_ε and N_ε such that for any n > N_ε,

    sup_x ‖m_n(x, X, β) − V(x, β)‖ ≤ C_ε/√n,  with probability no less than 1 − ε,

therefore P{ C^{(6)}_{ε,n} } > 1 − ε, where

    C^{(6)}_{ε,n} ≡ { X : sup_x ‖m_n(x, X, β) − V(x, β)‖ ≤ C_ε/√n }.  (5.64)

Let q(x, β) be the Radon-Nikodym derivative defined in (5.52); as we assumed in (5.55), E{q(X_1, β)^2} < ∞ for any β ∈ U_{β0}. Then by the law of large numbers, for any fixed β ∈ U_{β0} we know that P{ C^{(7)}_{ε,n} } > 1 − ε, where

    C^{(7)}_{ε,n} ≡ { X : (1/n) Σ_{i=1}^n q(X_i, β)^2 < 2 E_{β0}{ q(X_1, β)^2 } }.  (5.65)

Now we can update the definition of C+_ε in (5.49) by including C^{(6)}_{ε,n} and C^{(7)}_{ε,n}, namely

    C+_ε ≡ C*_ε ∩ ( ∩_{i=5}^7 C^{(i)}_{ε,n} ),  (5.66)

and P{C+_ε} > 1 − 4ε for n > N_ε.

Theorem 5.2.2 Under assumption (5.57), for any β ∈ U_{β0} and X∞ ∈ C+_ε, 0 belongs to int(Conv(β)).

5.3 Some Proofs

5.3.1 Proofs Under the CEE Setting

Let us state two widely used lemmas without proof. The Lyapunov Central Limit Theorem can be found in Chapter 27 of [6], and the Hyperplane Separation Theorem in Chapter 2 of [7].

Lemma 5.3.1 (Lyapunov Central Limit Theorem) Suppose Z_1, …, Z_n are independent random variables, each with finite expected value μ_i and variance σ_i^2. Define

    s_n^2 = Σ_{i=1}^n σ_i^2.

If for some δ > 0 the Lyapunov condition

    (1/s_n^{2+δ}) Σ_{i=1}^n E[ |Z_i − μ_i|^{2+δ} ] → 0  (5.67)

as n → ∞ is satisfied, then the normalized sum of the (Z_i − μ_i) converges in distribution to a standard normal random variable as n goes to infinity:

    (1/s_n) Σ_{i=1}^n (Z_i − μ_i) →_D N(0, 1).  (5.68)

Lemma 5.3.2 (Hyperplane Separation Theorem) Let A and B be two disjoint nonempty convex sets. If A is open, then there exist a nonzero vector v and a real number c such that 〈x, v〉 > c and 〈y, v〉 ≤ c for all x in A and y in B.

In general, for two matrices A and B in R^{p×p},

    A > B ⇔ A − B is positive definite.

Proof of Lemma 5.1.1

First, let us show that (a) is true. Under assumptions (5.4)–(5.6), by the law of large numbers, we know that (ii) is true. Note that by the multivariate central limit theorem,

    S_n = (1/√n) Σ_{i=1}^n m(X_i, β0) →_D S∞ ∼ N(0, Σ).  (5.69)

Since S∞ follows a multivariate normal distribution, for any ε > 0 and any unit vector t0 ∈ R^p, there exists K_ε such that

    P{ inf_{t0} t0^tr S∞ < −K_ε } < ε/2.  (5.70)

By (5.69), there exists N_ε such that for any n > N_ε,

    | P{ inf_{t0} t0^tr S_n < −K_ε } − P{ inf_{t0} t0^tr S∞ < −K_ε } | < ε/2.  (5.71)

Combining (5.70) and (5.71), we conclude that P{ inf_{t0} t0^tr S_n < −K_ε } < ε, hence the conclusion in (i) is also true.

Next, we prove (b) using Lemma 5.3.2. Let Conv_n(β0) be the convex hull of the m(X_i, β0); we want to show that the two sets

    A = {0},  B = int(Conv_n(β0)) = { Σ_{i=1}^n v_i m(X_i, β0) : v ∈ ∆_n }

are not separated, where

    ∆_n = { v = (v_1, …, v_n) : Σ_{i=1}^n v_i = 1, v_i ∈ (0, 1) }.  (5.72)

By Lemma 5.3.2, the statement 0 ∈ int(Conv_n(β0)) is equivalent to:

    for all t0 ∈ R^p with ‖t0‖ = 1 and all a ≤ 0,
    there exists z ∈ int(Conv_n(β0)) such that t0^tr z > a.  (5.73)

To prove (5.73), we will show in the following paragraphs that for any ε > 0, nonpositive constant a = −c/√n and vector t0 ∈ R^p with ‖t0‖ = 1, there exists N_ε such that for n > N_ε, given X ∈ A*_ε,

    inf_{t0} P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ −c/√n | X } > 0,  for X ∈ A*_ε,  (5.74)

where we consider V = v as a continuously distributed random vector in the simplex ∆_n, and z in (5.73) for each t0 is a value Σ_{i=1}^n V_i(ω) m(X_i, β0) for ω in the event where the probability in (5.74) is positive. Note that (5.74) is true if we can prove the case a = 0, namely, for n > N_ε,

    inf_{t0} P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ 0 | X } > 0,  for X ∈ A*_ε.  (5.75)

We prove (5.75) by constructing V in the following way. Let W_i be i.i.d. random variables that follow an exponential distribution with mean equal to 1, and let V = (V_1, …, V_n) with

    V_i = W_i / Σ_{i=1}^n W_i.  (5.76)

Let W̄ = Σ_{i=1}^n W_i/n and let t0 be a unit vector in R^p; then the probability in (5.75) can be written as

    P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ 0 | X }
    = P{ (1/√n) t0^tr Σ_{i=1}^n W_i m(X_i, β0) ≥ 0 | X }
    = P{ (1/√n) t0^tr Σ_{i=1}^n m(X_i, β0) + (1/√n) t0^tr Σ_{i=1}^n (W_i − 1) m(X_i, β0) ≥ 0 | X }
    = P{ (1/√n) Σ_{i=1}^n t0^tr (W_i − 1) m(X_i, β0) ≥ −(1/√n) Σ_{i=1}^n t0^tr m(X_i, β0) | X }.  (5.77)

Combining (5.77) and conclusion (i) in (a), we obtain, for n > N_ε, any nonnegative constant K_ε, and X ∈ A*_ε,

    P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ 0 | X } ≥ P{ (1/√n) Σ_{i=1}^n t0^tr (W_i − 1) m(X_i, β0) ≥ K_ε | X }.  (5.78)

Next, we show that the term n^{−1/2} Σ_{i=1}^n t0^tr (W_i − 1) m(X_i, β0) in (5.78) satisfies (5.67) in Lemma 5.3.1, with the choice δ = 1, Z_i = t0^tr (W_i − 1) m(X_i, β0), and

    μ_i = E{ (W_i − 1) · t0^tr m(X_i, β0) | X_i } = 0;
    σ_i^2 = Var{ (W_i − 1) · t0^tr m(X_i, β0) | X_i } = t0^tr m(X_i, β0)^{⊗2} t0.

Let s_n^2 = Σ_{i=1}^n t0^tr m(X_i, β0)^{⊗2} t0; then the left-hand side of (5.67) becomes

    (1/s_n^3) Σ_{i=1}^n E{ |(W_i − 1) · t0^tr m(X_i, β0)|^3 | X_i } ≤ 2 · ( (1/n) Σ_{i=1}^n ‖m(X_i, β0)‖^3 ) · (1/√n) · (n/s_n^2)^{3/2}.  (5.79)

By Lemma 5.3.1, given X ∈ A*_ε, it follows that n^{−1/2} Σ_{i=1}^n t0^tr (W_i − 1) m(X_i, β0) is asymptotically normal, i.e., for any unit vector s ∈ R^p,

    n^{−1/2} Σ_{i=1}^n (W_i − 1) · s^tr m(X_i, β0) →_D N(0, s^tr Σ s),  with probability 1.  (5.80)

Note that by the strong law of large numbers, W̄ converges to E W_1 = 1 almost surely; therefore for n > N_ε, constant K_ε > 0, and X ∈ A*_ε,

    P{ inf_s (1/√n) Σ_{i=1}^n (W_i − 1) s^tr m(X_i, β0) ≥ K_ε | X } > 0.  (5.81)

Combining (5.77), (5.78) and (5.81), we get that for any n > N_ε, constant a = −c/√n, unit vector t0 ∈ R^p, and X ∈ A*_ε,

    P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ 0 | X } ≥ P{ inf_{s∈R^p: ‖s‖=1} (1/√n) Σ_{i=1}^n (W_i − 1) s^tr m(X_i, β0) ≥ K_ε | X } > 0,  (5.82)

and hence we complete the proof.  □

Lemma 5.3.3 (Implicit Function Theorem) Let f: R^{n+m} → R^m be a continuously differentiable function, and let R^{n+m} have coordinates (x, y). Fix a point (a, b) = (a_1, …, a_n, b_1, …, b_m) with f(a, b) = c, where c ∈ R^m. If the Jacobian matrix J_{f,y}(a, b) = [ (∂f_i/∂y_j)(a, b) ] is invertible, then there exist an open set U containing a, an open set V containing b, and a unique continuously differentiable function g: U → V such that { (x, g(x)) | x ∈ U } = { (x, y) ∈ U × V | f(x, y) = c }.

Proof of Lemma 5.1.2

For ε smaller than λ0, the smallest eigenvalue of Σ, on the event A+_ε, for any fixed β ∈ U_{β0},

    ‖ (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} − Σ(β, β0) ‖_2 < ε/2.

Since β ∈ U_{β0}, we know that Σ(β, β0) is positive definite, hence

    (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} is positive definite for β ∈ U_{β0}.  (5.83)

By contradiction, we can show that the conclusion is true for any fixed β ∈ U_{β0}. Suppose that for a fixed β ∈ U_{β0}, there exist distinct t1 and t2 such that g(t1, X, β) = g(t2, X, β) = 0. Then by Rolle's theorem from [15], there exists s ∈ (0, 1) such that for t3 = s t1 + (1 − s) t2, ∂g(t3, X, β)/∂s = 0, i.e.,

    (t1 − t2)^tr { (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} / (1 + t3^tr m(X_i, β))^2 } (t1 − t2) = 0,  (5.84)

which implies that

    (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} is singular,

contradicting (5.83).  □

Proof of Theorem 5.1.1

Define C in the following way:

    C ≡ { β* ∈ U_{β0} : ∃ r_{β*} > 0 such that for any β ∈ B_{r_{β*}}(β*), there exists t = t(β) such that g(t, X, β) = 0 },  (5.85)

where

    g(t, X, β) = (1/n) Σ_{i=1}^n m(X_i, β) / (1 + t^tr m(X_i, β)).  (5.86)

First, we show that any fixed β* such that the equation g(t, X, β*) = 0 has a solution t = t* belongs to C. By (5.86), g(t*, X, β*) = 0 implies that 0 belongs to int(Conv_n(β*)); consequently, there exists a unique solution to the maximization problem (5.12) with the choice β = β*. Since this unique solution is bounded between 0 and 1, i.e.,

    (1/n) · 1/(1 + t*^tr m(X_i, β*)) ∈ (0, 1)  for i = 1, …, n,

we know that t* ∈ D_{β*}(X). Since g(t*, X, β*) = 0 and ∂g(t, X, β)/∂t |_{(t*, β*)} is negative definite, by the implicit function theorem there exists r_{β*} such that for β ∈ B_{r_{β*}}(β*), the equation g(t, X, β) = 0 has a solution t = t(β).
Therefore t∗ ∈ C.β In particular, since we have shown in (5.16) that when β = β0, there exists t0 such that g(t0,X , β0) = 0, we know that the conclusion in the previous paragraph is true for there is β = β0, i.e., β0 ∈ C. Next, we show that C = Uβ0 , i.e., for any β∗ ∈ Uβ0 , β∗ also belongs to C. Let ρmax = sup {s : β0 + r(β∗ − β0) ∈ C, for all 0 < r < s} , (5.87) By contradiction, we can show that ρmax ≥ 1. Otherwise, suppose ρmax < 1. By equation (5.87), we know that β = β0 + r(β ∗ − β0) ∈ C, for all r < ρmax (5.88) Let {r ∞k}k=1 be a series of increasing positive numbers that are bounded by ρmax, and βk = β ∗ 0 + rk(β − β0), for 0 < rk < ρmax. (5.89) 110 Therefore by (5.88), βk ∈ C, which together with (5.85) - (5.87) implies for each βk, ∑n1 m(Xi, βk) 1 = 0 and 1 + ttr(βk)m(Xi, βk) ≥ (5.90) n 1 + t(βk)trm(Xi=1 i, βk) n Let βmax = lim β ∗ k = β0 + ρmax(β − β0) (5.91) k→∞ then by taking the limit of (5.90) as k goes to ∞, ∑n1 m(Xi, βmax) = 0. (5.92) n 1 + t(β trmax) m(Xi, βmax)i=1 Therefore 0 belongs to int (Convn(βmax)). Consequently, there exists a unique solu- tion to the maximizing problem (5.12) with β = βmax. Since the unique solution is in (0, 1), we know that t(βmax) ∈ Dβmax(X ). Similar to the previous proof, βmax ∈ C. By the definition of C, there exists rmax such that for all β ∈ Brmax(βmax), we know that t = t(β) solves g(t,X, β) = 0. Therefore β = β0 + r(β ∗ − β0) ∈ C, where r = ρmax + rmax/4 > ρmax contradicts the definition of ρmax, hence ρmax ≥ 1 and we conclude that β∗ ∈ C.  Proof of Lemma 5.1.3 First, let us show that (a) is true. By the LLN, we know that statement (ii) is true. Note that ∫ { } dPβ(x)E q(X1, β)m(X1, β) = m(x, β) dPβ dP (x) 0 (x) = 0, β0 by the multivariate central limit theorem, ∑n1 Dβ Sn = √ q(Xi, β)m(Xi, β) −−→0 S ∗∞ ∼ N(0,Σ (β, β0)). (5.93) n i=1 111 Since S∞ follows a multivariate normal distribution, for any ε > 0 and any unit vector t0 ∈ Rp, there exists Kε{such that } tr − εP inf t0 S∞ < Kε < . 
(5.94) t0 2 By (5.93), there∣ ex{ists Nε such that}for an{y n > Nε,∣∣∣ }P inf ttrS < −K − P inf ttrS∞ < −K ∣∣∣∣ ε0 n ε 0 ε < (5.95)t0 t0 2 Combining (5.94) and (5.95), we conclude that P {inf ttrt0 0 Sn < −Kε} < ε, hence the conclusion for (i) is also true. Next, we prove (b) using Lemma 5.3.2. Let Convn(β) be the convex hull of m(Xi, β), then we want to show that the tw{o sets∑ }n A = {0} , B = Convn(β) = vim(Xi, β); v ∈ ∆n i=1 are not separated, whe{re ∑ }n ∆n = v = (v1, . . . , vn) : vi = 1, vi ∈ (0, 1) . i=1 By Lemma 5.3.2, the statement 0 ∈ int (Convn(β)) is equivalent to ∀t0 ∈ Rp with ‖t0‖ = 1 and ∀a ≤ 0, (5.96) ∃z ∈ int (Convn(β)) such that ttr0 z > a. To prove (5.96), we will show in the following paragraphs that for any ε > 0, √ nonpositive constant a = −c/ n and vector t0 ∈ Rp with ‖t0‖ = 1, there exists Nε such that for n{> Nε, given X ∈ B∗ε∑n c ∣∣ } inf P ttr0 Vim(Xi, β) ≥ −√ ∣∣ X > 0, for X ∈ B∗ε , (5.97)t0 n i=1 112 where we consider V = v as a continuous∑ly distributed random vector in ∆n, and zn in equation (5.96) for each t0 is a value Vi(ω)m(Xi, β) for ω in the event where i=1 the probability is positive in (5.97). Note that (5.97) is true if we can prove the case when choosing a ={0, namely, for n > Nε,∑ ∣∣ }n inf P ttr0 Vim(Xi, β) > 0 ∣∣ X > 1− ε, for X ∈ B∗ε (5.98)t0 i=1 ∑n We prove (5.98) using V defined in (5.24). 
Let W ∗ = W ∗i /n and t0 be a unit i=1 vector in Rp{ , then the probabilit}y in (5.98) can be written as∑n ∣∣ P ttr V m(X , β) ≥ 0 ∣{ 0 i i ∣ Xi=1 n ∣ } 1 ∑ ∣ =P tr ∗ ∣{√ t0 Wi m(Xi, β) ≥ 0 ∣ X (5.99)n 1 ∑i=1n 1 ∑n ∣∣ } =P √ ttrq(X , β)m(X , β) + √ ttr ∗{ 0 i i 0 {Wi − q(Xi, β)}m(Xi, β) ≥ 0n n ∣ ∣ X ∑i=1 i=1 ∑ }n n ∣ √1 tr { ∗ − } ≥ −√1 ∣ =P t0 W tr ∣ i q(Xi, β) m(Xi, β) t0 q(Xi, β)m(Xi, β) ∣ X ,n n i=1 i=1 Combining (5.99) and conclusion (i) in (a), we obtain for n > Nε, any nonnegative constant c and K{ε, and X ∈ B∗ε∑ ∣∣ }n P {t tr 0 Vim(Xi, β) ≥ 0 ∣∣ X i= 1 ∑1n ∣∣ } (5.100)≥ P √ ttr0 {W ∗i − q(Xi, β)}m(X , β) ≥ K ∣∣i ε Xn i=1 ∑n Next, we show that the term n−1/2 ttr ∗0 {Wi − q(Xi, β)}m(Xi, β) in (5.100) sat- i=1 isfies equation (5.67) in Lemma 5.3.1, with the choice of δ = 1, Zi = t tr ∗ 0 {Wi − q(Xi, β)}m(Xi, β), 113 by (5.23) { } µ = E {W ∗ − q(X , β)} · ttri { i i 0 m(Xi, β)|Xi }= 0; σ2i = v{ar {W ∗ i − q(X , β)} · ttr tri 0 m}(Xi, β)|Xi = t0 q(X , β) 2m(X , β)⊗2i i t0. ∑n Let s2 = ttr q(X , β)2m(X , β)⊗2n 0 i i t0, then the left hand side of (5.67) becomes i=1 n 1 ∑ { E ‖ {W ∗ } (i − q(Xi, β)} · t tr 0 m(Xi, β))‖ 3|X (3 ) (isn i=1 ∑n ) (5.101) ≤ 2 1 1− · ‖q(Xi, β)m(Xi, β)‖ 3 · √ n 3/2s3n n ni=1 ∑n By Lemma 5.3.1, givenX ∈ B∗ε , it follows that n−1/2 ttr {W ∗0 i − q(Xi, β)}m(Xi, β) i=1 is asymptotically normal, i.e., for any unit vector s ∈ Rp, ∑n −1/2 D ( )n {W ∗ βi − q(Xi, β)} · strm(X , β) −−→0 N 0, str ∗i Σ (β, β0)s , (5.102) i=1 with probability 1. Note that by the strong law of large numbers, as n goes to infinity, W ∗ converges to EW ∗1 = 1 almost surely, therefore for any n > Nε, positive constant{Kε, a[nd X ∈ B ∗ ε 1 ∑ ]n ∣∣ } P inf √ {W ∗i − q(Xi, β)} strm(Xi, β) ≥ K ∣ε ∣ X > 0. 
(5.103)s n i=1 Combining (5.99), (5.100) and (5.103), we get that for any n > Nε, constant a = √ −c/ n, {unit vector t ∈ Rp0 , and∣X ∈}B ∗ ε , ∑n ∣ P tr{t0 Vim(Xi, β) ≥ 0 ∣∣ X i=1 ∑ ∣∣ } (5.104)n ≥ P inf √1 {W ∗ − q(X , β)} stri i m(X , β) ≥ K ∣i ε ∣ X > 0,s∈Rp:‖s‖=1 n i=1 and hence we complete the proof.  114 5.3.2 Proofs Under the EEE Setting In order to prove Proposition 5.2.1, we first prove the following lemma. Lemma 5.3.4 Let {U ni,j}i,j=1 and {Vi}ni=1 be p-dimensional random vectors, and {Zi,j}n ni,j=1 and {Wi}i=1 be random variables with finite means and variances, respec- tively. Suppose that Wi > 0 a.s., and that as n→∞, n n 1 ∑ 1 ∑P P sup (Ui,j − Vi) −→ 0 and sup (Zi,j −Wi) −→ 0, (5.105) i n i n j=1 j=1 If for any i, there exists∥∥a∑constant∥∥b ∈ R such that∥ n∣∣∣∑j=1 Ui,j∥∣∣ ≤ ‖Vi‖b and ≤ b (5.106)n |Wi| j=1 Zi,j∣ then for any i, as n→∞, ∥∥∥∥∑∑ ∥n∥ ∥j=1 Ui,j Vi ∥ Pn − ∥ −→ 0. (5.107)j=1 Zi,j Wi∥ Proof: First, by the triangle inequality and (5.106), we know that for any i, the left hand side of ∥(5.10∑7) must be bounded; al∑so, note that it equates to∥∥∥∥ ∥ W ni j=1(Ui,j − Vi)∑/n− V ni j=1(Zi,j −W )/n∥∥i W n ∥ , i j=1 Zi,j/n ∥ which is bound∥ed∥∥∥∑ by ∥ ∥ ∥ ∥ ∥n ∑ ∥ j∑=1(U ∥ ∥ n ∥ i,j − Vi)/n∥∥∥ ∥∥∥ Vi ∥∥∥ · ∥∥∥ j=∑1(Zi,j −Wi)/n∥n + ∥ (5.108)j=1 Zi,j/n W ni j=1 Zi,j/n ∥ Moreover, notice that by the law of large numbers, for any i = 1, 2, . . . , ∑n1 P Zi,j −→Wi, a.s., (5.109) n j=1 so the conclusion follows by (5.105), (5.106) and (5.109). 
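Lemma 5.3.4 is the ratio-of-sums convergence used repeatedly in the EEE proofs. The following toy numeric sketch (not from the dissertation, and in Python rather than the R used in Chapter 7; scalar case $p = 1$ with illustrative constants $V_i = 2$ and $W_i = 4$) illustrates the conclusion (5.107): the ratio of the two sums approaches $V_i/W_i$ as $n$ grows.

```python
import random

def ratio_error(n, seed=0):
    """Compare (sum_j U_ij) / (sum_j Z_ij) with V_i / W_i for one fixed i.

    Toy scalar case: U_ij = V_i + noise and Z_ij = W_i + noise with W_i > 0,
    so both averages satisfy the uniform convergence (5.105)."""
    rng = random.Random(seed)
    V_i, W_i = 2.0, 4.0                      # fixed "limit" quantities (illustrative)
    U = [V_i + rng.gauss(0, 1.0) for _ in range(n)]
    Z = [W_i + rng.gauss(0, 0.5) for _ in range(n)]
    return abs(sum(U) / sum(Z) - V_i / W_i)  # left-hand side of (5.107)

# The discrepancy shrinks as n grows, as (5.107) asserts.
err_small, err_large = ratio_error(50), ratio_error(50000)
```

The decomposition in the proof explains why: the error splits into two terms, each driven by an average of centered noise, exactly as in (5.108).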
Proof of Proposition 5.2.1 Note that we can write the left-hand side of (5.34) as

\[
\frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2 \Big\{ C(X_i) - \frac{\sum_{j=1}^n C(X_j)\, k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} \Big\}^{\otimes 2} \tag{5.110}
\]
\[
= \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, C(X_i)^{\otimes 2} \tag{5.111}
\]
\[
\quad - \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, \frac{\sum_{j=1}^n C(X_i) C(X_j)^{tr} k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} \tag{5.112}
\]
\[
\quad - \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, \frac{\sum_{j=1}^n C(X_j) C(X_i)^{tr} k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} \tag{5.113}
\]
\[
\quad + \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, \frac{\big\{ \sum_{j=1}^n C(X_j)\, k(X_i,X_j,\beta_0) \big\}^{\otimes 2}}{\big\{ \sum_{j=1}^n k(X_i,X_j,\beta_0) \big\}^2}, \tag{5.114}
\]

and the right-hand side of (5.34) as

\[
\Sigma = E\Big\{ Q(Y_1,\beta_0)^2 \Big[ C(Y_1) - \frac{E\{C(Y_2)k(Y_1,Y_2,\beta_0) \mid Y_1\}}{E\{k(Y_1,Y_2,\beta_0) \mid Y_1\}} \Big]^{\otimes 2} \Big\}
\]
\[
= E\big\{ Q(Y_1,\beta_0)^2\, C(Y_1)^{\otimes 2} \big\} \tag{5.111$'$}
\]
\[
\quad - E\Big\{ Q(Y_1,\beta_0)^2\, \frac{E\{C(Y_1)C(Y_2)^{tr} k(Y_1,Y_2,\beta_0) \mid Y_1\}}{E\{k(Y_1,Y_2,\beta_0) \mid Y_1\}} \Big\} \tag{5.112$'$}
\]
\[
\quad - E\Big\{ Q(Y_1,\beta_0)^2\, \frac{E\{C(Y_2)C(Y_1)^{tr} k(Y_1,Y_2,\beta_0) \mid Y_1\}}{E\{k(Y_1,Y_2,\beta_0) \mid Y_1\}} \Big\} \tag{5.113$'$}
\]
\[
\quad + E\Big\{ Q(Y_1,\beta_0)^2\, \frac{\big[ E\{C(Y_2)k(Y_1,Y_2,\beta_0) \mid Y_1\} \big]^{\otimes 2}}{\big[ E\{k(Y_1,Y_2,\beta_0) \mid Y_1\} \big]^2} \Big\}. \tag{5.114$'$}
\]

By the law of large numbers, (5.111) converges to (5.111$'$) in probability. Now, let us show that (5.112) converges to (5.112$'$) in probability. Note that by adding and subtracting the term

\[
Q(X_i,\beta_0)^2\, \frac{E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}}, \tag{5.115}
\]

we can write (5.112) as the sum of

\[
A = \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2 \Big\{ \frac{\sum_{j=1}^n C(X_i)C(X_j)^{tr} k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} - \frac{E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}} \Big\} \tag{5.112.a}
\]

and

\[
B = \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, \frac{E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}}. \tag{5.112.b}
\]

By the LLN, (5.112.b) converges to (5.112$'$) in probability. Next, we show that (5.112.a) converges to zero in probability. Let

\[
d(X_i,\mathcal{X},\beta_0) = \frac{\sum_{j=1}^n C(X_i)C(X_j)^{tr} k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} - \frac{E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}}. \tag{5.116}
\]

First, we apply Lemma 5.3.4 to (5.116) by setting

\[
U_{i,j} = C(X_i)C(X_j)^{tr} k(X_i,X_j,\beta_0), \qquad V_i = E\{ C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i \},
\]
\[
Z_{i,j} = k(X_i,X_j,\beta_0), \qquad W_i = E\{ k(X_i,Y_2,\beta_0) \mid X_i \}. \tag{5.117}
\]

The assumption (5.106) in Lemma 5.3.4 is guaranteed by Lemma 4.2.1 under assumptions (A.2), (A.9) and (A.10). By (A.2), there exists a constant $b$ such that $\|C(x)\| \le b$ for all $x$, hence

\[
\frac{\|\sum_{j=1}^n U_{i,j}\|}{|\sum_{j=1}^n Z_{i,j}|} = \frac{\|\sum_{j=1}^n C(X_i)C(X_j)^{tr} k(X_i,X_j,\beta_0)\|}{|\sum_{j=1}^n k(X_i,X_j,\beta_0)|} \le b^2,
\]
\[
\frac{\|V_i\|}{|W_i|} = \frac{\| E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\} \|}{| E\{k(X_i,Y_2,\beta_0) \mid X_i\} |} \le b^2\, \frac{E\{k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}} = b^2.
\]

Therefore the conclusion (5.107) of Lemma 5.3.4 holds. Moreover, notice that (i) $C(x)$ and $k(x,y,\beta)$ are both bounded by a fixed constant under assumptions (A.2) and (A.9); (ii) the support of $X_1$ and $U_{\beta_0}$ are both bounded. The convergences in (4.15) and (4.14) actually hold in $L^r$ norm for $r = 1, 2, \dots$; therefore

\[
E\{ \| d(X_i,\mathcal{X},\beta_0) \| \mid X_i \} \xrightarrow{P} 0, \quad \text{for any } i. \tag{5.118}
\]

Note that as a random variable, the left-hand side of (5.118) also satisfies (i) and (ii), hence it also converges in $L^r$ norm, and

\[
E\{ \| d(X_i,\mathcal{X},\beta_0) \| \} \to 0 \quad \text{and} \quad E\{ \| d(X_i,\mathcal{X},\beta_0) \|^2 \} \to 0, \quad \text{for any } i. \tag{5.119}
\]

Recall that under assumption (A.2), $|Q(x,\beta_0)| < M < \infty$ for any $x$, hence

\[
\|A\| = \Big\| \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, d(X_i,\mathcal{X},\beta_0) \Big\| \le M^2 \cdot \frac{1}{n}\sum_{i=1}^n \| d(X_i,\mathcal{X},\beta_0) \|. \tag{5.120}
\]

Therefore for any $\varepsilon > 0$,

\[
P\{ \|A\| > \varepsilon \} \le P\Big\{ M^2 \cdot \frac{1}{n}\sum_{i=1}^n \| d(X_i,\mathcal{X},\beta_0) \| > \varepsilon \Big\}
\le \frac{M^2}{\varepsilon} \Big( E\big\{ \| d(X_1,\mathcal{X},\beta_0) \|^2 \big\} \Big)^{1/2} \to 0, \tag{5.121}
\]

where the last inequality follows from Chebyshev's inequality, and the convergence to zero from (5.119). Hence the conclusion follows by the definition of convergence in probability. $\Box$

Proof of Lemma 5.2.1 First, let us show that (a) is true. Under assumptions (A.2) and (A.9), applying Proposition 5.2.1, we know that $n^{-1}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta_0)^{\otimes 2}$ converges to $\Sigma$ in probability. Therefore the conclusions for (5.37) and (5.39) are true. By Proposition 4.2.2, the conclusion for (5.38) is true. As for the conclusion in (5.36), it is guaranteed by the asymptotic normality of $n^{-1/2}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta_0)$ shown in Lemma 4.2.2.

To prove (b), we follow the steps in the proof of part (b) in Lemma 5.1.1.
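As a numeric sanity check on Proposition 5.2.1, the following sketch (illustrative only, not from the dissertation; Python, scalar case, with the hypothetical choices of a constant kernel $k \equiv 1$, $Q(x) = 1 + x$, $C(x) = x$, and $X \sim \mathrm{Uniform}(0,1)$) compares the plug-in average $n^{-1}\sum_i m_n(X_i,\mathcal{X},\beta_0)^{\otimes 2}$ with its limit. With these choices the conditional-expectation ratio reduces to $EC(X) = 1/2$, and the limit is $E\{(1+X)^2 (X - 1/2)^2\} = 0.2$.

```python
import random

def plugin_variance(n, seed=1):
    """Toy scalar check of Proposition 5.2.1: with the constant kernel
    k(x, y, beta) = 1, the EEE summand reduces to
        m_n(X_i, X, beta) = Q(X_i) * (C(X_i) - mean_j C(X_j)),
    and n^{-1} sum_i m_n(X_i, X, beta)^2 should approach
        Sigma = E[ Q(X)^2 (C(X) - E C(X))^2 ]."""
    rng = random.Random(seed)
    X = [rng.uniform(0, 1) for _ in range(n)]
    Q = lambda x: 1.0 + x                 # illustrative bounded Q
    C = lambda x: x                       # illustrative bounded C
    Cbar = sum(C(x) for x in X) / n       # sample analogue of the ratio term
    m = [Q(x) * (C(x) - Cbar) for x in X]
    return sum(v * v for v in m) / n

# For X ~ Uniform(0,1): Sigma = E[(1+X)^2 (X-1/2)^2] = 0.2 exactly.
```

The constant kernel is chosen only to make the limit computable in closed form; in the dissertation's survival models, $Q$, $C$ and $k$ encode the censored-data structure.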
Recall that by Lemma 5.3.2, the conclusion that $0$ belongs to $\mathrm{int}(\mathrm{Conv}_n(\beta_0))$, the convex hull of the $m_n(X_i,\mathcal{X},\beta_0)$, can be drawn by showing that for any $\varepsilon > 0$ and unit vector $t_0 \in \mathbb{R}^p$ there exists $N_\varepsilon$ such that for $n > N_\varepsilon$, given $\mathcal{X}_\infty \in C^*_\varepsilon$,

\[
\inf_{t_0}\, P\Big\{ t_0^{tr} \sum_{i=1}^n V_i\, m_n(X_i,\mathcal{X},\beta_0) \ge 0 \,\Big|\, \mathcal{X}_\infty \Big\} > 1 - \varepsilon, \quad \text{for } \mathcal{X}_\infty \in C^*_\varepsilon,
\]

where $V = v$ is as constructed in (5.76). Note that with $m(X_i,\beta_0)$ replaced by the term $m_n(X_i,\mathcal{X},\beta_0)$, (5.77) and (5.78) are both true, and (5.79) is true by Proposition 5.2.1 and Lemma 4.2.2. Therefore the conditions for the Lyapunov central limit theorem mentioned in Lemma 5.3.1 are verified with respect to the $\{W_i\}$ variables under the extended estimating equation setting, and given $\mathcal{X}_\infty \in C^*_\varepsilon$, it follows that

\[
n^{-1/2}\, t_0^{tr} \sum_{i=1}^n (W_i - 1)\, m_n(X_i,\mathcal{X},\beta_0)
\]

is asymptotically normal, i.e., for any unit vector $s \in \mathbb{R}^p$,

\[
n^{-1/2} \sum_{i=1}^n (W_i - 1)\, s^{tr} m_n(X_i,\mathcal{X},\beta_0) \xrightarrow{D} N(0,\ s^{tr}\Sigma s), \quad \text{with probability } 1.
\]

Therefore (5.81) and (5.82) are both true with $m(X_i,\beta_0)$ replaced by $m_n(X_i,\mathcal{X},\beta_0)$, and we finish the proof. $\Box$

Proof of Lemma 5.2.2 For $\varepsilon$ smaller than $\lambda_0$, the smallest eigenvalue of $\Sigma$, on the event $C^+_\varepsilon$, for any fixed $\beta \in U_{\beta_0}$,

\[
\Big\| \frac{1}{n}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta)^{\otimes 2} - \Sigma(\beta,\beta_0) \Big\| < \frac{\varepsilon}{2}.
\]

Since $\beta \in U_{\beta_0}$, we know that $\Sigma(\beta,\beta_0)$ is positive definite, hence

\[
\frac{1}{n}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta)^{\otimes 2} \ \text{is positive definite for } \beta \in U_{\beta_0}. \tag{5.122}
\]

We show by contradiction that the conclusion is true for any fixed $\beta \in U_{\beta_0}$. Suppose that for a fixed $\beta \in U_{\beta_0}$ there exist distinct $t_1$ and $t_2$ such that $g(t_1,\mathcal{X},\beta) = g(t_2,\mathcal{X},\beta) = 0$. Then by Rolle's theorem from [15], there exists $s \in (0,1)$ such that for $t_3 = s t_1 + (1-s)t_2$, $\partial g(t_3,\mathcal{X},\beta)/\partial s = 0$, i.e.,

\[
(t_1 - t_2)^{tr} \Big\{ \frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta)^{\otimes 2}}{(1 + t_3^{tr} m_n(X_i,\mathcal{X},\beta))^2} \Big\} (t_1 - t_2) = 0, \tag{5.123}
\]

which implies that $\frac{1}{n}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta)^{\otimes 2}$ is singular, contradicting (5.122). $\Box$
Proof of Theorem 5.2.1 Define $C$ in the following way:

\[
C \equiv \{ \beta^* \in U_{\beta_0} : \exists\, r_{\beta^*} > 0 \text{ such that for any } \beta \in B_{r_{\beta^*}}(\beta^*) \text{ there exists } t = t(\beta) \text{ such that } g(t,\mathcal{X},\beta) = 0 \}, \tag{5.124}
\]

where

\[
g(t,\mathcal{X},\beta) = \frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta)}{1 + t^{tr} m_n(X_i,\mathcal{X},\beta)}. \tag{5.125}
\]

First, we show that for any fixed $\beta^*$ such that the equation $g(t,\mathcal{X},\beta^*) = 0$ has a solution $t = t^*$, we have $\beta^* \in C$. By (5.125), $g(t^*,\mathcal{X},\beta^*) = 0$ implies that $0$ belongs to $\mathrm{int}(\mathrm{Conv}_n(\beta^*))$; consequently, there exists a unique solution to the maximizing problem (5.41) with the choice of $\beta = \beta^*$. Since this unique solution is bounded by $0$ and $1$, i.e.,

\[
\frac{1}{n} \cdot \frac{1}{1 + t^{* tr} m_n(X_i,\mathcal{X},\beta^*)} \in (0,1) \quad \text{for } i = 1,\dots,n,
\]

we know that $t^* \in D_{\beta^*}(\mathcal{X})$. Since $g(t^*,\mathcal{X},\beta^*) = 0$ and $\partial g(t,\mathcal{X},\beta)/\partial t \,|_{(t^*,\beta^*)}$ is negative definite, by the implicit function theorem there exists $r_{\beta^*}$ such that for $\beta \in B_{r_{\beta^*}}(\beta^*)$, the equation $g(t,\mathcal{X},\beta) = 0$ has a solution $t = t(\beta)$. Therefore $\beta^* \in C$.

In particular, since we have shown in (5.45) that when $\beta = \beta_0$ there exists $t_0$ such that $g(t_0,\mathcal{X},\beta_0) = 0$, the conclusion of the previous paragraph holds for $\beta = \beta_0$, i.e., $\beta_0 \in C$.

Next, we show that $C = U_{\beta_0}$, i.e., any $\beta^* \in U_{\beta_0}$ also belongs to $C$. Let

\[
\rho_{\max} = \sup\{ s : \beta_0 + r(\beta^* - \beta_0) \in C \text{ for all } 0 < r < s \}. \tag{5.126}
\]

We show by contradiction that $\rho_{\max} \ge 1$. Suppose instead that $\rho_{\max} < 1$. By equation (5.126),

\[
\beta = \beta_0 + r(\beta^* - \beta_0) \in C \quad \text{for all } r < \rho_{\max}. \tag{5.127}
\]

Let $\{r_k\}_{k=1}^\infty$ be an increasing sequence of positive numbers bounded by $\rho_{\max}$, and let

\[
\beta_k = \beta_0 + r_k(\beta^* - \beta_0), \quad 0 < r_k < \rho_{\max}. \tag{5.128}
\]

Therefore by (5.127), $\beta_k \in C$, which together with (5.124)--(5.126) implies that for each $\beta_k$,

\[
\frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta_k)}{1 + t(\beta_k)^{tr} m_n(X_i,\mathcal{X},\beta_k)} = 0 \quad \text{and} \quad 1 + t(\beta_k)^{tr} m_n(X_i,\mathcal{X},\beta_k) \ge \frac{1}{n}. \tag{5.129}
\]

Let

\[
\beta_{\max} = \lim_{k\to\infty} \beta_k = \beta_0 + \rho_{\max}(\beta^* - \beta_0); \tag{5.130}
\]

then, taking the limit of (5.129) as $k$ goes to $\infty$,

\[
\frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta_{\max})}{1 + t(\beta_{\max})^{tr} m_n(X_i,\mathcal{X},\beta_{\max})} = 0. \tag{5.131}
\]

Therefore $0$ belongs to $\mathrm{int}(\mathrm{Conv}_n(\beta_{\max}))$. Consequently, there exists a unique solution to the maximizing problem (5.41) with $\beta = \beta_{\max}$. Since the unique solution is in $(0,1)$, we know that $t(\beta_{\max}) \in D_{\beta_{\max}}(\mathcal{X})$. As in the previous proof, $\beta_{\max} \in C$. By the definition of $C$, there exists $r_{\max}$ such that for all $\beta \in B_{r_{\max}}(\beta_{\max})$, $t = t(\beta)$ solves $g(t,\mathcal{X},\beta) = 0$. Therefore $\beta = \beta_0 + r(\beta^* - \beta_0) \in C$ with $r = \rho_{\max} + r_{\max}/4 > \rho_{\max}$, which contradicts the definition of $\rho_{\max}$; hence $\rho_{\max} \ge 1$ and we conclude that $\beta^* \in C$. $\Box$

Proof of Theorem 5.2.2 Let the $U_i$ be i.i.d. random variables with a uniform distribution over the span of $t_0^{tr}\mathcal{X}$, independent of $\mathcal{X}_\infty$; hence by definition there exists a constant $M$ such that $|U_i| \le M$. Define

\[
V^*_i = \frac{U^*_i}{\sum_{j=1}^n U^*_j}, \quad \text{where } U^*_i = U_i\, q(X_i,\beta), \quad i = 1,\dots,n. \tag{5.132}
\]

Write

\[
\frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} m_n(X_i,\mathcal{X},\beta)
= \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} V(X_i,\beta)
+ \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} \big( m_n(X_i,\mathcal{X},\beta) - V(X_i,\beta) \big). \tag{5.133}
\]

Note that on the event $C^+_\varepsilon$, by the discussion between pages 100 and 102 and the boundedness of the $U^*_i$, the second term in (5.133) is $O_P(1)$, namely,

\[
\Big| \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} \big( m_n(X_i,\mathcal{X},\beta) - V(X_i,\beta) \big) \Big|
\le \frac{1}{\sqrt n}\sum_{i=1}^n M \cdot \frac{K_\varepsilon}{\sqrt n} = M K_\varepsilon. \tag{5.134}
\]

Therefore on the event $C^+_\varepsilon$, for a fixed $\beta \in U_{\beta_0}$ and any unit vector $t_0 \in \mathbb{R}^p$,

\[
P\Big\{ \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} m_n(X_i,\mathcal{X},\beta) > 0 \Big\}
\ge P\Big\{ \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} V(X_i,\beta) > M K_\varepsilon \Big\}
= P\Big\{ \frac{1}{\sqrt n}\sum_{i=1}^n U_i\, q(X_i,\beta)\, t_0^{tr} V(X_i,\beta) > M K_\varepsilon \Big\}. \tag{5.135}
\]

Using the same procedure as in the proof of Lemma 5.1.3, for $n > N_\varepsilon$ and $\mathcal{X} \in C^+_\varepsilon$, by the Lyapunov central limit theorem mentioned in Lemma 5.3.1,

\[
P\Big\{ \inf_{s \in \mathbb{R}^p : \|s\|=1} \frac{1}{\sqrt n}\sum_{i=1}^n U_i\, q(X_i,\beta)\, t_0^{tr} V(X_i,\beta) > M K_\varepsilon \Big\} > 0. \tag{5.136}
\]

Combining (5.135) and (5.136) yields

\[
P\Big\{ \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} m_n(X_i,\mathcal{X},\beta) > 0 \Big\} > 0. \tag{5.137}
\]

Note that equation (5.137) implies that, under the conditions of this theorem, $0 \in \mathrm{int}(\mathrm{Conv}_n(\beta))$ with the weights $V^*_i$ defined in (5.132) with positive probability. Hence we complete the proof. $\Box$
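The convex-hull arguments in this section all rest on the separating-direction criterion of Lemma 5.3.2 (cf. (5.96)): $0$ lies in the interior of the convex hull of the criterion-function values only when no direction separates it from them. A toy two-dimensional numeric check (illustrative only, not from the dissertation; Python, with a finite grid of directions standing in for all unit vectors $t_0$):

```python
import math, random

def zero_in_hull_2d(points, n_dirs=360):
    """Necessary condition from the separating-direction criterion (5.96):
    0 lies in the interior of the convex hull of `points` only if, for every
    unit direction t0, some point m_i satisfies t0 . m_i > 0.  (In 2-D,
    scanning a fine grid of directions makes this check essentially exact.)"""
    for k in range(n_dirs):
        a = 2 * math.pi * k / n_dirs
        t0 = (math.cos(a), math.sin(a))
        if not any(t0[0] * x + t0[1] * y > 0 for (x, y) in points):
            return False      # t0 separates {0} from the point cloud
    return True

# Centered point clouds typically contain 0 in their hull; shifted ones do not.
rng = random.Random(2)
centered = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
shifted = [(x + 10.0, y) for (x, y) in centered]
```

This is exactly the geometry behind requiring, in (5.98) and its EEE analogue, that every direction $t_0$ admit a random weighting $V$ with positive projection.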
Chapter 6: Empirical Likelihood Applied to Extended Estimating Equations

Consider $d$-dimensional independently identically distributed (i.i.d.) observations

\[
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} P_\theta, \tag{6.1}
\]

with support $\mathcal{X}$, and parameters

\[
\theta = (\beta, \nu) \in \mathbb{R}^p \times \mathcal{H}, \tag{6.2}
\]

where $\mathcal{H}$ is an infinite-dimensional space such as a function space. The Empirical Likelihood (EL) method is an estimation method that maximizes the empirical distribution subject to constraints. For example, in a classic setting, if there exists an estimating function

\[
m(x,\beta) : \mathbb{R}^d \times \mathbb{R}^p \mapsto \mathbb{R}^r \tag{6.3}
\]

such that $E_{\beta_0}\{m(X_1,\beta_0)\} = 0$, then the EL method seeks the solution to the maximization problem, for a fixed $\beta$ in a neighbourhood of $\beta_0$,

\[
\arg\max_p \prod_{i=1}^n p_i, \quad \text{where } p = (p_1,\dots,p_n), \qquad
\text{subject to } \sum_{i=1}^n p_i = 1,\ p_i \in (0,1),\ \sum_{i=1}^n p_i\, m(X_i,\beta) = 0. \tag{6.4}
\]

The solution of (6.4), given by $\hat p_i(\beta, X_i)$, can be used to construct the Profile Empirical Likelihood (pEL). A point estimator for $\beta_0$ can be constructed by maximizing the pEL, or equivalently by minimizing its negative logarithm, i.e.,

\[
\hat\beta_n = \arg\min_\beta l(\beta), \quad \text{where } l(\beta) = -\sum_{i=1}^n \ln\big( n\,\hat p_i(\beta, X_i) \big). \tag{6.5}
\]

The EL method has been extensively researched in the statistical literature. Owen in [32] established the Wilks-type confidence region when the dimension of $\beta_0$ equals the dimension of $m(x,\beta)$, i.e., $r = p$. The Wilks-type Confidence Region (CR) does not require calculating the variance-covariance matrix of $m(X_1,\beta_0)$ or $E\{\nabla_\beta m(X_1,\beta_0)\}$, and can usually provide a narrower CR than Wald-type statistics. Another convenient feature of the EL method is that it can bypass some regularity conditions that are essential for estimating equations. For example, the embedded constraint $m(x,\beta)$ can have higher dimension than the parameter, i.e., $r > p$; this result can be found in Qin and Lawless's work in [33]. Continuity is a necessary condition for the existence of a solution to the estimating equation $S_n(\beta) = 0$, where

\[
S_n(\beta) = \sum_{i=1}^n m(X_i,\beta),
\]

and thus cannot be ignored.
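The inner maximization (6.4) is typically computed through its Lagrange dual: the maximizer has the form $\hat p_i = 1/\{n(1 + t^{tr} m(X_i,\beta))\}$, with $t$ chosen so that the mean-zero constraint holds (see Owen [32]). A minimal scalar sketch (illustrative only, not from the dissertation; Python with toy data, Newton's method on the dual):

```python
def el_weights(m, tol=1e-10, max_iter=100):
    """Solve the inner EL problem (6.4) for scalar constraint values m_i = m(X_i, beta).

    Lagrange duality gives p_i = 1 / (n * (1 + t * m_i)) with t solving
    sum_i m_i / (1 + t * m_i) = 0; we find t by Newton's method.
    Requires 0 in the interior of the convex hull: min(m) < 0 < max(m)."""
    n, t = len(m), 0.0
    for _ in range(max_iter):
        f = sum(mi / (1 + t * mi) for mi in m)                # constraint residual
        fp = -sum(mi * mi / (1 + t * mi) ** 2 for mi in m)    # its derivative in t
        step = f / fp
        t -= step
        if abs(step) < tol:
            break
    return [1.0 / (n * (1 + t * mi)) for mi in m]

# Example: m_i = X_i - beta for a candidate beta inside the data range.
X = [0.3, 1.1, 2.0, 2.7, 3.5]
p = el_weights([x - 2.0 for x in X])
```

At the dual root, the constraints of (6.4) hold automatically: the weights sum to one and the weighted estimating function is zero, which is why only the single equation in $t$ needs to be solved.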
Under the CEE setting, Owen, and Qin and Lawless, both assumed that the criterion function $m(x,\cdot)$ is continuous in [32] and [33]. However, we found that in some prominent right-censored semiparametric models, such as the $\phi$-transformation model and the Cox model, $m_n(x,\mathbf{x},\cdot)$ is usually discontinuous due to the appearance of indicator functions. Therefore in Chapter 5, we show that under the Classic Estimating Equation (CEE) setting with continuous or discontinuous $m(x,\cdot)$, and under the Extended Estimating Equation (EEE) setting with continuous $m_n(x,\mathbf{x},\cdot)$, there exists $U_{\beta_0}$, a non-shrinking neighbourhood of $\beta_0$, such that for any $\beta \in U_{\beta_0}$, (6.4) has a unique solution with probability approaching $1$. The same result can be shown for the EEE when $m_n(X_i,\mathcal{X},\cdot)$ loses continuity, if we restrict to the $\phi$-transformation model discussed in Chapter 3, which to our knowledge is the broadest type of semiparametric model whose structural parameter can be estimated via an EEE.

6.1 Empirical Likelihood of the Extended Estimating Equation

In this section, we consider i.i.d. observations

\[
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} P_\theta, \quad \theta = (\beta,\nu) \in \mathbb{R}^p \times \mathcal{H}, \tag{6.6}
\]

described in (6.1)--(6.2). In Definition 1.2.1 on page 6, we defined the EEE as

\[
S_n(\beta) = \sum_{i=1}^n m_n(X_i,\mathcal{X},\beta) = 0, \quad \beta \in U_{\beta_0}, \tag{6.7}
\]

where $Q(x,\beta) : \mathbb{R}^d \times \mathbb{R}^p \mapsto \mathbb{R}$, $k(x,y,\beta) : \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^p \mapsto \mathbb{R}^+$, $C(x) : \mathbb{R}^d \mapsto \mathbb{R}^p$, and

\[
m_n(X_i,\mathcal{X},\beta) = Q(X_i,\beta) \Big\{ C(X_i) - \frac{\sum_{j=1}^n C(X_j)\, k(X_i,X_j,\beta)}{\sum_{j=1}^n k(X_i,X_j,\beta)} \Big\}, \tag{6.8}
\]

with $E_{\beta_0,\nu}\{ m_n(X_1,\mathcal{X},\beta_0) \} = 0$ for all $\nu \in \mathcal{H}$, and $\tilde\beta_n$ the unique solution to (6.7). In the following sections, we investigate the asymptotic normality of $\hat\beta_n$, the pEL estimator, and compare its asymptotic variance matrix with that of $\tilde\beta_n$.
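For concreteness, the EEE summand (6.8) is straightforward to evaluate: the conditional-expectation ratio is replaced by a ratio of kernel-weighted sums over the sample. A toy sketch (illustrative only, not from the dissertation; Python, with hypothetical choices of $Q$, $C$ and a Gaussian kernel $k$ that do not come from any of the dissertation's survival models, and in which $\beta$ is carried only for the interface):

```python
import math

def eee_summand(i, X, beta, Q, C, k):
    """Evaluate m_n(X_i, X, beta) of (6.8): the i-th EEE summand, in which
    E{C(X_2) k(X_1, X_2, beta) | X_1} / E{k(X_1, X_2, beta) | X_1}
    is replaced by a ratio of sums over the whole sample."""
    num = sum(C(xj) * k(X[i], xj, beta) for xj in X)
    den = sum(k(X[i], xj, beta) for xj in X)
    return Q(X[i], beta) * (C(X[i]) - num / den)

def eee(X, beta, Q, C, k):
    """S_n(beta) of (6.7): the sum of the summands over i."""
    return sum(eee_summand(i, X, beta, Q, C, k) for i in range(len(X)))

# Illustrative (not from the dissertation): Q constant, C the identity,
# Gaussian kernel in x - y.
Q = lambda x, beta: 1.0
C = lambda x: x
k = lambda x, y, beta: math.exp(-0.5 * (x - y) ** 2)
```

With these symmetric toy choices, each summand measures the deviation of $X_i$ from its local kernel-weighted mean, so $S_n$ vanishes on data symmetric about zero, which makes a convenient smoke test.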
6.2 Asymptotic Normality Associated with the EEE

In this section, we discuss the asymptotic normality of $\tilde\beta_n$, obtained by solving the EEE, and of $\hat\beta_n$, obtained by minimizing the negative logarithm of the pEL, i.e.,

\[
\tilde\beta_n:\ \text{the solution to } S_n(\beta) = 0, \ \text{where } S_n(\beta) = \sum_{i=1}^n m_n(X_i,\mathcal{X},\beta); \qquad
\hat\beta_n = \arg\min_\beta \sum_{i=1}^n \ln\big( 1 + t^{tr}(\beta)\, m_n(X_i,\mathcal{X},\beta) \big). \tag{6.9}
\]

First we compute the asymptotic variance matrix of $\sqrt n\, S_n(\beta_0)$, which leads to the sandwich-form variance of $\tilde\beta_n$. Then we compute the variance of $\hat\beta_n$ following the idea of Qin and Lawless's Lemma 1 and Theorem 1 in [33]. At the end of this section, we show that when the dimension $r$ of the EEE equals the dimension of $\beta_0$, the two variances for $\tilde\beta_n$ and $\hat\beta_n$ are identical.

Theorem 6.2.1 Under assumptions (A.1)--(A.10), $n^{-1/2}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta_0) \to N(0,\Sigma)$, where $\Sigma = \Sigma_1 + \Sigma_2$, with

\[
\Sigma_1 = \mathrm{Var}\Big\{ Q(X_1,\beta_0) \Big[ C(X_1) - \frac{E\{C(X_2)k(X_1,X_2,\beta_0) \mid X_1\}}{E\{k(X_1,X_2,\beta_0) \mid X_1\}} \Big] \Big\}, \tag{6.10}
\]
\[
\Sigma_2 = \mathrm{Var}\{ q_1(X_1,\beta_0)\, C(X_1) \} + \mathrm{Var}\{ q_2(X_1,\beta_0) \} - 2\Sigma_{AB},
\]

and

\[
q_1(X_1,\beta_0) = E\Big\{ \frac{Q(X_2,\beta_0)\, k(X_2,X_1,\beta_0)}{E\{k(X_2,X_3,\beta_0) \mid X_2\}} \,\Big|\, X_1 \Big\},
\]
\[
q_2(X_1,\beta_0) = E\Big\{ \frac{Q(X_2,\beta_0)\, k(X_2,X_1,\beta_0)\, E\{C(X_3)k(X_2,X_3,\beta_0) \mid X_2\}}{\big[ E\{k(X_2,X_3,\beta_0) \mid X_2\} \big]^2} \,\Big|\, X_1 \Big\},
\]
\[
\Sigma_{AB} = E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)k(X_1,X_2,\beta_0) - E\{C(X_4)k(X_1,X_4,\beta_0) \mid X_1\} \big\}
\times \frac{Q(X_3,\beta_0)\, k_c(X_3,\beta_0)}{\bar k(X_3,\beta_0)^2} \big\{ k(X_3,X_2,\beta_0) - E\{k(X_3,X_4,\beta_0) \mid X_3\} \big\} \Big\}.
\]

Asymptotic Variance-Covariance of $\tilde\beta_n$ Using the conclusion of Theorem 6.2.1, we can calculate the asymptotic variance matrix of $\tilde\beta_n$. Let $\tilde\beta_n$ be the solution to $S_n(\beta) = 0$. We can derive the sandwich-form asymptotic variance-covariance matrix of $\sqrt n\,(\tilde\beta_n - \beta_0)$ using the same Taylor-expansion method as in Section 5.3 of [41] by van der Vaart, to conclude

\[
\sqrt n\, (\tilde\beta_n - \beta_0) \xrightarrow{D} N(0, \tilde\Sigma), \tag{6.11}
\]

where, for $\Sigma$ defined in Theorem 6.2.1,

\[
\tilde\Sigma = E\{\nabla_\beta m_n(X_1,\mathcal{X},\beta_0)\}^{-1}\, \Sigma\, E\{\nabla_\beta m_n(X_1,\mathcal{X},\beta_0)\}^{-1}. \tag{6.12}
\]

Next, we discuss the asymptotic normality of $\hat\beta_n$. The following lemma is parallel to Lemma 1 of Qin and Lawless in [33].
Lemma 6.2.1 Under assumptions (A.1)--(A.10), as $n \to \infty$, with probability $1$, $l(\beta)$ is minimized at $\hat\beta_n$ in the interior of $\{\beta : \|\beta - \beta_0\| \le n^{-1/3}\}$, with $\hat\beta_n$ and $\hat t_n = t(\hat\beta_n)$ given by

\[
Q_{1n}(\hat\beta_n, \hat t_n) = 0, \qquad Q_{2n}(\hat\beta_n, \hat t_n) = 0, \tag{6.13}
\]

where

\[
Q_{1n}(\beta,t) = \frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta)}{1 + t^{tr} m_n(X_i,\mathcal{X},\beta)}, \qquad
Q_{2n}(\beta,t) = \frac{1}{n}\sum_{i=1}^n \frac{1}{1 + t^{tr} m_n(X_i,\mathcal{X},\beta)} \Big( \frac{\partial m_n(X_i,\mathcal{X},\beta)}{\partial\beta} \Big)^{tr} t.
\]

The proof of this lemma is almost identical to that of Lemma 1 by Qin and Lawless in [33], once the parallel preceding lemmas of Chapter 4 have been developed. With Lemma 6.2.1, we can show the asymptotic normality of $\sqrt n\,(\hat\beta_n - \beta_0)$.

Theorem 6.2.2 Under assumptions (A.1)--(A.10), $\sqrt n\,(\hat\beta_n - \beta_0) \xrightarrow{D} N(0, V)$, where

\[
V = \big( \hat S_{22.1}^{-1}\, \hat S_{21}^{tr}\, \Sigma_1^{-1} \big)\, \Sigma\, \big( \hat S_{22.1}^{-1}\, \hat S_{21}^{tr}\, \Sigma_1^{-1} \big)^{tr}, \tag{6.14}
\]

with $\Sigma$ and $\Sigma_1$ defined in Theorem 6.2.1, $\hat S_{21} = E\{\nabla_\beta m_n(X_i,\mathcal{X},\beta_0)\}$, and

\[
\hat S_{22.1} = \hat S_{21}^{tr}\, \Sigma_1^{-1}\, \hat S_{21}. \tag{6.15}
\]

This is an extension of Theorem 1 by Qin and Lawless in [33] to the EEE setting; we follow their idea of proof after establishing the preceding parallel lemmas in Chapter 4. Now, assume $r = p$ as in all the other chapters of this thesis, where $r$ is the dimension of $m_n(X_i,\mathcal{X},\beta)$ and $p$ is the dimension of $\beta_0$. Also assume that $\hat S_{21} = E\{\nabla_\beta m_n(X_i,\mathcal{X},\beta_0)\}$ is nonsingular for $\beta \in U_{\beta_0}$, where $U_{\beta_0}$ is defined in Theorem 5.2.1. Then (6.14) becomes

\[
V = \hat S_{21}^{-1} \Sigma_1 \hat S_{21}^{-1} \cdot \hat S_{21} \Sigma_1^{-1}\, \Sigma\, \Sigma_1^{-1} \hat S_{21} \cdot \hat S_{21}^{-1} \Sigma_1 \hat S_{21}^{-1} = \hat S_{21}^{-1}\, \Sigma\, \hat S_{21}^{-1}. \tag{6.16}
\]

Comparing (6.12) and (6.16), we see that $V = \tilde\Sigma$, i.e., the asymptotic variance-covariance matrix of $\hat\beta_n$ is the same as that of $\tilde\beta_n$.

6.3 Some Proofs

Proof of Theorem 6.2.1 Note that $\sqrt n\, \bar m_n(\mathcal{X},\beta_0)$ can be split into the difference of

\[
\frac{1}{\sqrt n}\sum_{i=1}^n Q(X_i,\beta_0) \Big\{ C(X_i) - \frac{k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \Big\} \tag{6.17}
\]

and

\[
\frac{1}{\sqrt n}\sum_{i=1}^n Q(X_i,\beta_0) \Big\{ \frac{\sum_{j=1}^n C(X_j)\, k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} - \frac{k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \Big\}. \tag{6.18}
\]

Since (6.17) is a sum of i.i.d. terms, under assumptions (A.1)--(A.3) and (A.5)--(A.6), by the CLT it converges to $N(0,\Sigma_1)$, where

\[
\Sigma_1 = \mathrm{Var}\Big\{ Q(X,\beta_0) \Big[ C(X) - \frac{k_c(X,\beta_0)}{\bar k(X,\beta_0)} \Big] \Big\}. \tag{6.19}
\]

Next, we show that (6.18) is also asymptotically normal. Using the same strategy as in the proof of Lemma 4.2.2, first split (6.18) into the difference of the following two terms:

\[
\frac{1}{\sqrt n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)}{n^{-1}\sum_{j=1}^n k(X_i,X_j,\beta_0)} \cdot \frac{1}{n}\sum_{j=1}^n \big\{ C(X_j)\, k(X_i,X_j,\beta_0) - k_c(X_i,\beta_0) \big\} \tag{6.18.a}
\]

and

\[
\frac{1}{\sqrt n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)\, k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0) \cdot n^{-1}\sum_{j=1}^n k(X_i,X_j,\beta_0)} \cdot \frac{1}{n}\sum_{j=1}^n \big\{ k(X_i,X_j,\beta_0) - \bar k(X_i,\beta_0) \big\}. \tag{6.18.b}
\]

By Lemma 4.2.1, we know that

\[
\sup_{x \in \mathcal{X}} \Big\| \frac{1}{n}\sum_{j=1}^n k(x,X_j,\beta_0) - \bar k(x,\beta_0) \Big\| \xrightarrow{\text{a.s.}} 0.
\]

Therefore for large $n$, (6.18.a) and (6.18.b) equal

\[
A = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \cdot \frac{1}{n}\sum_{j=1}^n \big\{ C(X_j)\, k(X_i,X_j,\beta_0) - k_c(X_i,\beta_0) \big\} \tag{6.18.a.1}
\]

and

\[
B = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)\, k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0)^2} \cdot \frac{1}{n}\sum_{j=1}^n \big\{ k(X_i,X_j,\beta_0) - \bar k(X_i,\beta_0) \big\}, \tag{6.18.b.1}
\]

respectively. Let

\[
d\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i} \quad \text{and} \quad d\mu = dF_X;
\]

then (6.18.a.1) can be rewritten as an integral with respect to the difference between $\mu_n$ and $\mu$, i.e.

\[
\frac{1}{n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \int C(y)\, k(X_i,y,\beta_0)\, \sqrt n\, \{ d\mu_n(y) - d\mu(y) \}. \tag{6.20}
\]

Let

\[
d\gamma_n(y) = \sqrt n\, \{ d\mu_n(y) - d\mu(y) \}.
\]

Recall that for $i = 1,\dots,n$,

\[
\bar k(X_i,\beta_0) = E\{ k(X_i,X,\beta_0) \mid X_i \} = \int k(X_i,x,\beta_0)\, d\mu(x).
\]

Continuing to use the integrated empirical-process notation to replace the sum, we know that (6.20) is equal to

\[
\int\!\!\int \frac{Q(z,\beta_0)}{\int k(z,x,\beta_0)\, d\mu(x)}\, C(y)\, k(z,y,\beta_0)\, d\mu_n(z)\, d\gamma_n(y). \tag{6.21}
\]

Note that, as $n \to \infty$, $\mu_n \to \mu$, which indicates that the integral in (6.21) approaches

\[
\int\!\!\int \frac{Q(z,\beta_0)\, C(y)\, k(z,y,\beta_0)}{\int k(z,x,\beta_0)\, d\mu(x)}\, d\mu(z)\, d\gamma_n(y). \tag{6.22}
\]

Let

\[
q_1(y,\beta_0) = \int \frac{Q(z,\beta_0)\, k(z,y,\beta_0)}{\int k(z,x,\beta_0)\, d\mu(x)}\, d\mu(z); \tag{6.23}
\]

then (6.22) equals

\[
\int q_1(y,\beta_0)\, C(y)\, d\gamma_n(y), \tag{6.24}
\]

which converges to $N\big(0, \mathrm{Var}\{ q_1(Y,\beta_0)\, C(Y) \}\big)$. Note that for large $n$, (6.18.a) is equal to (6.18.a.1) almost surely. Therefore, combining (6.20)--(6.24), we conclude that (6.18.a) converges to $N\big(0, \mathrm{Var}\{ q_1(Y,\beta_0)\, C(Y) \}\big)$.
Similarly, (6.18.b) can be rewritten as

\[
\int\!\!\int Q(z,\beta_0)\, k(z,y,\beta_0)\, \frac{\int k(z,x,\beta_0)\, C(x)\, d\mu(x)}{\big\{ \int k(z,x,\beta_0)\, d\mu(x) \big\}^2}\, d\mu(z)\, d\gamma_n(y), \tag{6.25}
\]

which converges to $N\big(0, \mathrm{Var}\{ q_2(Y,\beta_0) \}\big)$, where

\[
q_2(y,\beta_0) = \int Q(z,\beta_0)\, k(z,y,\beta_0)\, \frac{\int k(z,x,\beta_0)\, C(x)\, d\mu(x)}{\big\{ \int k(z,x,\beta_0)\, d\mu(x) \big\}^2}\, d\mu(z). \tag{6.26}
\]

It remains to show the joint normality and the asymptotic covariance of (6.18.a) and (6.18.b), or equivalently the covariance of $A$ and $B$ defined in (6.18.a.1) and (6.18.b.1). Note that

\[
E(A \cdot B) = \frac{1}{n^3} \sum_{i=1}^n \sum_{i'=1}^n \sum_{j=1}^n \sum_{j'=1}^n
E\Big\{ \frac{Q(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \big\{ C(X_j)\, k(X_i,X_j,\beta_0) - k_c(X_i,\beta_0) \big\}
\times \frac{Q(X_{i'},\beta_0)\, k_c(X_{i'},\beta_0)}{\bar k(X_{i'},\beta_0)^2} \big\{ k(X_{i'},X_{j'},\beta_0) - \bar k(X_{i'},\beta_0) \big\} \Big\}. \tag{6.27}
\]

Let us consider the following cases, broken down according to the number of distinct elements in $I = \{i, j, i', j'\}$.

Case 1 If the number of distinct elements in $I$ is $1$, then there are $n$ identical terms to be added.

Case 2 If the number of distinct elements in $I$ is $2$, then there are $(C_4^1 + C_4^2)\, n(n-1)$ identical terms to be added.

Case 3 If the number of distinct elements in $I$ is $3$, then there are $C_4^1 \cdot n(n-1)(n-2)$ identical terms to be added.

Case 4 If the number of distinct elements in $I$ is $4$, then, due to the independence of the $X_i$'s, the expected value of the summand in (6.27) is $0$.

Since (6.27) is the summation of groups of identical terms divided by $n^3$, we only need to consider Case 3. Assume that $i \neq j$ and $i' \neq j'$ as we did before, and consider the following subcases of Case 3.

Case 3a $i = i'$: the expected values in (6.27) are all equal to

\[
E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)\, k(X_1,X_2,\beta_0) - k_c(X_1,\beta_0) \big\}
\times \frac{Q(X_1,\beta_0)\, k_c(X_1,\beta_0)}{\bar k(X_1,\beta_0)^2} \big\{ k(X_1,X_3,\beta_0) - \bar k(X_1,\beta_0) \big\} \Big\}. \tag{6.28}
\]

We can show that (6.28) is identically $0$ by conditioning on $X_1$.
Case 3b $i = j'$: the expected values in (6.27) are all equal to

\[
E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)\, k(X_1,X_2,\beta_0) - k_c(X_1,\beta_0) \big\}
\times \frac{Q(X_3,\beta_0)\, k_c(X_3,\beta_0)}{\bar k(X_3,\beta_0)^2} \big\{ k(X_3,X_1,\beta_0) - \bar k(X_3,\beta_0) \big\} \Big\}. \tag{6.29}
\]

We can show that (6.29) is identically $0$ by conditioning on $X_1$ and $X_3$.

Case 3c $j = i'$: the expected values in (6.27) are all equal to

\[
E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)\, k(X_1,X_2,\beta_0) - k_c(X_1,\beta_0) \big\}
\times \frac{Q(X_2,\beta_0)\, k_c(X_2,\beta_0)}{\bar k(X_2,\beta_0)^2} \big\{ k(X_2,X_3,\beta_0) - \bar k(X_2,\beta_0) \big\} \Big\}. \tag{6.30}
\]

We can show that (6.30) is identically $0$ by conditioning on $X_1$ and $X_2$.

Case 3d $i = 1,\ j = 2,\ i' = 3,\ j' = 2$: the expected values in (6.27) are all equal to

\[
\Sigma_{AB} = E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)\, k(X_1,X_2,\beta_0) - k_c(X_1,\beta_0) \big\}
\times \frac{Q(X_3,\beta_0)\, k_c(X_3,\beta_0)}{\bar k(X_3,\beta_0)^2} \big\{ k(X_3,X_2,\beta_0) - \bar k(X_3,\beta_0) \big\} \Big\}. \tag{6.31}
\]

Combining (6.18.a), (6.24) and (6.26), we know that (6.18) converges to $N(0,\Sigma_2)$, where

\[
\Sigma_2 = \mathrm{Var}\{ q_1(Y,\beta_0)\, C(Y) \} + \mathrm{Var}\{ q_2(Y,\beta_0) \} - 2\Sigma_{AB}. \tag{6.32}
\]

As for the asymptotic covariance of (6.17) and (6.18), consider

\[
\frac{1}{n} \sum_{i=1}^n \sum_{i'=1}^n E\Big\{ Q(X_i,\beta_0) \Big[ C(X_i) - \frac{k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \Big] \tag{6.33}
\]
\[
\qquad \times\, Q(X_{i'},\beta_0) \Big[ \frac{(n-1)^{-1}\sum_{j \neq i} C(X_j)\, k(X_{i'},X_j,\beta_0)}{(n-1)^{-1}\sum_{j \neq i} k(X_{i'},X_j,\beta_0)} - \frac{k_c(X_{i'},\beta_0)}{\bar k(X_{i'},\beta_0)} \Big] \Big\}. \tag{6.34}
\]

When $i \neq i'$, the terms in (6.33) and (6.34) are independent, hence the expected value of the product of (6.33) and (6.34) equals the product of their expected values. Furthermore, notice that (6.33) is centered at zero; therefore when $i \neq i'$, the expectation of the product (6.33)--(6.34) is zero. Next, we consider the case $i = i'$.
Since there are n identical terms, we can rewrite the expectation of the (6.33)-(6.34) as { { } − kc(X1, β0)E Q(X1, β0) C{(X1) ∑ k̄(X1, β0) }} n−1 nj∑=6 iC(Xj)k(X1, Xj, β× 0) kc(X1, β0)Q(X1, β0) n− n − 1 j=6 i k(X1, Xj, β0) k̄(X1, β0) 134 ∑ Since by∑strong law of large numbers, as n goes to infinity, n −1 n j=1 C(Xj)k(X1, Xj, β0) and n−1 nj=1 k(X1, Xj, β0) goes to kc(X1, β0) and k̄(X1, β0), respectively, then by dominated convergence theorem, the expected value in the expression above also approach zero as n goes to infinity. Therefore (6.17) and (6.18) are asymptotically √ independent, and nm̄n(X, β0)→ N(0,Σ) with Σ = Σ1 + Σ2.  Proof of Lemma 6.2.1 Let β be on the surface of the ball centered at β0 and with radius n−1/3. Hence for a unit vector u, β = β0 + un −1/3. Next, we give a lower bound for l(β) on the surface of the ball. Let υ ∈ Rp be a unit vector an{d t = ‖t‖υ. Owen in}[32] showed that the (6.35) holds uniformly for values of β ∈ β : ‖β − β0‖ ≤ n−1/3 under the CEE setting, i.e., when the criterion function is given by[m(Xi, β)∑ ] [ ]n −1 n1 1 ∑ t(β) = m(X , β)⊗2 m(X , β) + o(n−1/3i i ), (6.35) n n { i=1 } i=1 uniformly about β ∈ β : ‖β − β ‖ ≤ n−1/30 . It is essential that (6.35) is still true in the EEE setting, therefore we prove it following the steps in Qin and Lawless [33] and Owen [32]. Since we have shown the parallel preceding lemmas in Chapter 4, the steps are not much different than that in [33] or [32]. Let Yi(X tr i,X , β) = t mn(Xi,X , β) and substitute 1 = 1− Yi(Xi,X , β) (6.36) 1 + Yi(Xi,X , β) 1 + Yi(Xi,X , β) 135 Note that ∑n1 mn(Xi,X , β) = 0 (6.37) n 1 + ttrm (X ,X, β) i=1 n i can be re-written as 1 ∑n mn(Xi,X , β) = 0 (6.38) n 1 + Yi(Xi,X , β)i=1 Multiply (6.38) by υ, then plug i(n (6.36), then we get∑n )1 Yi(Xi,X , β) 0 = υtrmn(Xi,X , β) 1− n ∑ ∑1 + Yi(Xi,X , β)i=1n n1 υtrtr − mn(Xi,X , β)m trn(Xi,X , β) t= υ mn(Xi,X , β) . 
(6.39) n 1 + Yi(Xi,X , β)i=1 i=1 Define n 1 ∑ m (X ,X, β)⊗2n i S̃(X, β) = n∑ 1 + Yi(Xi,X , β)i=1 (6.40)n1 S(X, β) = m (X ,X, β)⊗2n i , n i=1 Plug (6.40) into (6.39), yielding υtrm̄n(X, β) = ‖t‖υtrS̃(X, β)υ (6.41) By the definition of S̃ in (6.40), 1 ∑n≥ mn(X ,X, β)⊗2i 1S̃(X, β) = S(X, β) · , n 1 + max Y i=1 i i (Xi,X , β) 1 + maxi Yi(Xi,X , β) implying S(X, β) ≤ S̃(X, β)(1 + maxYi(Xi,X , β)), i where the notation that A ≥ B for matrix A and B means that A − B is positive definite. We know that . ‖t‖υtrS(X, β)υ ≤ ‖t‖υtrS̃(X, β)υ(1 + maxYi(Xi,X , β)). (6.42) i 136 Let Z∗n(X, β) = maxi ‖mn(Xi,X , β)‖ in (4.7), so ‖t‖υtrS̃(X, β)υ(1 + maxYi(Xi,X , β)) ≤ ‖t‖υtrS̃(X, β)υ(1 + ‖t‖Z∗n(X, β)) (6.43) i Now, by (6.41), ‖t‖υtrS̃(X, β)υ on the right hand side of (6.43) can be substituted by υtrm̄n(X, β), hence ‖t‖υtrS̃(X, β)υ(1 + maxYi(Xi,X , β)) ≤ υtrm̄n(X, β)(1 + ‖t‖Z∗n(X, β)), (6.44) i where ∑n1 m̄n(X, β) = mn(Xi,X , β). (6.45) n i=1 Combining (6.42) and (6.44), we attain ‖t‖υtrS(X, β)υ ≤ υtrm̄n(X, β)(1 + ‖t‖Z∗n(X, β)), (6.46) which result in ‖t‖(υtrS(X, β)υ − Z∗n(X, β)υtrm̄n(X, β)) ≤ υtrm̄n(X, β). (6.47) Note that we assume t = ‖t‖υ and υ is a unit vector. By Proposition 4.2.1 and Lemma 4.2.2, we know that Z∗(X, β)υtrm̄ (X, β) = o(n1/2)O(nδ−1/2) = o(nδn n ). (6.48) Plug (6.48) into (6.47), yielding ‖t‖(υtrS(X, β)υ + o(nδ)) ≤ O(nδ−1/2) (6.49) Since υtrSυ is bounded by the minimum and maximum eigen value of V ar(mn(Xi,X , β)), (6.49) indicates that ‖t‖ = O(n−1/2). (6.50) 137 From (6.38) we can see that ∑n ( )1 2 0 = mn(Xi,X , β) 1− Yi(Xi,X , β) Yi(Xi,X , β) + n ∑ 1 + Yi(Xi,X , β)i=1 n − 1 mn(Xi,X , β)= m̄n(X, β) S(X, β)t+ Yi(X 2i,X , β) . (6.51) n 1 + Y i=1 i (Xi,X , β) Now, let us discuss the norm of the last term in (6.51). Recall that Yi(Xi,X , β) = ttr∥ mn(Xi, ,β), therefore∥∥∥ ∥ ∥ ∥n n∥ 1 ∑ mn(Xi,X , β) ∥ ∥ ∑2∥ ∥ 1 mn(Xi,X , β) ( ) ∥Y tr 2∥i(Xi,X , β) ∥ = ∥ t mn(Xi, ,β) ∥n 1 + Yi(Xi,X , β) ∥ ∥n∑ 1 + Yi(Xi,X , β) ∥i=1 i=1n ≤ 1 ‖ 1mn(Xi,X , β0)‖3‖t‖2 . 
Plugging $Z_n^*(\mathcal X,\beta) = \max_i \|m_n(X_i,\mathcal X,\beta)\|$ into (6.52) yields
$$ \frac1n \sum_{i=1}^n \|m_n(X_i,\mathcal X,\beta_0)\|^3\, \|t\|^2\, \frac{1}{1+Y_i} \le Z_n^*(\mathcal X,\beta) \cdot \frac1n \sum_{i=1}^n \|m_n(X_i,\mathcal X,\beta)\|^2 \cdot \|t\|^2 \cdot \frac{1}{1+Y_i}. \qquad (6.53) $$
Let $\delta$ be a number between 0 and 1/2. By Proposition 4.2.1, Lemma 4.2.2 and (6.50), we know that
$$ \frac1n \sum_{i=1}^n \|m_n(X_i,\mathcal X,\beta_0)\|^3\, \|t\|^2\, \frac{1}{1+Y_i} \le o(n^{1/2})\,O(n^{\delta})\,O(n^{-1}) = o(n^{-(1/2-\delta)}). \qquad (6.54) $$
Equations (6.52)-(6.54) imply that the norm of the last term in (6.51) is of order $n^{-(1/2-\delta)}$; therefore from (6.51) we know that
$$ t = S^{-1}(\mathcal X,\beta)\,\bar m_n(\mathcal X,\beta) + \gamma, \quad \text{where } \gamma = o(n^{-1/3}). \qquad (6.55) $$
Let $l(\beta)$ be the negative logarithm of the profile empirical likelihood function,
$$ l(\beta) = \sum_{i=1}^n \ln\bigl\{1 + t^{\mathrm{tr}}(\beta) m_n(X_i,\mathcal X,\beta)\bigr\} = \sum_{i=1}^n t^{\mathrm{tr}}(\beta) m_n(X_i,\mathcal X,\beta) - \frac12 \sum_{i=1}^n \bigl[t^{\mathrm{tr}}(\beta) m_n(X_i,\mathcal X,\beta)\bigr]^2 + o(n^{1/3}) \quad \text{a.s.} \qquad (6.56) $$
By (6.55), we can rewrite the right-hand side of (6.56) as
$$ \frac n2 \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)\right]^{\mathrm{tr}} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)^{\otimes 2}\right]^{-1} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)\right] + o(n^{1/3}). \qquad (6.57) $$
Taking a Taylor expansion of (6.57) around the true parameter value $\beta_0$, we obtain
$$ \frac n2 \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0) + \frac1n \sum_{i=1}^n \frac{\partial m_n(X_i,\mathcal X,\beta_0)}{\partial\beta}\, u n^{-1/3}\right]^{\mathrm{tr}} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)^{\otimes 2}\right]^{-1} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0) + \frac1n \sum_{i=1}^n \frac{\partial m_n(X_i,\mathcal X,\beta_0)}{\partial\beta}\, u n^{-1/3}\right] + o(n^{1/3}). \qquad (6.58) $$
Combining (6.56)-(6.58), by Lemma 4.2.2 we know that
$$ l(\beta) = \frac n2 \left[O(n^{-1/2+\delta}) + E\left(\frac{\partial m(X,Y,\beta_0)}{\partial\beta}\right) u n^{-1/3}\right]^{\mathrm{tr}} \Sigma_1^{-1} \left[O(n^{-1/2+\delta}) + E\left(\frac{\partial m(X,Y,\beta_0)}{\partial\beta}\right) u n^{-1/3}\right] + o(n^{1/3}) \ge (c-\varepsilon)\, n^{1/3} \quad \text{a.s.}, $$
where $0 < \delta < 1/6$ and
$$ \Sigma_1 = E\left\{ Q(Y_1,\beta_0)\left[C(Y_1) - \frac{E\{C(Y_2)k(Y_1,Y_2,\beta_0)\mid Y_1\}}{E\{k(Y_1,Y_2,\beta_0)\mid Y_1\}}\right] \right\}^{\otimes 2}, \qquad (6.59) $$
with $Y_1, Y_2$ being i.i.d. replicas of $X_1$, by Proposition 5.2.1. Similarly,
$$ l(\beta_0) = \frac n2 \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0)\right]^{\mathrm{tr}} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0)^{\otimes 2}\right]^{-1} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0)\right] = O(n^{\delta}) \quad \text{a.s.} $$
Since $l(\beta)$ is continuous for $\beta \in \{\beta : \|\beta-\beta_0\| \le n^{-1/3}\}$, there exists a minimum of $l(\beta)$ in the interior of the ball, and $\hat\beta_n$ satisfies
$$ \left.\frac{\partial l(\beta)}{\partial\beta}\right|_{\beta=\hat\beta_n} = \left.\sum_{i=1}^n \frac{(\partial t^{\mathrm{tr}}(\beta)/\partial\beta)\, m_n(X_i,\mathcal X,\beta) + t^{\mathrm{tr}}(\beta)\,(\partial m_n(X_i,\mathcal X,\beta)/\partial\beta)}{1 + t^{\mathrm{tr}}(\beta)\, m_n(X_i,\mathcal X,\beta)}\right|_{\beta=\hat\beta_n} = \left.\sum_{i=1}^n \frac{1}{1 + t^{\mathrm{tr}}(\beta)\, m_n(X_i,\mathcal X,\beta)} \left(\frac{\partial m_n(X_i,\mathcal X,\beta)}{\partial\beta}\right)^{\mathrm{tr}} t(\beta)\right|_{\beta=\hat\beta_n} = 0. \quad \Box $$

Proof of Theorem 6.2.2. Taking the derivatives of $Q_{1n}$ and $Q_{2n}$ in Lemma 6.2.1 with respect to $\beta$ and $t$, we get
$$ \frac{\partial Q_{1n}(\beta,0)}{\partial\beta} = \frac1n \sum_{i=1}^n \frac{\partial m_n(X_i,\mathcal X,\beta)}{\partial\beta}, \qquad \frac{\partial Q_{1n}(\beta,0)}{\partial t^{\mathrm{tr}}} = -\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)^{\otimes 2}, $$
$$ \frac{\partial Q_{2n}(\beta,0)}{\partial\beta} = 0, \qquad \frac{\partial Q_{2n}(\beta,0)}{\partial t^{\mathrm{tr}}} = \frac1n \sum_{i=1}^n \left(\frac{\partial m_n(X_i,\mathcal X,\beta)}{\partial\beta}\right)^{\mathrm{tr}}. $$
Then take the Taylor expansions of $Q_{1n}(\hat\beta_n,\hat t_n)$ and $Q_{2n}(\hat\beta_n,\hat t_n)$ at $(\beta_0,0)$:
$$ 0 = Q_{1n}(\hat\beta_n,\hat t_n) = Q_{1n}(\beta_0,0) + \frac{\partial Q_{1n}(\beta_0,0)}{\partial\beta}(\hat\beta_n-\beta_0) + \frac{\partial Q_{1n}(\beta_0,0)}{\partial t^{\mathrm{tr}}}(\hat t_n - 0) + o_P(\delta_n), $$
$$ 0 = Q_{2n}(\hat\beta_n,\hat t_n) = Q_{2n}(\beta_0,0) + \frac{\partial Q_{2n}(\beta_0,0)}{\partial\beta}(\hat\beta_n-\beta_0) + \frac{\partial Q_{2n}(\beta_0,0)}{\partial t^{\mathrm{tr}}}(\hat t_n - 0) + o_P(\delta_n), $$
where $\delta_n = \|\hat\beta_n-\beta_0\| + \|\hat t_n\|$. Hence
$$ \begin{pmatrix} \hat t_n \\ \hat\beta_n - \beta_0 \end{pmatrix} = \hat S^{-1} \begin{pmatrix} -Q_{1n}(\beta_0,0) + o_P(\delta_n) \\ o_P(\delta_n) \end{pmatrix}, $$
where
$$ \hat S = \begin{pmatrix} \partial Q_{1n}/\partial t^{\mathrm{tr}} & \partial Q_{1n}/\partial\beta \\ \partial Q_{2n}/\partial t^{\mathrm{tr}} & 0 \end{pmatrix}_{(\beta=\beta_0,\,t=0)} \xrightarrow{P} \begin{pmatrix} \hat S_{11} & \hat S_{12} \\ \hat S_{21} & 0 \end{pmatrix} = \begin{pmatrix} -E\{m_n m_n^{\mathrm{tr}}\} & E\{\partial m_n/\partial\beta\} \\ E\{\partial m_n/\partial\beta\}^{\mathrm{tr}} & 0 \end{pmatrix}, \qquad (6.60) $$
where in (6.60), $m_n = m_n(X_i,\mathcal X,\beta_0)$. Next,
$$ Q_{1n}(\beta_0,0) = \frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0) = O_P(n^{-1/2}) $$
implies that $\delta_n = O_P(n^{-1/2})$. Therefore
$$ \sqrt n(\hat\beta_n - \beta_0) = \hat S_{22.1}^{-1}\, \hat S_{21}\, \hat S_{11}^{-1}\, \sqrt n\, Q_{1n}(\beta_0,0) + o_P(1), $$
where
$$ S_{22.1} = E\left\{\frac{\partial m_n}{\partial\beta}\right\}^{\mathrm{tr}} \bigl(E\, m_n m_n^{\mathrm{tr}}\bigr)^{-1} E\left\{\frac{\partial m_n}{\partial\beta}\right\}. $$
Furthermore, by Lemma 6.2.1, $\sqrt n\, Q_{1n}(\beta_0,0) \to N(0,\tilde\Sigma)$, hence
$$ \sqrt n(\hat\beta_n - \beta_0) \to N(0, V), \quad \text{where } V = \hat S_{22.1}^{-1}\, \hat S_{21}\, \hat S_{11}^{-1}\, \tilde\Sigma\, \bigl\{\hat S_{22.1}^{-1}\, \hat S_{21}\, \hat S_{11}^{-1}\bigr\}^{\mathrm{tr}}. \qquad (6.61) \quad \Box $$

Chapter 7: Computational Results

This chapter presents simulation results under the right-censored Accelerated Failure Time (AFT) model, using the Extended Estimating Equation (EEE) and the Profile Empirical Likelihood (pEL), implemented in R (3.4.1).
Let $T$ be the lifetime, and consider the model
$$ Y = \beta_0^{\mathrm{tr}} Z + \varepsilon, \qquad (7.1) $$
where $Y = \ln(T)$, $\beta_0 \in \mathbb R^p$ is the structural parameter, $Z$ is the $p$-dimensional covariate, and $\varepsilon$ is the error term centered at zero with unknown hazard rate function. We assume that $T$ may be subject to right censoring by $C$ with unknown distribution function; therefore the data we actually observe are triplets $(V, \Delta, Z)$, where
$$ V = \min(\ln(T), \ln(C)), \qquad \Delta = I\{T \le C\}. \qquad (7.2) $$
In previous chapters, we proposed two ways to construct an estimator for $\beta_0$: $\tilde\beta_n$, obtained by solving the Extended Estimating Equation (EEE), and $\hat\beta_n$, obtained by minimizing the negative logarithm of the pEL, i.e.,
$$ \text{EEE:} \quad \tilde\beta_n \text{ such that } S_n(\tilde\beta_n) = \sum_{i=1}^n m_n(X_i,\mathcal X,\tilde\beta_n) = 0; \qquad (7.3) $$
$$ \text{pEL:} \quad \hat\beta_n = \arg\min_\beta l(\beta), \quad \text{where } l(\beta) = \sum_{i=1}^n \ln\bigl(1 + t^{\mathrm{tr}}(\beta)\, m_n(X_i,\mathcal X,\beta)\bigr), \qquad (7.4) $$
where $t(\beta) = t(\beta,\mathcal X)$ is the solution to the gradient of the Lagrangian function defined in (5.50), and under model assumption (7.1),
$$ m_n(X_i,\mathcal X,\beta) = \Delta_i \left( Z_i - \frac{\sum_{j=1}^n Z_j\, I\{V_j - \beta^{\mathrm{tr}} Z_j \ge V_i - \beta^{\mathrm{tr}} Z_i\}}{\sum_{j=1}^n I\{V_j - \beta^{\mathrm{tr}} Z_j \ge V_i - \beta^{\mathrm{tr}} Z_i\}} \right). \qquad (7.5) $$
The purpose of this chapter is to provide algorithms to numerically compute $\tilde\beta_n$ and $\hat\beta_n$. Since both $S_n(\beta)$ and $l(\beta)$ involve indicator functions, the usual root-finding and optimization methods may not be directly applicable. We then compare (7.3) and (7.4) with respect to the asymptotic behavior of $\tilde\beta_n$ and $\hat\beta_n$ under different parameter settings, the time-efficiency of the two methods, and the local continuity of $S_n(\beta)$ and $l(\beta)$. We also compare the empirical variance-covariance matrices of $\tilde\beta_n$ and $\hat\beta_n$ with the corresponding theoretical ones, and with each other.

7.1 Description of the Algorithm Associated with the EEE

In this section, we outline the algorithms for computing $\hat\beta_n$ and $\tilde\beta_n$. To summarize, we compute $\hat\beta_n$ in two steps: first, construct a function $t(\beta)$ that expresses the Lagrange multiplier in terms of $\beta$; second, calculate $\hat\beta_n$ by maximizing the pEL.
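The AFT criterion function (7.5) can be computed directly from the observed triplets. The dissertation's simulations are written in R; the following is a minimal Python sketch under the same notation, with `m_n` and `S_n` as hypothetical helper names.

```python
import numpy as np

def m_n(i, V, Delta, Z, beta):
    """Criterion-function summand m_n(X_i, X, beta) from (7.5).

    V, Delta, Z hold the observed triplets (V_j, Delta_j, Z_j); Z has
    shape (n, p).  On the residual scale e_j = V_j - beta' Z_j, the
    summand compares Z_i with the average of Z_j over the at-risk set."""
    e = V - Z @ beta                     # residuals on the log-time scale
    at_risk = (e >= e[i])                # indicators I{e_j >= e_i}
    num = (Z * at_risk[:, None]).sum(axis=0)
    den = at_risk.sum()                  # >= 1, since subject i is at risk of itself
    return Delta[i] * (Z[i] - num / den)

def S_n(V, Delta, Z, beta):
    """EEE criterion S_n(beta) = sum_i m_n(X_i, X, beta), cf. (7.3)."""
    return sum(m_n(i, V, Delta, Z, beta) for i in range(len(V)))
```

Note that a censored observation ($\Delta_i = 0$) contributes a zero summand, but still enters the at-risk sets of the uncensored observations.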
We compute $\tilde\beta_n$ by minimizing the Euclidean norm of $S_n(\beta)$ because, due to the discontinuity of $S_n(\beta)$, an exact root may not exist.

7.1.1 Algorithm for Computing the pEL Estimator

Given the AFT model described in (7.1), for fixed $\beta$, to find the probability vector that maximizes the empirical likelihood, we solve the following problem:
$$ \arg\max_p \prod_{i=1}^n p_i, \quad p = (p_1,\dots,p_n), \quad \text{subject to } \sum_{i=1}^n p_i = 1, \; p_i \in (0,1), \; \sum_{i=1}^n p_i\, m_n(X_i,\mathcal X,\beta) = 0. \qquad (7.6) $$
We proved in Chapter 5 that with probability approaching 1, there exists a neighbourhood of $\beta_0$ in which (7.6) has a unique solution given by
$$ \hat p_i = \frac1n \cdot \frac{1}{1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)}, \qquad (7.7) $$
where $t$ is the solution to
$$ \sum_{i=1}^n \frac{m_n(X_i,\mathcal X,\beta)}{1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)} = 0. \qquad (7.8) $$
Therefore the algorithm to compute $\hat\beta_n$ has two steps. First, write $t$ as a function of $\beta$ and $\mathcal X$ according to (7.8); then, combining (7.1) and (7.7), $\hat\beta_n$ is given by
$$ \hat\beta_n = \arg\min_\beta \sum_{i=1}^n \ln\bigl(1 + t(\beta)^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr). \qquad (7.9) $$
In the following paragraphs, we explain how these two steps are performed in R (3.4.1).

Step 1: Construct $t(\beta)$. To construct $t(\beta)$ we convert the root-finding problem (7.8) into an optimization problem. For any fixed $\beta$, let
$$ t^* = \arg\max_t f(t), \quad \text{where } f(t) = \sum_{i=1}^n \ln\bigl(1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr). \qquad (7.10) $$
Since we seek an interior maximum of (7.10), the solution $t^*$ must satisfy
$$ \nabla_t f(t^*) = 0, \quad \text{where } \nabla_t f(t) = \sum_{i=1}^n \frac{m_n(X_i,\mathcal X,\beta)}{1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)}. \qquad (7.11) $$
In other words, for any fixed $\beta$, we can construct $t(\beta)$ by maximizing (7.10) using the nlm function built into R. For any fixed $\beta$ and $\mathcal X$, it is easy to verify that $\nabla_t^{\otimes 2} f(t)$ is always negative definite, so the solution of (7.11), if it exists, is guaranteed to be a maximum. Since $t_{\text{initial}} = (0,\dots,0)$ always provides a legitimate probability vector in (7.7), we use $t_{\text{initial}}$ as the initial value for nlm in this step. We also need to pay attention to the domain of $f(t)$, denoted by $D_f$.
For the univariate case, it is easy to show that
$$ D_f = \Bigl( -\bigl(\max_i m_n(X_i,\mathcal X,\beta)\bigr)^{-1},\; -\bigl(\min_i m_n(X_i,\mathcal X,\beta)\bigr)^{-1} \Bigr). $$
However, for vector-valued $\beta$, the analytical form of $D_f$ is no longer simple. When $1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)$ has negative components, $f(t)$ is no longer well-defined. Therefore we replace $\ln(\cdot)$ by a monotone function $h(\cdot)$ that is defined everywhere but extremely negative at infeasible values of $t$. This penalty function should guarantee that iterative root-finding methods like Newton-Raphson converge. Conditions on $h(z)$ include:

1. $h(z) = \ln(z)$ for any $z > \varepsilon$, where $\varepsilon$ is a positive constant close to 0;
2. For any $z_1 \in \mathbb R^- \cup \{0\}$ and $z_2 \in \mathbb R^+$, $h(z_2) > h(z_1)$. This way, replacing $\ln(\cdot)$ by the penalty function does not change the solution of the maximization problem (7.10);
3. $h(z)$ is continuous and differentiable for $z \in \mathbb R$;
4. When seeking a root with Newton-Raphson, if in the $k$th step $z_k < 0$, then $h'(z_k)$ should point in the direction such that $z_{k+1} > 0$. In other words, if $z < 0$, then $h'(z)$ should always guarantee $z - h(z)/h'(z) > 0$.

A feasible choice of such a function is
$$ h(z) = \begin{cases} \ln(z) & \text{when } z > \varepsilon, \\[2pt] \ln(\varepsilon) + \dfrac{z - \varepsilon}{\varepsilon^2} & \text{when } z \le \varepsilon. \end{cases} \qquad (7.12) $$
Using $h(z)$, for any fixed $\beta$, we define $t = t(\beta)$ as
$$ t(\beta) \equiv \arg\max_t f_h(t), \quad \text{where } f_h(t) = \sum_{i=1}^n h\bigl(1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr). \qquad (7.13) $$

Step 2: Estimate $\beta_0$. After writing the Lagrange multiplier $t$ as a function of $\beta$, we can compute $\hat\beta_n$ by minimizing the negative logarithm of the pEL, namely $\hat\beta_n = \arg\min_\beta l(\beta)$, where
$$ l(\beta) = \sum_{i=1}^n \ln\bigl(1 + t^{\mathrm{tr}}(\beta)\, m_n(X_i,\mathcal X,\beta)\bigr). \qquad (7.14) $$
Quasi-Newton methods do not work for (7.14) under the AFT model assumption because of the indicator functions in $m_n(X_i,\mathcal X,\beta)$. Therefore we use the default Nelder-Mead method in the optim function of R, a derivative-free simplex method that can be applied to nonlinear optimization problems for which derivatives may not exist.
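Step 1 can be sketched as follows. The dissertation uses R's nlm; the sketch below uses Python with scipy's Nelder-Mead, and the particular value of the threshold $\varepsilon$ is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import minimize

EPS = 1e-4  # threshold epsilon in (7.12); this particular value is an assumption

def h(z):
    """Penalised logarithm (7.12): ln(z) above EPS, linear continuation below."""
    z = np.asarray(z, dtype=float)
    safe = np.maximum(z, EPS)            # avoid taking log of non-positive values
    return np.where(z > EPS, np.log(safe), np.log(EPS) + (z - EPS) / EPS**2)

def t_of_beta(M):
    """Lagrange multiplier t(beta) of (7.13): maximise f_h(t) = sum_i h(1 + t'm_i),
    where the rows of M are the summands m_n(X_i, X, beta) at the fixed beta."""
    neg_f = lambda t: -h(1.0 + M @ t).sum()
    res = minimize(neg_f, np.zeros(M.shape[1]), method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-12})
    return res.x
```

Starting from $t = 0$ keeps the initial point feasible, and the steep linear branch of $h$ below $\varepsilon$ penalises any step that would make some $1 + t^{\mathrm{tr}} m_i$ non-positive.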
We use the least-squares estimator on the uncensored data, denoted by $b_0$, as the initial value, where
$$ \text{Uncensored index set: } R = \{i : \Delta_i = 1\}, \text{ with } \#R = u; \quad \text{Vector of uncensored lifetimes: } O = (V_k;\, k \in R) \in \mathbb R^u; \quad \text{Matrix of covariates with uncensored observations: } H = (Z_k : k \in R) \in \mathbb R^{u \times p}; \qquad (7.15) $$
then $b_0 = (H^{\mathrm{tr}} H)^{-1} H^{\mathrm{tr}} O$, and $\hat\beta_n$ is computed by optim(b0, l)$par. This choice of $b_0$ is good for the uncensored AFT model, but in the right-censored case a more reasonable initial choice is the estimator of Koul, Susarla and van Ryzin in [25].

Restarting Improvement. Ideally, with a reasonable initial value, the Nelder-Mead method should solve the minimization problem with a single application of optim. In practice, however, the optimization with respect to $\beta$ turns out to require multiple restarts. More specifically, let $b_0$ be the least-squares estimator calculated on the set $R$, and
$$ b_{r+1} = \arg\min_\beta l(\beta) \text{ with initial value } b_r, \quad r = 0, 1, \dots \qquad (7.16) $$
Let $\|\cdot\|_1$ be the $L_1$ norm on vectors. The sequence of restarts stops when $\|b_r - b_{r+1}\|_1 \le 0.01$. Ideally, the sequence (7.16) should stop at $r = 0$. However, we found that this ideal case happens only when the initial value is good enough, notably when the data are uncensored, or when the Koul-Susarla-van Ryzin estimator in [25] is used as $b_0$ for right-censored data. On the contrary, when the simple least-squares estimator is far from $\beta_0$, the restarting improvement is necessary. For example, when
$$ Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim \ln(F_{3,5}), \; Z_4 \sim \ln(\mathrm{Beta}(5,3)), \; \varepsilon = \mathrm{Weibull}(1)-1, \; C = \mathrm{Exp}(5.8), \qquad (7.17) $$
for sample size $n = 200$ and batches of simulations of size $m = 1000$, the numbers of iterations $r$ are 1, 2 and 3, with frequencies 780, 209 and 11.

7.1.2 Algorithm for Computing the Z-Estimator

As discussed in previous chapters, we can construct an estimator for $\beta_0$ by solving the extended estimating equation $S_n(\beta) = 0$.
Due to the discontinuity of $S_n(\beta)$, instead of directly solving the equation, we define
$$ \tilde\beta_n \equiv \arg\min_\beta \|S_n(\beta)\|, \qquad (7.18) $$
where $\|\cdot\|$ denotes the Euclidean norm. Again, we minimize using optim with initial value $b_0$, i.e., the least-squares estimator on the set $R$. As with the computation of $\hat\beta_n$, the Nelder-Mead method is not guaranteed to solve the minimization problem (7.18) in one step, so we use the same restarting strategy described in Section 7.1.1. In the following section, we show that the negative profile log-likelihood function $l(\beta)$ is much smoother than $S_n(\beta)$; therefore we should expect more restarts in the calculation of $\tilde\beta_n$ than of $\hat\beta_n$. This conjecture will be verified numerically in Section 7.4.

7.2 Local Continuity of EEE and pEL

In this section, we compare the local continuity of the EEE $S_n(\beta)$ with that of the negative logarithm of the pEL function $l(\beta)$. From the plots of $l(\beta)$ and $S_n(\beta)$, we will see clearly that the pEL is much smoother than the EEE. At the end of this section, we define a quantity tot.dif that measures the continuity pattern in a neighborhood of $\hat\beta_n$, $\beta_{LS}$, and $\beta_0$. From the histograms of tot.dif, we will see that the continuity behaviors in the neighborhoods of $\hat\beta_n$ and $\beta_0$ are very similar.

7.2.1 Plots of Projections of EEE and pEL

Since both $S_n(\beta)$ and $l(\beta)$ are defined on $\mathbb R^p$, it is not easy to plot them directly. Therefore we plot one "slice" at a time, namely, we project $S_n(\beta)$ and $l(\beta)$ with respect to $\beta_i$, where in general, for $v \in \mathbb R^p$, $v_i$ is the $i$th component of $v$. More specifically, for the EEE
$$ S_n(\beta) = \sum_{i=1}^n m_n(X_i,\mathcal X,\beta), \quad \beta \in U_{\beta_0}, \qquad (7.19) $$
consider the function
$$ d_S(s; \beta, i, j) = S_n(\beta + s\, e_j) \cdot e_i, \qquad (7.20) $$
where $\beta \in U_{\beta_0}$, $s \in (-\delta, \delta)$ with $\delta$ a small positive number, $i, j = 1, \dots, p$, and $e_i$ is the $i$th column of the $p \times p$ identity matrix. The function $d_S(s; \beta, i, j)$ allows $\beta$ to change only in the direction of $e_j$, and records the value of the $i$th component of $S_n(\beta + s\, e_j)$ as output.
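The slice function $d_S$ in (7.20) is straightforward to implement. The dissertation's plots were produced in R; a minimal Python sketch, with `dS` and `slice_curve` as hypothetical helper names and 0-based indices:

```python
import numpy as np

def dS(s, beta, i, j, S_n):
    """Projection d_S(s; beta, i, j) = e_i' S_n(beta + s e_j) from (7.20).
    S_n is any callable returning the p-vector criterion function."""
    e_j = np.zeros(len(beta))
    e_j[j] = 1.0
    return S_n(beta + s * e_j)[i]

def slice_curve(beta, i, j, S_n, delta=0.02, num=101):
    """Evaluate one slice on a grid over (-delta, delta), as plotted in Figure 7.1."""
    grid = np.linspace(-delta, delta, num)
    return grid, np.array([dS(s, beta, i, j, S_n) for s in grid])
```

For a linear criterion $S_n(\beta) = A\beta$ the slice is exactly the affine function $s \mapsto (A\beta)_i + A_{ij}\, s$, which is the benchmark against which the jumps of the AFT criterion stand out.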
As for the negative log empirical likelihood function, consider
$$ d_l(s; \beta, u, i) = l(\beta + s \cdot u^i), \quad i = 1, \dots, p, \qquad (7.21) $$
where $\beta \in U_{\beta_0}$, $s \in (-\delta, \delta)$ for a small positive number $\delta$, and $u^i$ is a randomly generated unit direction. In the simulation, we use rnorm(p) to generate each direction, then divide it by its Euclidean norm.

We run the simulation under two parameter settings with $\beta_0 = (4, 3, 2, 1)$: normally distributed covariates in Set 1, and non-normally distributed covariates in Set 2,
$$ \text{Set 1:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim N(3,25), \; Z_4 \sim N(5,9), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.015), \qquad (7.22) $$
and
$$ \text{Set 2:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim F_{3,5}, \; Z_4 \sim \mathrm{Beta}(5,3), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.08), \qquad (7.23) $$
with $n = 1000$ for $m = 1$ replica. For Set 1 and Set 2, the censoring rates are 18.6% and 33.2%. $\hat\beta_n$ and $\tilde\beta_n$ in both sets are identical to the second decimal place, and are $(4.01, 3.00, 1.99, 0.99)$ and $(3.86, 3.09, 2.05, 0.80)$, respectively. We plot $d_S$ for both Set 1 and Set 2. As we can see from Figures 7.1 and 7.2, even for a large sample size $n = 1000$, within a small neighbourhood of the true $\beta_0$ the extended estimating equation has many jumps.

Figure 7.1: Plot of $d_S(s; \beta_0, i, j)$, normal covariates. Figure 7.1 shows a plot of $d_S(s; \beta_0, i, j) = S_n(\beta_0 + s\, e_j)^{\mathrm{tr}} e_i$ for $s \in (-0.02, 0.02)$, where $i, j = 1, \dots, p$ index horizontal and vertical panels. Horizontal lines indicate level 0. It shows that the function is linear and smooth only on the diagonal. The magnitude of the $y$-coordinates indicates that $S_n(\beta)$ is close to a diagonal matrix times $\beta$.
Figure 7.2: Plot of $d_S(s; \beta_0, i, j)$, non-normal covariates. Figure 7.2 shows a plot of $d_S(s; \beta_0, i, j) = S_n(\beta_0 + s\, e_j)^{\mathrm{tr}} e_i$ for $s \in (-0.4, 0.4)$, where $i, j = 1, \dots, p$ index horizontal and vertical panels. Horizontal lines indicate level 0. It shows that the function is linear and smooth only on the diagonal. The magnitude of the $y$-coordinates indicates that $S_n(\beta)$ is close to a diagonal matrix times $\beta$.

Then we check the smoothness of the negative profile log-likelihood $l(\beta)$ for $\beta$ in a neighbourhood of $\hat\beta_n$ by plotting $d_l(s; \hat\beta_n, u, i)$. Figures 7.3 and 7.4 are calculated under Set 1 (normal covariates) and Set 2 (non-normal covariates), respectively. The solid vertical line in each panel denotes $\hat\beta_n$, and the dotted vertical line denotes the true $\beta_0$. As we can see from Figures 7.3 and 7.4, the distance between the estimated value and the true parameter value is small, the profile likelihood function is very smooth within the neighborhood of $\hat\beta_n$, and $\hat\beta_n$ is the minimizer of $l(\beta)$ in each randomly generated direction for both normal and non-normal covariates.
Figure 7.3: Plot of $d_l(s; \hat\beta_n, u, i)$, normal covariates. Figure 7.3 shows a plot of $d_l(s; \hat\beta_n, u, i) = l(\hat\beta_n + s\, u^i)$ for $s \in (-0.02, 0.02)$, where $i = 1, \dots, p$ indexes the panels. The solid vertical line in each panel denotes $\hat\beta_n$, and the dotted vertical line denotes the true $\beta_0$. The figure indicates that in any random direction, $l(\beta)$ is smooth and has a parabolic form, and $\hat\beta_n$ is the minimizer of $l(\beta)$.

Figure 7.4: Plot of $d_l(s; \hat\beta_n, u, i)$, non-normal covariates. Figure 7.4 shows a plot of $d_l(s; \hat\beta_n, u, i) = l(\hat\beta_n + s\, u^i)$ for $s \in (-0.4, 0.4)$, where $i = 1, \dots, p$ indexes the panels. The solid vertical line in each panel denotes $\hat\beta_n$, and the dotted vertical line denotes the true $\beta_0$. The figure indicates that in any random direction, $l(\beta)$ is smooth and has a parabolic form, and $\hat\beta_n$ is the minimizer of $l(\beta)$.
7.2.2 Quantitative Measurement of the Local Continuity of $l(\beta)$

The local continuity of the negative profile log-likelihood function $l(\beta)$ around $\beta_0$ can be quantified using the total difference of the approximated derivative of the function $d_l(s; \beta, i)$, in the following steps:

1. For fixed $\beta \in U_{\beta_0}$ and $s \in (-\delta, \delta)$ with $\delta$ a small positive number, let
$$ d_l(s; \beta, i) = l(\beta + s \cdot e_i), \quad i = 1, \dots, p, \qquad (7.24) $$
where $e_i$ is the $i$th row of the $p \times p$ identity matrix.

2. For $i = 1, \dots, p$, approximate the derivative of $d_l(s)$ using
$$ \mathrm{der}(s; \beta, i) = \frac{1}{2\varepsilon_1}\bigl(d_l(s + \varepsilon_1; \beta, i) - d_l(s - \varepsilon_1; \beta, i)\bigr). \qquad (7.25) $$

3. Generate a grid of points over the interval $(-a, a)$ with grid spacing $\varepsilon_0$. Calculate $\mathrm{der}(s; \beta, i)$ at each grid point. Record the output in a $p \times l$ matrix out, where $l$ is the number of grid points; $\mathrm{out}_{i,j}$ approximates the derivative of $l(\beta)$ in the direction of $\beta_i$ at the $j$th grid point, for $i = 1, \dots, p$ and $j = 1, \dots, l$.

4. For each row of the matrix out from Step 3, calculate the absolute differences between consecutive components,
$$ \mathrm{abs.d}_{i,j} = |\mathrm{out}_{i,j} - \mathrm{out}_{i,j+1}|, \quad i = 1, \dots, p, \; j = 1, \dots, l-1, \qquad (7.26) $$
then record the row sums of these differences, i.e.,
$$ \mathrm{tot.d}_i = \sum_{j=1}^{l-1} \mathrm{abs.d}_{i,j}, \quad i = 1, \dots, p. \qquad (7.27) $$

Remark 7.2.1. We do not have a universal criterion for choosing $a$, $\varepsilon_0$ and $\varepsilon_1$ that is guaranteed to work for every parametric setting. Since we are only interested in the local continuity around a fixed $\beta$, $\varepsilon_0$ should not be too large. The choice of $\varepsilon_1$ depends on $\varepsilon_0$.
In general, the ratio $\varepsilon_1/\varepsilon_0$ should not be too large, in which case the total difference cannot reflect the subtle differences between different choices of $\beta$; on the other hand, if the ratio is too small, then due to the indicator functions, the total difference would be large for any choice of $\beta$. The quantity $\mathrm{tot.d}_i$ should be much smaller for smooth functions than for non-smooth ones. We also expect the total difference around $\hat\beta_{n,m}$ to be similar to the total difference around $\beta_0$, where $m$ denotes the number of simulations. To test these conjectures, we use the same simulated data sets as in (7.22) in Section 7.1. The parameters for the grids and difference quotients are $a = 0.5$, $\varepsilon_0 = 0.01$, $\varepsilon_1 = 0.001$. We ran the simulation $m = 1000$ times with sample size 200. The censoring rate ranges from 22% to 42%, and the simulation means are $(3.954, 2.970, 1.919, 0.974)$.

Figure 7.5: Plot of tot.dif, normal covariates. This figure shows the histograms of the total differences defined in (7.27), where the first row is for $\beta_0$, the second for $\hat\beta_n$, and the third for $\beta_{LS}$. The shapes and ranges of the histograms indicate that $\hat\beta_n$ has a continuity pattern similar to that of $\beta_0$, and that this pattern is not preserved when $\beta$ is far from $\beta_0$, as evidenced by the histograms for $\beta_{LS}$.
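Steps 1-4 of the total-difference computation (7.24)-(7.27) can be sketched as follows; `total_difference` is a hypothetical helper name, and $l$ may be any callable from $\mathbb R^p$ to $\mathbb R$ (the dissertation applies it to the negative profile log-likelihood computed in R).

```python
import numpy as np

def total_difference(l, beta, a=0.5, eps0=0.01, eps1=0.001):
    """tot.d_i of (7.27): summed absolute jumps of the central difference
    quotient (7.25) of s -> l(beta + s e_i), over a grid of spacing eps0
    on (-a, a).  Smooth l gives small values; jumpy l gives large ones."""
    p = len(beta)
    grid = np.arange(-a, a + eps0 / 2, eps0)   # Step 3: grid over (-a, a)
    tot = np.zeros(p)
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0
        der = np.array([(l(beta + (s + eps1) * e_i) - l(beta + (s - eps1) * e_i))
                        / (2 * eps1) for s in grid])   # Step 2: quotient (7.25)
        tot[i] = np.abs(np.diff(der)).sum()            # Step 4: (7.26)-(7.27)
    return tot
```

For a smooth quadratic such as $l(b) = \|b\|^2$, the difference quotient along each axis is exactly linear in $s$, so $\mathrm{tot.d}_i$ reduces to the total variation $2\varepsilon_0 (l-1)$ of that linear derivative over the grid.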
7.3 Convergence of the Estimator

Since $t^* = t(\beta)$ is the maximizer of $f_h(t) = \sum_{i=1}^n h\bigl(1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr)$ for each fixed $\beta$, the gradient $\nabla_t f_h(t)$ at $t^*$ must be zero. Define
$$ g(t^*, \beta) = \sum_{i=1}^n \frac{m_n(X_i,\mathcal X,\beta)}{1 + t^{*\,\mathrm{tr}} m_n(X_i,\mathcal X,\beta)}, \quad \text{where } t^* = t(\beta). \qquad (7.28) $$
Therefore at $\hat\beta_n$, the gradient $g(t(\hat\beta_n), \hat\beta_n)$ should be close to zero regardless of the parametric setting or sample size. To test this claim, we ran experiments under $\beta_0 = (1, 2, 3, 4)$ and
$$ \text{Set 3:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim \ln(\Gamma(3)), \; Z_3 \sim \ln(F_{3,5}), \; Z_4 \sim \ln(\mathrm{Beta}(5,3)), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.5), \qquad (7.29) $$
with sample sizes $n = 30, 50$ and 100 for $m = 100$ simulations. We found that the gradient is always bounded by a very small number, of magnitude 1e-5.

7.4 Asymptotic Normality of the Z-Estimator and pEL Estimator

In this section, we compare $\tilde\beta_{n,m}$, the solution to the EEE, and $\hat\beta_{n,m}$, the minimizer of the negative logarithm of the pEL. Under three different parametric settings, we compare the centers of $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$ as well as the empirical variances. We seek proper sample sizes $n_1$ and $n_2$ such that $\tilde\beta_{n_1,m}$ and $\hat\beta_{n_2,m}$ reach approximate normality. We also compare the computational efficiency in terms of elapsed time and the number of iterations needed for the restarting improvement described in Section 7.1.1. Throughout this section, $m = 1000$ unless otherwise specified.

7.4.1 Covariates without Normality

Let us start with a parametric setting whose covariates are not normally distributed.
Let $\beta_0 = (0.1, 0.2, 0.3, 0.4)$, and consider Set 3 described in (7.29):
$$ \text{Set 3:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim \ln(\Gamma(3)), \; Z_3 \sim \ln(F_{3,5}), \; Z_4 \sim \ln(\mathrm{Beta}(5,3)), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.5). \qquad (7.29) $$

Normality. We found that the smallest sample size needed to attain approximate normality is affected by the symmetry of the covariates. Checking the histograms, $Z_1$ and $Z_3$ appear much more symmetric, and the sample size needed for $\tilde\beta_n^1$, $\tilde\beta_n^3$, $\hat\beta_n^1$ and $\hat\beta_n^3$ to reach normality is significantly smaller than that for $\tilde\beta_n^2$ and $\tilde\beta_n^4$, and $\hat\beta_n^2$ and $\hat\beta_n^4$. With $n = 100$, censoring rate ranging from 16% to 43% with mean 29.3%, and based on the Shapiro test, $\tilde\beta_{n,m}^i$ and $\hat\beta_{n,m}^i$ reach normality only for $i = 1$ and $i = 3$. When the sample size is $n = 200$, both $\tilde\beta_{n,m}^i$ and $\hat\beta_{n,m}^i$ pass the Shapiro test for $i = 1, \dots, 4$. However, if we use Mardia's test for multivariate normality, the sample size needs to be increased to 450. Histograms of $\tilde\beta_{450,1000}^i$ and scatter plots of $\tilde\beta_{450,m}^i$ against $\hat\beta_{450,m}^i$ can be found in Figure 7.6.

Figure 7.6: Histograms and QQ plots for non-normal covariates.
Accuracy and Asymptotic Variance-Covariance. The first thing we compared was whether the two estimators $\hat\beta_{n,m}$ and $\tilde\beta_{n,m}$ differ much from replica to replica. In our simulation, we measure this difference using the $L_1$ vector norm of $\tilde\beta_{n,m} - \hat\beta_{n,m}$, which theoretically should be very small for proper $n$. This is because when constructing $\hat\beta_{n,m}$, we first obtained $t = t(\beta)$ from
$$ t^* = \arg\max_t f(t), \quad \text{where } f(t) = \sum_{i=1}^n \ln\bigl(1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr) \qquad (7.30) $$
for fixed $\beta$. Since $t$ is a maximizer of $f(t)$,
$$ \nabla_t f(t) = \sum_{i=1}^n \frac{m_n(X_i,\mathcal X,\beta)}{1 + t(\beta)^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)} = 0. \qquad (7.31) $$
On the other hand, $\tilde\beta_n$ is the solution to
$$ S_n(\beta) = \sum_{i=1}^n m_n(X_i,\mathcal X,\beta) = 0. \qquad (7.32) $$
Comparing (7.31) and (7.32), we see that $S_n(\beta)$ is exactly $\nabla_t f(t)$ evaluated at $t = 0$. Moreover, the solution to (7.30) theoretically has a very small magnitude relative to the order of $\|\beta - \beta_0\|$. Since for large $n$, $\|\hat\beta_n - \beta_0\|$ is small, by the continuity of $\nabla_t f(t)$, $\tilde\beta_n$ and $\hat\beta_n$ should be very close as well. We can observe this fact from the scatter plots of $\tilde\beta_{n,m}^i$ against $\hat\beta_{n,m}^i$ for $i = 1, \dots, p$: for $n = 100$, the dots are distributed closely around the line $y = x$, and as $n$ grows to 150 and 200, the linear pattern is even clearer. As for accuracy, even for $n = 100$, the mean and median of both $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$ are very close to the true $\beta_0$. We have also compared the empirical variance-covariance matrices of the two estimators, denoted by cov1 and cov2. The largest eigenvalue of cov1 − cov2 for $n = 100$ is of magnitude $n^{-3}$.

Restarting Improvement and Timing. We found that $\tilde\beta_{n,m}$ is always faster to compute than $\hat\beta_{n,m}$, because the latter requires two steps of optimization while $\tilde\beta_{n,m}$ requires only one. As for the number of iterations needed for the restarting improvement, we found that $\tilde\beta_{n,m}$ always requires more rounds.
This is not surprising because the negative logarithm of the pEL function $l(\beta)$ is much smoother than the extended estimating function $S_n(\beta)$; consequently, it takes more restarts to find the minimum of $\|S_n(\beta)\|$.

7.4.2 Normally Distributed Covariates

We investigate similar aspects of $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$ under $\beta_0 = (0.51, 0.15, 1.18, 0.92)$ and
$$ \text{Set 4:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim N(0.3, 0.25), \; Z_4 \sim N(0.1, 0.09), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.5). \qquad (7.33) $$
When all the covariates are normally distributed, the sample size $n$ at which $\tilde\beta_{n,m}^i$ and $\hat\beta_{n,m}^i$ attain normality is much smaller, for $i = 1, \dots, p$. For $n = 100$, with censoring rate ranging from 15% to 44% and centered at a mean of 28%, all components of $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$ passed the Shapiro normality test. When $n = 400$, both estimators pass the Mardia test for multivariate normality. Histograms of $\tilde\beta_{400,1000}^i$ and scatter plots of $\tilde\beta_{400,m}^i$ against $\hat\beta_{400,m}^i$ can be found in Figure 7.7.

Figure 7.7: Histograms and QQ plots for normal covariates.

Similar to the
non-normal setting, the $L_1$ norm of $\tilde\beta_{n,m} - \hat\beta_{n,m}$ is always small, as evidenced by the scatter plots of $\tilde\beta_{n,m}^i$ against $\hat\beta_{n,m}^i$ for $i = 1, \dots, 4$, on which the points are distributed closely along the line $y = x$. The magnitudes of the absolute values of the eigenvalues of $\mathrm{cov}(\tilde\beta_{n,m}) - \mathrm{cov}(\hat\beta_{n,m})$ are also small: when $n = 450$, the eigenvalue with the largest absolute value is 1.66e-05.

Under Set 4, the EL method is still more costly in terms of time. However, as in Set 3, the number of iterations needed for the restarting improvement is much smaller for the EL method. When $n = 400$, 938 replicas are finished with only 1 restart and the remaining 62 are finished within 2 restarting iterations for $\hat\beta_{n,m}$; in comparison, when calculating $\tilde\beta_{n,m}$, only 834 replicas are finished within 1 restart, and the maximum number of iterations is 4.

7.4.3 Extreme Cases

In this section, we consider an "extreme" parameter setting, under which the lifetime $T$ is either very large or very small. This is rarely seen in practice; however, we would like to compare the behavior of $\tilde\beta_n$ and $\hat\beta_n$ out of theoretical interest. Consider $\beta_0 = (0.4, 0.3, 0.2, 0.1)$ and
$$ \text{Set 5:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim F_{3,5}, \; Z_4 \sim \mathrm{Beta}(5,3), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.8). \qquad (7.34) $$
Under parametric setting (7.34), the censoring rate ranges from 25.8% to 38.6% with mean around 31.3%. From the histograms, we can see that $T$ and $V$ are severely skewed to the right with extremely large outliers, which happens because the covariate $Z_3$ is severely skewed to the right. Therefore we expect the estimator for $\beta_0^3$ to attain asymptotic normality only at a much larger $n$. For both $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$, a large sample size is needed in order to attain asymptotic normality. Unfortunately, we were not able to find a proper $n$ at which $\tilde\beta_{n,m}^3$ or $\hat\beta_{n,m}^3$ attains normality: for $n = 1000$, the p-values of the Shapiro test on both $\tilde\beta_{n,m}^i$ and $\hat\beta_{n,m}^i$ for $i = 1, 2, 4$ are above 0.6, yet 0.0004 for $\tilde\beta_{n,m}^3$ and $\hat\beta_{n,m}^3$.
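The per-component normality checks used throughout Section 7.4 were run in R; a minimal Python sketch with scipy's Shapiro-Wilk test, where `componentwise_shapiro` is a hypothetical helper name:

```python
import numpy as np
from scipy.stats import shapiro

def componentwise_shapiro(est, alpha=0.05):
    """Apply the Shapiro-Wilk test to each component of a batch of
    estimates; est has shape (m, p), one row per simulation replicate.
    Returns the p-values and flags indicating which components pass."""
    pvals = np.array([shapiro(est[:, i]).pvalue for i in range(est.shape[1])])
    return pvals, pvals > alpha
```

A heavily skewed component (such as the one driven by $Z_3$ above) is expected to be rejected even at moderate $m$, while symmetric components pass at much smaller sample sizes.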
Comparing $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$, the two are very close, as evidenced by a small $L_1$ norm for $m = 1, \dots, 1000$, and the difference of the variance-covariance matrices has eigenvalues of magnitude 1e-5 when $n = 1000$ and 1e-4 when $n = 500$.

Chapter 8: Contribution and Future Work

In this chapter, we summarize the major contributions of this thesis, then outline future work related to the Extended Estimating Equations (EEE), and how to extend them further so that the EEE can be applied to a broader class of semiparametric models.

8.1 Original Contribution

The most important concept developed in this thesis is the EEE. Motivated by the lack of coverage of the Classic Estimating Equation (CEE) in the regime of right-censored semiparametric models, including the widely used Cox model and the Accelerated Failure Time (AFT) model, we saw the necessity of extending the scope of the CEE so that it can also serve as a tool in those semiparametric settings. Inspired by the construction of the partial likelihood equation for the Cox model in [11] and the linear rank equation for the AFT model in [38], we establish the EEE by allowing the criterion function to depend not only on the data and the structural parameter, but also on the nuisance parameter. Using the concepts of Glivenko-Cantelli and Donsker classes, we are able to prove lemmas parallel to those for the EL in [32]
The GMT model has been well researched in the statistical literature, for example by Tsiatis in [38]. Similarly to Tsiatis's work, we construct a martingale structure on the residual scale. This structure guarantees the zero-mean assumption of the EEE. We then prove the local asymptotic linearity of the associated EEE around the true structural parameter, which leads to the asymptotic normality of the structural estimator. We can also use the EEE as a constraint in Empirical Likelihood (EL) maximization. We prove that for criterion functions that are continuous with respect to the structural parameter, or that have a martingale structure as described for the ϕ-transformation model and the Cox model, there exists a non-shrinking neighbourhood of the true structural parameter such that, for any fixed value in that neighbourhood, there exists a unique probability vector that maximizes the EL with probability approaching 1. A structural estimator can then be calculated by maximizing the Profile Empirical Likelihood (pEL). After establishing lemmas parallel to those in [32] and [33] using empirical process theory, we show that the pEL estimator is asymptotically normal. When the dimension of the EEE equals that of the structural parameter, the asymptotic variance-covariance matrix of the pEL estimator is identical to the sandwich-form variance-covariance matrix of the Z-estimator obtained by solving the corresponding EEE. From simulations in R, we are able to visually compare the local continuity of the EEE and the pEL, and we found the latter much smoother than the former. We also propose a concept of "total difference" to quantify the local continuity of the pEL. Although we are not clear about the statistical behavior of this quantity, the simulation results show that the pEL estimator preserves a pattern of total difference similar to that at the true parameter value.
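The "total difference" summary mentioned above admits several formalizations; one plausible version, sketched below, is the total variation of the criterion along a one-dimensional coordinate projection over a grid near the estimate. All names and parameter choices here are illustrative, not the dissertation's exact definition.

```python
import numpy as np

def total_difference(fn, beta_hat, coord=0, half_width=0.05, n_grid=101):
    """Sum of absolute successive changes of `fn` along one coordinate
    projection near `beta_hat` -- one plausible formalization of the
    'total difference' smoothness summary (illustrative only)."""
    grid = np.linspace(beta_hat[coord] - half_width,
                       beta_hat[coord] + half_width, n_grid)
    vals = []
    for g in grid:
        b = beta_hat.copy()
        b[coord] = g          # vary one coordinate, hold the others fixed
        vals.append(fn(b))
    return np.sum(np.abs(np.diff(vals)))

# A smooth projection accumulates less total difference than a step-like one,
# matching the smoother pEL versus the non-smooth EEE criterion.
beta_hat = np.array([0.4, 0.3])
smooth = lambda b: (b[0] - 0.4) ** 2
jumpy = lambda b: np.floor(20 * b[0]) / 20.0   # piecewise-constant jumps
```

On this toy pair, the smooth projection has total difference about 0.005 while the step function accumulates the full heights of its jumps, so the summary separates the two behaviors as intended.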
Although, strictly speaking, neither the EEE nor the pEL is continuous with respect to the structural parameter, the pEL does appear to be smoother. One benefit of this "continuity" is that the restarting-improvement procedure using Nelder-Mead optimization requires far fewer iterations. We also find that for a sufficiently large sample size, the Z-estimator and the pEL estimator are very close: the L1 distance between the two is small, and the magnitude of the eigenvalues of the difference between their variance-covariance matrices is small as well. The smallest sample size needed for the pEL to reach asymptotic normality is affected by the skewness of the covariates: the more symmetric the covariates are, the smaller the threshold sample size. 8.2 Future Work Under the CEE setting, one of the important benefits of the EL is constructing Wilks-type confidence regions and developing the EL ratio test. Compared with Wald-type statistics, the EL confidence region does not involve calculating a variance-covariance matrix, and is therefore more computationally efficient. Owen developed the EL confidence region in [32] for the case when the dimension r of the criterion function equals the dimension p of the structural parameter, and Qin and Lawless generalized his conclusion to the case r > p. Following the thread of Qin and Lawless's work, we were able to establish the asymptotic normality of the pEL estimator associated with the EEE; however, we did not establish a Wilks-type theorem for the EEE. Since we have already proved the asymptotic normality of the EEE evaluated at the true structural parameter, applying to the pEL the same Taylor expansion technique as in the proof of Theorem 2 of [33], we should be able to show that the EL ratio statistic follows a chi-squared distribution with degrees of freedom equal to the dimension of the structural parameter, and a confidence region will follow from this result.
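The Wilks-type construction envisioned here is the EEE analogue of the classical CEE case, where it is easy to sketch. The following Python sketch computes −2 log of Owen's empirical likelihood ratio for a scalar mean, via the Lagrange-multiplier equation, and the chi-squared cutoff that calibrates the confidence region; it illustrates the CEE mechanism only, not the EEE extension proposed above.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def neg2_log_el_ratio(x, mu):
    """-2 log empirical likelihood ratio for a scalar mean (Owen's classical
    CEE case): solve sum_i (x_i - mu) / (1 + lam*(x_i - mu)) = 0 for the
    Lagrange multiplier lam, then evaluate 2 * sum_i log(1 + lam*(x_i - mu))."""
    d = x - mu
    if d.min() >= 0 or d.max() <= 0:
        return np.inf                      # mu outside the convex hull of the data
    # lam must keep every implied weight positive: 1 + lam*d_i > 0
    lo = (-1.0 + 1e-10) / d.max()
    hi = (-1.0 + 1e-10) / d.min()
    lam = brentq(lambda l: np.sum(d / (1.0 + l * d)), lo, hi)
    return 2.0 * np.sum(np.log1p(lam * d))

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=200)
# Wilks-type 95% region: { mu : -2 log R(mu) <= chi-squared(1) 0.95 quantile }
cutoff = chi2.ppf(0.95, df=1)
stat_at_truth = neg2_log_el_ratio(x, 0.5)
```

The score in the multiplier equation is strictly decreasing in λ, so the bracketed root is unique; the ratio is 0 at the sample mean and +∞ outside the convex hull of the data, which is the well-known hull constraint of the EL region.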
A major advantage of using the CEE as a constraint in the EL method, rather than directly solving for a Z-estimator, is that the former allows the dimension r of the CEE to exceed the dimension p of the structural parameter. In this thesis, we restrict our discussion to the case r = p, except in the proofs of the lemmas parallel to Lemma 1 and Theorem 1 of [33]. It is appealing to allow r > p under the EEE setting as well. The additional proofs would involve establishing a neighbourhood of the true structural parameter on which there exists a unique probability vector maximizing the EL, among other steps. The efficiency of the Z-estimator and the pEL estimator associated with the EEE is an area untouched in this thesis. Recall that the broadest class of semiparametric models associated with the EEE is the ϕ-transformation model, which is generalized from the General Transformation Model (GMT). In [38], Tsiatis proposed a class of estimators using linear rank tests for the GMT, and constructed the efficient estimators within this class together with conditions under which they are fully efficient. In Theorem 3 of [33], Qin and Lawless pointed out that the pEL estimator with r > p is efficient in the sense of [40] and [5]. As for the EEE, we are interested in the following two questions. First, when r = p, are the Z-estimator and the pEL estimator efficient in any sense? Second, when r > p, the asymptotic variance-covariance matrices are no longer identical; it is therefore appealing to inspect whether the two matrices become the same after further simplification, and if not, how the efficiencies of the two estimators compare. Up to now, the broadest class of semiparametric models that we can construct to serve as an example of the EEE is the ϕ-transformation model. A natural question is therefore whether it is possible to extend the EEE even further to cover more models.
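One candidate extension, taken up next, is a frailty version of the Cox model, in which an unobserved multiplicative random effect enters the hazard. As a minimal sketch of what data from such a model look like, assuming a unit exponential baseline hazard, a gamma frailty with mean 1, and hypothetical parameter values throughout, survival times can be simulated by inversion:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 5000, np.array([0.4, 0.3])

# Gamma frailty W with mean 1 and variance 1/shape (hypothetical choice),
# independent of the covariates Z.
shape = 2.0
W = rng.gamma(shape, 1.0 / shape, size=n)
Z = rng.normal(size=(n, 2))

# With a unit exponential baseline, Lambda0(t) = t, so inverting the
# conditional survival function S(t|Z,W) = exp(-t * W * exp(beta'Z))
# gives T = E / (W * exp(beta'Z)) with E ~ Exp(1).
T = rng.exponential(1.0, size=n) / (W * np.exp(Z @ beta))
```

Conditioning the frailty to have mean 1 is the usual identifiability normalization; larger frailty variance induces more heterogeneity between subjects than the Cox model alone can express.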
One such model could be the frailty model, first introduced by Vaupel for univariate survival models in [43], and later applied to the multivariate situation of familial tendency in chronic disease incidence by Clayton in [9]. The frailty model introduces an unobserved random effect into the exponential proportionality part of the Cox model, i.e.,

λ_{T|Z̃(t),W}{t | z̃(t), w} = λ₀(t) exp{ln W + βᵗʳZ(t)},   (8.1)

where ln W is an unobserved continuous random variable, unique to the linear predictor of each observation, that is independent of Z; Z = {Z(t) : t ≥ 0} is a p × 1 covariate that may be time dependent; λ_{T|Z̃(t),W}{t | z̃(t), w} is the hazard function of T conditional on W and Z̃(t) = {Z(s), s ≤ t}; and λ₀(t) is an unspecified baseline hazard function. Let f(w, γ) be the density function of W with unknown parameter γ ∈ R. Let θ = (β, γ) ∈ Θ, where Θ = U_{β₀} × R; then (8.1) yields a class of semiparametric models with parameters ψ = (θ, λ) ∈ Θ × H, where

H ≡ {λ(·) : λ(u) > 0 for all u ∈ R⁺; ∫₀^∞ λ(u) du = ∞}.

We did some preliminary calculation on (8.1) assuming W follows a gamma distribution, and we found that an estimator of β can be defined via maximizing the pEL, which eventually reduces to solving an equation that involves quotients among three higher-order summations. The question, then, is whether we could make a general definition out of the frailty model that is an extended EEE. On the one hand, this extended EEE should cover cases including the frailty model, the ϕ-transformation model and the Cox model; on the other hand, there should be examples beyond those three semiparametric models that can be investigated using the extended EEE. Appendix A: Empirical Process Theory In this appendix, we first show that under suitable conditions, if F is a Donsker class with finite bracketing integral, then it is also a Glivenko-Cantelli class.
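Before the formal proofs, the uniform convergence that this appendix establishes can be illustrated numerically. The following Python sketch, with a scalar covariate and illustrative distributions for Z and V, evaluates the empirical mean from (A.1) over a (β, t) grid; the only properties checked are the monotone decay in t and the agreement with E[Z] at (β, t) = (0, 0).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
Z = rng.uniform(0.0, 1.0, size=n)   # bounded covariate, as in Theorem A.0.3
V = rng.exponential(1.0, size=n)    # illustrative choice for V

def empirical_map(beta, t):
    """Scalar-covariate version of the empirical mean in (A.1):
    (1/n) * sum_j Z_j * I(V_j > t) * exp(beta * Z_j)."""
    return np.mean(Z * (V > t) * np.exp(beta * Z))

betas = np.linspace(-0.5, 0.5, 11)
ts = np.linspace(0.0, 2.0, 21)
G = np.array([[empirical_map(b, t) for t in ts] for b in betas])
# Each row is nonincreasing in t (the at-risk indicator can only shrink),
# and G at (beta, t) = (0, 0) is close to E[Z] = 0.5.
```

The Glivenko-Cantelli property proved below is exactly the statement that such grids of empirical means converge to their population counterparts uniformly over the whole (β, t) range, not merely pointwise.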
Then we prove that under P_θ with θ = (β, µ),

(1/n) ∑_{j=1}^n Z_j I(V_j > t) e^{βᵗʳZ_j} →^P E{Z₁ I(V₁ > t) e^{βᵗʳZ₁}},   (A.1)

where {(Z_j, V_j)}_{j=1}^∞ are independently identically distributed (i.i.d.) samples defined in Chapter 1, and the convergence is uniform in the parameter (β, t) ∈ U_{β₀} × R⁺, where U_{β₀} is a ball in Rᵖ that contains the true parameter value. Definitions of the Donsker class, the Glivenko-Cantelli class and the bracketing integral J_{[ ]}(1, F, L₂(P)) can be found on pages 269-270, Chapter 19 of [41]. Theorem A.0.1 Let F be a class of measurable functions such that the bracketing integral J_{[ ]}(1, F, L₂(P)) < ∞. Then F is a Glivenko-Cantelli class. Proof: Let N_{[ ]}(ε, F, L_p(P)) be the ε-bracketing number in L_p defined on page 270 of [41]. Since by this definition every ε-bracket in L₂ is also an ε-bracket in L₁, we know that N_{[ ]}(ε, F, L₁(P)) ≤ N_{[ ]}(ε, F, L₂(P)). By the definition of the bracketing integral on page 270, Section 19.2 of [41],

J_{[ ]}(δ, F, L₂(P)) = ∫₀^δ √(ln N_{[ ]}(ε, F, L₂(P))) dε.   (A.2)

Therefore, given J_{[ ]}(1, F, L₂(P)) < ∞, the integrand √(ln N_{[ ]}(ε, F, L₂(P))) is finite for almost every ε. Consequently, N_{[ ]}(ε, F, L₂(P)) < ∞ a.e. By Theorem 19.4 of [41], F is also Glivenko-Cantelli. The following theorem is established as part of Example 19.20 on page 277 of [41]. It shows how to construct a new Donsker class from two existing Donsker classes via a Lipschitz transformation. Theorem A.0.2 (Lipschitz Transformation) Let φ : R² → R be a fixed Lipschitz function. If f and g range over Donsker classes F and G with integrable envelope functions, then the set of functions φ(f, g) is Donsker. Now we are ready to show that the convergence in (A.1) is uniform in the parameter θ = (β, t). Theorem A.0.3 Let x = (z, s) ∈ Z × R⁺ and θ = (β, t) ∈ U_{β₀} × R⁺, where Z is a compact and bounded set in Rᵖ and U_{β₀} ⊂ Rᵖ is a ball centered at β₀. Then

(1/n) ∑_{j=1}^n Z_j I(V_j > t) e^{βᵗʳZ_j} →^P E{Z₁ I(V₁ > t) e^{βᵗʳZ₁}},   (A.3)

and the convergence is uniform in θ.
Proof: For fixed θ = (β, t) ∈ U_{β₀} × R⁺, the convergence in (A.3) is guaranteed by the Law of Large Numbers (LLN). To prove that this convergence is uniform in θ, for any fixed θ define the function f(·, θ) : Z × R⁺ → R with argument x = (z, s) ∈ Z × R⁺ by

f(x, θ) = z e^{βᵗʳz} I(s ≥ t).   (A.4)

If we can show that F = {f(x, θ) = z e^{βᵗʳz} I(s ≥ t), where θ = (β, t) ∈ U_{β₀} × R⁺} is Glivenko-Cantelli, then the uniformity is proved. The idea is to consider F as a Lipschitz transformation of a parametric class and a bounded-variation class, and then apply Theorem A.0.2. For any fixed β, let g(x, β) = z e^{βᵗʳz} and G = {g(·, β) : β ∈ U_{β₀}}. Let us show that G is a parametric class satisfying the conditions described in Example 19.7 of [41], where β corresponds to the θ of Example 19.7. For any β₁ and β₂ in U_{β₀},

|z e^{β₁ᵗʳz} − z e^{β₂ᵗʳz}| = |z(e^{β₁ᵗʳz} − e^{β₂ᵗʳz})| ≤ |z| |e^{β₁ᵗʳz} − e^{β₂ᵗʳz}|.   (A.5)

Since the exponential function eˣ is Lipschitz continuous when x is bounded, there exists a constant C such that

|e^{β₁ᵗʳz} − e^{β₂ᵗʳz}| ≤ C|β₁ᵗʳz − β₂ᵗʳz| = C|(β₁ − β₂)ᵗʳz| ≤ C|β₁ − β₂||z|.

Plugging this back into (A.5) yields

|z e^{β₁ᵗʳz} − z e^{β₂ᵗʳz}| ≤ C|z|²|β₁ − β₂|.

Hence the Lipschitz condition of Example 19.7 holds with m(x) = C|z|². Consequently, G is a Donsker class by Example 19.7 of [41]. For any fixed t ∈ R⁺, consider functions h(·, t) : Z × R⁺ → R of the form h(x, t) = I(s ≥ t), and let H = {h(·, t) : t ∈ R⁺}. For any h ∈ H, the variation is 1. Then, by the bounded-variation class described in Example 19.11 of [41], H is also a Donsker class. Next, we consider the product of functions from G and H. Let φ : R² → R be φ(u, v) = uv. This is a Lipschitz function on the relevant range, since z is assumed to be bounded and therefore G is uniformly bounded.
Hence, by Theorem A.0.2, the new class of functions φ(g, h) with g ∈ G and h ∈ H is also Donsker; i.e., the class of functions f : Z × R⁺ → R of the form

F = {f(x, θ) = z e^{βᵗʳz} I(s ≥ t), where θ = (β, t) ∈ U_{β₀} × R⁺}

is a Donsker class. Bibliography [1] P. Andersen, O. Borgan, R. Gill and N. Keiding, Statistical Models Based on Counting Processes, (Springer, New York), Chap. 2. [2] P. Andersen and R. Gill, "Cox's Regression Model for Counting Processes: a Large Sample Study", Ann. Statist. 10, (1982). [3] V. Bagdonavičius and M. Nikulin, "Semiparametric Models in Accelerated Life Testing," Queen's Papers in Pure and Applied Mathematics 98, (1995). [4] B. Bibby and M. Sørensen, "Martingale Estimation Functions for Discretely Observed Diffusion Processes", Bernoulli 1, (1995). [5] P. Bickel, C. Klaassen, Y. Ritov and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, (Johns Hopkins Univ. Press). [6] P. Billingsley, Probability and Measure, (Wiley, New York). [7] S. Boyd and L. Vandenberghe, Convex Optimization, (Cambridge Univ. Press). [8] J. Buckley and I. James, "Linear Regression with Censored Data", Biometrika 66, (1979). [9] D. Clayton, "A Model for Association in Bivariate Life Tables and Its Application in Epidemiological Studies of Familial Tendency in Chronic Disease Incidence", Biometrika 65, (1978). [10] E. Çinlar and S. Özekici, "Reliability of Complex Devices in Random Environments", Probability in the Engineering and Informational Sciences 1, (1987). [11] D. R. Cox, "Regression Models and Life-Tables", Journal of the Royal Statistical Society, Series B (Methodological) 34, (1972). [12] D. R. Cox, "Partial Likelihood", Biometrika 62, (1975). [13] D. R. Cox and D. Oakes, Analysis of Survival Data, (Chapman and Hall/CRC). [14] B. Efron, "The Efficiency of Cox's Likelihood Function for Censored Data", JASA 72, (1977). [15] R. Ellis and D.
Gulick, Calculus with Concepts in Calculus, (Cengage Learning, Ohio). [16] V. Godambe, "An Optimum Property of Regular Maximum Likelihood Estimation", Ann. Math. Statist. 31, (1960). [17] G. Heller, "Smoothed Rank Regression With Censored Data," JASA 102, (2007). [18] T. P. Hettmansperger, Statistical Inference Based on Ranks, (Wiley, New York). [19] N. Hjort, I. McKeague and I. van Keilegom, "Extending the Scope of Empirical Likelihood," Ann. Statist. 37, (2009). [20] S. Y. Hwang and I. V. Basawa, "A Review Toward a Unifying Tool", conference paper in S. Lahiri et al., Contemporary Developments in Statistical Theory, pp. 9-28. [21] J. Jacod, "Un théorème de représentation pour les martingales discontinues," Z. Wahrsch. verw. Gebiete 34, (1976). [22] I. James and P. Smith, "Consistency Results for Linear Regression with Censored Data", Ann. Statist. 12, (1984). [23] J. Jurečková, "Asymptotic Linearity of a Rank Statistic in Regression Parameter," Ann. Math. Statist. 40, (1969). [24] J. Jurečková, "Nonparametric Estimate of Regression Coefficients," Ann. Math. Statist. 42, (1971). [25] H. Koul, V. Susarla and J. Van Ryzin, "Regression Analysis with Randomly Right-Censored Data," Ann. Statist. 9, (1981). [26] T. Lai and Z. Ying, "Large Sample Theory of a Modified Buckley-James Estimator for Regression Analysis with Censored Data", Ann. Statist. 19, (1991). [27] E. Lopez, I. van Keilegom and N. Veraverbeke, "Empirical Likelihood for Non-smooth Criterion Functions," Scand. J. Stat. 36, (2009). [28] T. A. Louis, "Nonparametric Analysis of an Accelerated Failure Time Model," Biometrika 68, (1981). [29] T. Merkouris, "Transform Martingale Estimating Functions", Ann. Statist. 35, (2007). [30] R. Miller, "Least Squares Regression with Censored Data", Biometrika 63, (1976). [31] A. Owen, "Empirical Likelihood Ratio Confidence Intervals for a Single Functional," Biometrika 75, (1988). [32] A. Owen, Empirical Likelihood, (CRC Press, Boca Raton, Florida). [33] J.
Qin and J. Lawless, "Empirical Likelihood and General Estimating Equations," Ann. Statist. 22, (1994). [34] Y. Ritov, "Estimation in a Linear Regression Model with Censored Data," Ann. Statist. 18, (1990). [35] Y. Ritov and J. Wellner, "Censoring, Martingales and the Cox Model", Contemp. Math. 80, (1988). [36] R. Serfling, Approximation Theorems of Mathematical Statistics, (Wiley-Interscience, New York). [37] J. Shao, Mathematical Statistics, (Springer, New York). [38] A. Tsiatis, "Estimating Regression Parameters Using Linear Rank Tests for Censored Data," Ann. Statist. 18, (1990). [39] A. Tsiatis, Semiparametric Theory and Missing Data, (Springer, New York). [40] A. van der Vaart, Statistical Estimation in Large Parameter Spaces, (Centrum voor Wiskunde en Informatica, Amsterdam). [41] A. van der Vaart, Asymptotic Statistics, (Cambridge University Press, New York). [42] A. van der Vaart and J. Wellner, Weak Convergence and Empirical Processes, (Springer, New York). [43] J. Vaupel, K. Manton and E. Stallard, "The Impact of Heterogeneity in Individual Frailty on the Dynamics of Mortality," Demography 16, (1979). [44] L. J. Wei and M. H. Gail, "Nonparametric Estimation for a Scale-Change with Censored Observations," J. Amer. Statist. Assoc. 78, (1983). [45] L. Wei, Z. Ying and D. Lin, "Linear Regression Analysis of Censored Survival Data Based on Rank Tests", Biometrika 77, (1990). [46] M. Zhou, "Empirical Likelihood Analysis of the Rank Estimator for the Censored Accelerated Failure Time Model," Biometrika 92, (2005). [47] M. Zhou, "Empirical Likelihood Ratio with Arbitrarily Censored/Truncated Data by EM Algorithm," J. Computational and Graphical Statistics 14, (2005). [48] M. Zhou, Empirical Likelihood Method in Survival Analysis, (Chapman and Hall/CRC).