ABSTRACT

Title of dissertation: EXTENDED ESTIMATING EQUATIONS AND EMPIRICAL LIKELIHOOD
Xuan Yao, Doctor of Philosophy, 2018
Dissertation directed by: Dr. Eric Slud, Department of Mathematics

Classic Estimating Equations (CEE) were first introduced by Godambe in [16] and have been widely used in both parametric and nonparametric settings. However, under some prominent semiparametric models, CEE cannot be used to identify certain low-dimensional parameters. We prove that under regularity conditions, for the Cox (1972) survival-time model, a CEE for the structural parameter does not exist; and under more restrictive conditions, a CEE for the structural parameter in the Accelerated Failure Time (AFT) model does not exist when lifetime is subject to random right censoring with unknown distribution. Motivated by this lack of coverage of CEE's for finite-dimensional parameters in semiparametric problems, we establish a method named Extended Estimating Equations (EEE). The EEE's relax the CEE requirement that the estimating function be a function of the independently identically distributed (i.i.d.) summands alone, and instead allow the estimating function to incorporate ratios of sums of functions depending on two of the i.i.d. arguments. To our knowledge, the broadest class of semiparametric models that can be investigated using EEE is the ϕ-transformation model class that we construct, where ϕ is a given function of covariate, structural parameter and a random error with unknown hazard rate. With different choices of ϕ, the model can represent the general transformation model, the nonlinear location-shift model, models incorporating cumulative integrated functions of times at risk, and others. Inspired by Tsiatis's work in [38], by defining a martingale structure on the residual scale, we are able to prove the asymptotic linearity of the associated EEE, which leads to the asymptotic normality of the structural estimator.
Another perspective from which to view the EEE is as a constraint in the Empirical Likelihood (EL) method. We first show that under the CEE setting, regardless of the continuity of the criterion function, there exists a neighbourhood of the true structural parameter on which there always exists a probability vector that maximizes the EL. The same conclusion can be generalized to the EEE setting with a continuous criterion function, as well as a discontinuous criterion function with the martingale structure of the ϕ-transformation model or the Cox model. A point estimator for the structural parameter can be defined by maximizing the Profile Empirical Likelihood (pEL) associated with the EEE. We show that the pEL estimator is asymptotically normal, with asymptotic variance-covariance matrix identical to that of the Z-estimator obtained by directly solving for the root of the EEE. Finally, we develop algorithms to compute and compare the Z-estimator and pEL estimator associated with the EEE, and determine the minimal sample size for the two estimators to achieve asymptotic normality under three different parametric settings. Simulation shows that a more symmetric covariate usually leads to a smaller threshold sample size, and that the Z-estimator and pEL estimator are close in value and in variance-covariance matrix. We also conclude, by comparing plots of the projections of each function, that in settings where the EEE criterion function is non-smooth, the pEL function tends to be much smoother than the EEE itself.

EXTENDED ESTIMATING EQUATION AND EMPIRICAL LIKELIHOOD

by Xuan Yao

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2018

Advisory Committee:
Dr. Eric Slud, Chair/Advisor
Dr. Shuo Chen
Dr. Benjamin Kedem
Dr. Yan Li, Dean's Representative
Dr.
Tingni Sun

© Copyright by Xuan Yao 2018

Acknowledgments

I would like to express my gratitude to everyone who has been there during my graduate life. Thank you for being part of this unforgettable and fruitful experience. I would like to thank my advisor, Dr. Eric Slud. Thank you for giving me all the challenging yet intriguing questions. He has always been patient with my questions, and has pulled me out of seemingly unsolvable situations with his advice. He has taught me how to view mathematical and statistical questions not as a student, but as a researcher. It has been an honor to learn from him. I would like to thank Dr. Benjamin Kedem, Dr. Tingni Sun, Dr. Shuo Chen and Dr. Yan Li for serving on my dissertation committee. Thank you for reviewing my work thoroughly and for all the valuable comments and suggestions. Thanks to all the other professors in the Mathematics department who have taught me. I would like to thank my parents and my husband for being consistently supportive during the past seven years. Thanks to Brook Qiu, a member who joined my family in 2014, who has been the brightest sunshine in my life since then, and who can smooth any difficulties away with her laughter. I would like to thank my peer and friend Xia Li. Thank you for running towards the finish line together with me, and for pushing me with your diligence and intelligence during the hard times. Thanks to all my other friends.

Table of Contents

Acknowledgements ii
List of Abbreviations v
1 Introduction 1
  1.1 Classic Estimating Equations 2
  1.2 Extended Estimating Equations 5
2 Non-Existence of Classic Estimating Equations 13
  2.1 Non-Existence of Classic Estimating Equations of Cox Model 15
  2.2 Non-Existence of Classic Estimating Equations of Censored AFT 17
  2.3 Some Proofs 22
    2.3.1 Proof of Theorem 2.1.1
28
    2.3.2 Proof of Theorem 2.2.1 30
3 ϕ-Transformation Model 34
  3.1 Introduction 34
  3.2 Extended Estimating Equations and Martingales 37
    3.2.1 Remarks and Sufficient Conditions for Assumptions 40
  3.3 Estimator for Structural Parameter 43
  3.4 Some Proofs 47
    3.4.1 Proof of Proposition 3.2.1 49
    3.4.2 Proof of Lemmas 3.4.2-3.4.5 50
    3.4.3 Proof of Theorem 3.3.2 58
  3.5 List of Notations 64
4 Technical Results I: Owen's Lemmas & Empirical Process 65
  4.1 Assumptions and Notations 66
  4.2 Lemmas Parallel to Owen, and Qin and Lawless 68
  4.3 Some Proofs 71
5 Technical Results II: Zero in the Convex Hull Theorems 82
  5.1 Classic Estimating Equation 84
    5.1.1 Continuous Criterion Function 84
    5.1.2 Discontinuous Criterion Function 90
  5.2 Extended Estimating Equation 93
    5.2.1 Continuous Criterion Function 93
    5.2.2 Discontinuous Criterion Function 99
  5.3 Some Proofs 104
    5.3.1 Proofs Under the CEE Setting 104
    5.3.2 Proofs Under the EEE Setting
115
6 Empirical Likelihood Applied to Extended Estimating Equations 124
  6.1 Empirical Likelihood of Extended Estimating Equation 126
  6.2 Asymptotic Normality Associated with the EEE 127
  6.3 Some Proofs 129
7 Computational Results 142
  7.1 Description of the Algorithm Associated with the EEE 143
    7.1.1 Algorithm for Computing the pEL Estimator 144
    7.1.2 Algorithm for Computing the Z-Estimator 148
  7.2 Local Continuity of EEE and pEL 149
    7.2.1 Plots of Projection of EE and EL 149
    7.2.2 Quantitative Measurement of Local Continuity of l(β) 156
  7.3 Convergence of the Estimator 159
  7.4 Asymptotic Normality of the Z-Estimator and pEL Estimator 159
    7.4.1 Covariates without Normality 160
    7.4.2 Normally Distributed Covariates 163
    7.4.3 Extreme Cases 165
8 Contribution and Future Work 167
  8.1 Original Contribution 167
  8.2 Future Work 170
A Empirical Process Theory 173
Bibliography 177

List of Abbreviations and Notations

AFT  Accelerated Lifetime Model  page 4
CEE  Classic Estimating Equation  page 1
CLT  Central Limit Theorem  page 42
EEE  Extended Estimating Equation  page 5
EL  Empirical Likelihood  page 1
GTM  General Transformation Model  page 4
i.i.d.
Independently Identically Distributed  page 1
LIL  Law of the Iterated Logarithm  page 70
LLN  Law of Large Numbers  page 40
MLE  Maximum Likelihood Estimator  page 1
Ti  Lifetime  page 6
Ci  Right-Censoring Variable  page 6
∆i = I{Ti ≤ Ci}  Non-Censored Indicator  page 6
Vi = min(Ti, Ci)  Event Time  page 6
Zi  Covariate  page 6
𝒵  Support of Z1  page 34
εi  Random Error with Unknown Hazard Rate  page 34
Xi  i.i.d. Observations  page 1
β  p-Dimensional Structural Parameter  page 1
ν  Infinite-Dimensional Nuisance Parameter  page 1
θ = (β, ν)  Parameter in Semiparametric Models  page 1
H  Infinite-Dimensional Nuisance Parameter Space  page 1
m(x, β)  Classic Estimating Function  page 2
mn(x, 𝒳, β)  Extended Estimating Function  page 6
Sn(β)  Summation of m(Xi, β) or mn(Xi, 𝒳, β)  page 2
Conv(β)  Convex hull of mn(Xi, 𝒳, β)  page 83
int(Conv(β))  Interior of Conv(β)  page 86

Chapter 1: Introduction

The method of Classic Estimating Equations (CEE), first introduced by Godambe in [16], is a powerful tool for constructing estimators of the structural parameter in a semiparametric model and has been extensively discussed in the statistical literature. Given a mean-zero estimating function depending on data and the structural parameter alone, the CEE method defines an estimator as the root of the empirical integral of the criterion function. Such an estimator is known as a Z-estimator. To give a formal definition of CEE, let us consider independently identically distributed (i.i.d.) observations

X1, . . . , Xn ~ i.i.d. Pθ,  θ = (β, ν) ∈ R^p × H,  (1.1)

where {Xi}_{i=1}^n are d-dimensional random vectors with support 𝒳, β ∈ R^p is a finite-dimensional structural parameter, and ν is a nuisance parameter in an infinite-dimensional space H such as a function space. Let β0 and ν0 respectively denote the true structural and nuisance parameter values.

1.1 Classic Estimating Equations

In our usage, a CEE sets equal to zero a summation of mean-zero functions of single independent data elements and the structural parameter only; such functions are called estimating functions.
Definition 1.1.1 (Classic Estimating Equation) Let X1, . . . , Xn be i.i.d. observations as in (1.1), and let m(x, β) : 𝒳 × R^p → R^p satisfy

Eθ {m(X1, β)} = 0, for all θ = (β, ν), ν ∈ H,  (1.2)

where β is generally a proper subvector of the whole unknown parameter θ. If there exists a set Uβ0, a neighbourhood of β0, such that β̂n is the unique solution to

Sn(β) = Σ_{i=1}^n m(Xi, β) = 0, for β ∈ Uβ0,  (1.3)

then (1.3) is called an estimating equation for β.

CEE provide consistent and asymptotically normal estimators under regularity conditions, including

Eθ0 {∇β m(X1, β)} is nonsingular for β ∈ Uβ0,  (1.4)

where Uβ0 ⊂ R^p is a neighbourhood of β0. Such conditions can be found in various places in the statistical literature, such as Theorems 5.7 and 5.23 of [41]. A well-known example of an estimator constructed by means of an estimating equation is the Maximum Likelihood Estimator (MLE) when ν is not present. Consider a simple parametric case in which the parameter is θ = β. Then the MLE maximizes Π_{i=1}^n f(Xi; θ), or equivalently Σ_{i=1}^n ln f(Xi; θ), where f(x; θ) is the density function of X1 with respect to Lebesgue measure. If f(x; θ) is differentiable with respect to θ for each fixed value x, then the MLE is a solution to

Σ_{i=1}^n l(θ; Xi) = 0, where l(θ; x) = ∂ ln f(x; θ)/∂θ.  (1.5)

Note that under regularity conditions such as those summarized in Section 3.2 of [42], l(θ; x) has mean zero. Under further regularity conditions guaranteeing the other parts of Definition 1.1.1, (1.5) is a classic estimating equation with the choice m(x, θ) = l(θ; x). The CEE is also related to the Empirical Likelihood (EL) method, as shown by Owen in [31]. Using the criterion function associated with a CEE as a constraint, under a nonparametric setting, Owen established EL ratio confidence intervals for a single functional in [32]. Both the Z-estimator and Owen's theory in [32] require the dimension of the CEE to be equal to that of the structural parameter.
By allowing the former to exceed the latter, Qin and Lawless generalized Owen's conclusion in [33]. Qin and Lawless also constructed a point estimator for the structural parameter by maximizing the EL, and showed that such an estimator is asymptotically normal with a sandwich-form asymptotic variance-covariance matrix. Despite the positive features of CEE, it has limitations in some prominent semiparametric models in survival analysis, such as the Cox model (restricted to the non-time-dependent version in this thesis) and the Accelerated Failure Time (AFT) model. The Cox model, or proportional hazards model, was proposed by Cox in [11]. It assumes that the conditional hazard rate function given covariates is proportional (as a function of time) to the nuisance or "baseline" hazard function, by a factor depending on a linear combination of the covariates. Cox proposed to estimate the structural regression-coefficient parameter by maximizing the partial likelihood in [12], and Andersen and Gill developed the essential martingale-based large-sample distributional properties in [2]. Efron discussed the efficiency of the partial likelihood estimator of the Cox model in [14]. In Section 5.2 of [39], Tsiatis showed that the structural estimator constructed by maximizing the partial likelihood is globally semiparametric efficient. However, the equation system associated with the partial likelihood does not fit the definition of CEE due to the appearance of a quotient of higher-order summations. When the Cox model does not fit a possibly right-censored sample of survival data, an important alternative model is the AFT. The AFT can be considered as a special case of a General Transformation Model (GTM). The latter assumes that the lifetime, transformed via a known monotone function, depends linearly on the covariate with unknown regression coefficient, plus an independent random error with unknown hazard rate function.
When the monotone transforming function is chosen to be the natural logarithm, the GTM becomes the AFT. An equivalent formulation of the AFT assumes that, conditionally given the covariates, the lifetime is proportional to some unknown baseline lifetime independent of the covariates, with proportionality factor equal to an exponential function of a linear combination of the covariates. The AFT has been extensively investigated by Miller [30], Buckley and James [8], Koul et al. [25], Louis [28], Wei and Gail [44], James and Smith [22], Ritov and Wellner [35], Lai and Ying [26], Wei et al. [45], and Ritov [34]. Tsiatis proposed a class of linear rank statistic estimators by constructing a martingale on the residual scale in [38]. He also showed that the "estimating equation" through which the structural parameter is defined is asymptotically linear, and he established the asymptotic normality of the structural estimator using the asymptotic linearity and the martingale central limit theorem. In [34], Ritov showed that the linear rank statistic estimator is efficient. However, like the equation related to the partial likelihood for the Cox model, the "estimating equation" in [38] for the AFT also involves a quotient of higher-order summations and again does not fit the definition of CEE. As shown by Cox and Tsiatis in [13] and [38], the structural estimators constructed under Cox and AFT model assumptions are usually defined through solving equations that are summations of non-i.i.d. summands, and this non-i.i.d. structure violates the usual assumption on CEE. In fact, we prove that under regularity conditions, a CEE does not exist for the Cox model. For the AFT, under more restrictive regularity conditions and with right-censored data, a CEE does not exist either.
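Before moving to the extended setting, the score-equation CEE (1.5) can be made concrete with a small numerical sketch. This is a hypothetical illustration, not part of the thesis: the exponential parametric model, seed and sample size are all assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta0 = 2.0                              # true parameter (assumed for the demo)
x = rng.exponential(scale=1.0 / theta0, size=5000)

def score_sum(theta):
    # S_n(theta) = sum_i l(theta; x_i) with l(theta; x) = 1/theta - x,
    # the score of the exponential density f(x; theta) = theta * exp(-theta * x)
    return np.sum(1.0 / theta - x)

# Newton iteration for the root of S_n; here S_n'(theta) = -n / theta^2
theta = 1.0
for _ in range(50):
    theta -= score_sum(theta) / (-len(x) / theta**2)
```

The iteration converges to the closed-form MLE 1/x̄, illustrating the defining feature of a CEE: each summand l(θ; X_i) depends on a single observation only.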
1.2 Extended Estimating Equations

In order to extend the regime of CEE to cover right-censored semiparametric models, we define the class of Extended Estimating Equations (EEE) by allowing the estimating function to depend not only on single observations but also on quotients of averages, with respect to one index, of functions of the structural parameter and two observations from the sample. Recall that for the Cox model and the AFT, the estimators for β are usually constructed by solving

AFT Model: Σ_{i=1}^n ∆i { Zi − [Σ_{j=1}^n Zj I{ln(Vj) − β^tr Zj ≥ ln(Vi) − β^tr Zi}] / [Σ_{j=1}^n I{ln(Vj) − β^tr Zj ≥ ln(Vi) − β^tr Zi}] } = 0,  (1.6)

Cox Model: Σ_{i=1}^n ∆i { Zi − [Σ_{j=1}^n Zj I{Vj ≥ Vi} e^{β^tr Zj}] / [Σ_{j=1}^n I{Vj ≥ Vi} e^{β^tr Zj}] } = 0,  (1.7)

where the notations in (1.6)-(1.7) are defined on page v. Evidently, (1.6) and (1.7) do not satisfy the definition of classic estimating equations, because of the quotient of two i.i.d. summations within the curly brackets. Inspired by the formulation of the summands in (1.6) and (1.7), let

mn(Xi, 𝒳, β) = Q(Xi, β) { C(Xi) − [Σ_{j=1}^n C(Xj) k(Xi, Xj, β)] / [Σ_{j=1}^n k(Xi, Xj, β)] }.  (1.8)

For both the AFT and Cox models, Xi = (Ti, Ci, Zi, ∆i), and we can choose Q(Xi, β) = ∆i = I{Ti ≤ Ci}, C(Xi) = Zi, and

for the AFT Model: k(Xi, Xj, β) = I{ln(Vj) − β^tr Zj ≥ ln(Vi) − β^tr Zi};
for the Cox Model: k(Xi, Xj, β) = I{Vj ≥ Vi} e^{β^tr Zj}.

Then an extended estimating equation is defined in the following way.

Definition 1.2.1 (Extended Estimating Equations) Let X1, . . . , Xn be random vectors as in (1.1). Let Q(x, β) : R^d × R^p → R, k(x, y, β) : R^d × R^d × R^p → R^+ and C(x) : R^d → R^p be measurable functions. Assume that

E_{β,ν} {mn(X1, 𝒳, β)} = 0, for all θ = (β, ν), ν ∈ H,  (1.9)

where β is generally a proper subvector of the whole unknown parameter θ. If there exists a unique solution β̂n to

Sn(β) = Σ_{i=1}^n mn(Xi, 𝒳, β) = 0, β ∈ Uβ0,  (1.10)

then (1.10) is the extended estimating equation.
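For concreteness, the EEE summand (1.8) with the Cox choices of Q, C and k can be evaluated directly on simulated data. The following is a minimal sketch; the unit baseline hazard, censoring distribution, sample size and seed are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 200, 0.5
Z = rng.normal(size=n)
T = rng.exponential(scale=np.exp(-beta * Z))   # lifetimes with hazard exp(beta*z), unit baseline
C = rng.exponential(scale=2.0, size=n)         # independent right censoring
V, Delta = np.minimum(T, C), (T <= C)

def m_n(i, b):
    # EEE summand (1.8): Q = Delta_i, C(X_j) = Z_j, k = 1{V_j >= V_i} * exp(b * Z_j)
    k = (V >= V[i]) * np.exp(b * Z)
    return Delta[i] * (Z[i] - np.sum(Z * k) / np.sum(k))

S_n = sum(m_n(i, beta) for i in range(n))      # the partial-likelihood score (1.7)
```

Because each summand pools the whole sample through the ratio of sums over j, Sn(β) is not a sum of i.i.d. terms, which is exactly the feature that the EEE definition accommodates.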
In Definition 1.2.1, it is required that the estimating function mn(X1, 𝒳, β0) have mean zero, which is the same assumption as in the classic estimating equation. For the specific examples of the Cox model and the AFT, this assumption will be verified through a martingale property; we discuss this in detail in Chapter 3. In order to ensure the existence of β̂n and its consistency and asymptotic normality, more assumptions need to be made. For example, the quotient term in (1.8) cannot explode to infinity as n goes to infinity, and in order to have asymptotic normality, there should exist a neighbourhood of β0 such that for all β in this neighbourhood, ∂mn(Xi, 𝒳, β)/∂β is nonsingular. We collect this series of assumptions in Chapter 4. In this thesis, we will also be applying CEE's and EEE's in an EL estimation framework. Rather than constructing EL confidence regions, in this thesis EL is primarily an approach to computing estimators, although it could readily be further developed for its more common use in determining confidence regions. We give self-contained definitions and proofs of EL constructions in Chapters 4 and 5 using empirical process theory. In Chapter 6, we discuss the asymptotic normality of the two estimators associated with the EEE, namely the one given by directly solving the EEE and the one given by maximizing the EL, which share identical sandwich-form variance-covariance matrices under regularity conditions. One thing that we would like to point out is that the definition of EEE is different from the term "martingale estimating equation" in literature such as [4] by Bibby et al., [29] by Merkouris, or [20] by Hwang et al., whose primary interest was to construct estimating functions that estimate structural and nuisance parameters simultaneously.
To modify the CEE estimator and confidence regions defined through the EL method to become applicable to right-censored semiparametric models, Hjort, McKeague and van Keilegom proposed a "plug-in" method in [19], i.e., using an empirical estimator of the nuisance parameter. They also treated settings with slower-than-√n rates of convergence, and settings with large numbers of estimating equations compared to the sample size. In [46] and [48], Zhou extended the Wilks-type confidence region of [32] to right-censored data by replacing the unknown survival function of the right-censoring variable with the Kaplan-Meier estimator. However, after replacing the unknown nuisance parameter with its empirical estimator, the estimating functions no longer satisfy the definition of CEE. We develop a broader class of semiparametric models for which the EEE definition holds, namely the ϕ-transformation models with right-censored data. These models can be considered as generalizations of the GTM in the sense of characterizing the relation between the dependent variable and covariates by a known function ϕ, where ϕ depends on the covariates, structural parameter and an error with unknown hazard rate, and ϕ is monotone with respect to the error term. With different choices of ϕ, the ϕ-transformation model class covers a range of semiparametric models, including the AFT, the linear model in [34] and [38], and a nonlinear regression model in [39] that allows location and shape to change according to covariates. The transformation function ϕ can also have a non-analytical form. For example, it can be defined as an integral of a given wear-out rate function of structural parameter and covariate. Such models can be found in [10] and [3]. Following Tsiatis's work in [38], by constructing a martingale structure on the residual, we show that for the ϕ-transformation model, the estimating equation that defines an estimator of the structural parameter alone satisfies the definition of EEE.
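For orientation, here is how two familiar models arise from choices of the transformation, written on the residual scale. This is a hedged sketch: the exact parametrization of ϕ used in Chapter 3 may differ, and sign conventions for β vary in the literature (compare (1.6) with (2.11)); here we follow the convention of (2.11).

```latex
% GTM: a known monotone transform g of the lifetime is linear in the
% covariate, plus an error \varepsilon_i with unknown hazard rate
g(T_i) = -\beta^{\mathrm{tr}} Z_i + \varepsilon_i
% AFT: the special case g = \ln, consistent with
% T = e^{-\beta^{\mathrm{tr}} z} T_0 in (2.11), with \varepsilon_i = \ln T_{0,i}
\ln T_i = -\beta^{\mathrm{tr}} Z_i + \varepsilon_i
```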
We also prove that the EEE associated with the ϕ-transformation model is asymptotically linear, which together with the martingale Central Limit Theorem (CLT) implies the asymptotic normality of the structural estimator. As with the CEE, we can also use the EEE as a constraint in the EL method and then construct a structural estimator by maximizing the profile Empirical Likelihood (pEL). Under the CEE setting, a key element in Owen's, and Qin and Lawless's, work in [32] and [33] is specifying a neighbourhood of the structural parameter in which there exists a unique probability vector that maximizes the EL with probability approaching 1. Therefore we begin, in Chapter 5, by constructing such a neighbourhood for continuous estimating functions under the CEE setting, and then generalize the conclusion to discontinuous estimating functions. Finally, we show that for the EEE, when the estimating function is continuous with respect to the structural parameter, the local uniqueness of solutions to EL maximization is likewise guaranteed with probability approaching 1; when the criterion function is discontinuous, the same conclusion can be drawn under the martingale structure described for the ϕ-transformation model or the Cox model. Note that the martingale assumption is satisfied by all the EEE examples that we know up to now, including the ϕ-transformation model and the Cox model. After proving lemmas parallel to those of Owen, and Qin and Lawless, in [32] and [33], using empirical process theory and some classic examples of Donsker and Glivenko-Cantelli classes listed in [41], we are able to establish the asymptotic normality of the structural point estimator, whose asymptotic variance-covariance matrix is identical to the sandwich-form asymptotic variance-covariance matrix of the corresponding Z-estimator. Finally, we validate the EEE theory by simulation under the AFT model assumption with R (3.4.1).
Since the criterion function corresponding to the EEE of the AFT is discontinuous due to the appearance of the indicator function, the EEE may not have a root. Therefore we instead define the Z-estimator as the value that minimizes the Euclidean norm of the EEE. To calculate the maximum empirical likelihood estimator, we first construct the Lagrange multiplier in the maximization problem as a function of the structural parameter, then calculate the structural parameter by maximizing the pEL. Despite the lack of continuity of the EEE under the AFT model assumption, we prove that a unique solution of the EL maximization exists with probability approaching 1. The intuition is that the pEL function, or equivalently the summation of the negative logarithm of the pEL, appears very smooth for large n. We support this conjecture by plotting the projection of the pEL in randomly generated unit directions. From the pictures, we can see that for a "moderate" sample size depending on the censoring rate and the skewness of the covariates, the plots are always very smooth and almost parabolic around the location of the maximum pEL estimator. On the other hand, plots of the projections of the EEE have many jumps, even around the true structural parameter value. We also develop a quantitative way to compare the continuity pattern around the pEL estimator and around the true structural parameter, and we find that the pattern around the pEL estimator is very similar to that around the true parameter value. From simulation, we can also see that for a moderate sample size, the EEE estimator and the pEL estimator are very close, as measured by the L1 distance between the two. The difference between their variance-covariance matrices is also very small, as evidenced by the small magnitudes of its eigenvalues. Heuristically, this is because the Lagrange multiplier associated with the pEL has a very small magnitude, which makes the constraint function for the pEL "almost the same" as the EEE.
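The Lagrange-multiplier step just described can be illustrated in the simplest scalar CEE case. This is a hypothetical sketch, not the thesis algorithm: it uses plain Newton iteration, whereas a robust implementation would safeguard the steps so that 1 + λ m_i stays positive throughout.

```python
import numpy as np

def el_log_ratio(m):
    """Profile empirical log-likelihood ratio for the constraint sum_i p_i m_i = 0.

    Owen-style dual: p_i = 1 / (n (1 + lam * m_i)), where lam solves
    sum_i m_i / (1 + lam * m_i) = 0 (found here by plain Newton iteration).
    """
    n = len(m)
    lam = 0.0
    for _ in range(100):
        d = 1.0 + lam * m
        lam -= np.sum(m / d) / (-np.sum(m**2 / d**2))
    p = 1.0 / (n * (1.0 + lam * m))
    return np.sum(np.log(n * p)), p

rng = np.random.default_rng(2)
m = rng.normal(size=100)          # stand-in for criterion values m(X_i, beta)
llr, p = el_log_ratio(m)          # llr <= 0, maximized (at 0) when the sample mean is 0
```

Profiling over β then amounts to repeating this multiplier computation at each candidate β and maximizing the resulting log-EL.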
Finally, we check the asymptotic normality of the pEL estimator under three different parametric settings with right-censored data: non-normally distributed, normally distributed, and severely skewed covariates. We find that it takes a larger sample size for the pEL estimator to reach normality if the corresponding covariate is not normally distributed, and the sample size needs to be even larger as the skewness of the covariate grows.

Chapter 2: Non-Existence of Classic Estimating Equations

The Cox model and the Accelerated Failure Time (AFT) model are two semiparametric models that have been extensively used in survival analysis, especially when data may be subject to various types of censoring. The usual ways of constructing estimators for the structural parameters of the Cox and AFT models involve solving equations, as mentioned in [13], [38], and [41]. However, as far as we see from the literature, the equations through which an estimator for the structural parameter is constructed do not satisfy the definition of Classic Estimating Equations (CEE). Therefore a natural question is whether a CEE method exists for these semiparametric models. Despite the many advantages of the CEE discussed on page 1, under some circumstances a CEE for a subvector β of parameters may not exist. In this chapter, we prove the nonexistence of the classic EE under the Cox model and the randomly right-censored AFT. More specifically, let us consider a statistical model {Pθ, θ ∈ Θ}, where θ = (β, λ) consists of a structural parameter β ∈ R^p and an infinite-dimensional nuisance parameter λ ∈ H, where

H ≡ { λ(t) : λ(t) > 0 a.e. in t, ∫_0^∞ λ(t) dt = ∞ }.  (2.1)

Let β0 and λ0 denote the true parameter values.
The question this chapter aims to answer in some special cases is whether there exists a CEE that depends only on the data and the structural parameter, i.e., we would like to know if there exists a function m(X, β) : R × R^p → R^p such that for any λ0 and β0,

E_{β0,λ0} {m(X, β0)} = 0,  (2.2)

and

∇β E_{β0,λ0} {m(X, β)} = E_{β0,λ0} {∇β m(X, β)} is nonsingular for β ∈ Uβ0,  (2.3)

under assumptions allowing the passage of ∇β inside E{·}, where Uβ0 is a bounded domain in R^p that contains a neighbourhood of β0. We show that an estimating function m(X, β) satisfying (2.2) and (2.3) and the following regularity conditions does not exist for the Cox model in Section 2.1, and for the censored AFT in Section 2.2.

Assumptions

(A.1) Z is supported on a bounded set 𝒵 ⊂ R^p.

(A.2) ∇β ∫ m(t, z, β) p(z) dz = ∫ ∇β m(t, z, β) p(z) dz, for all β ∈ Uβ0 and t ∈ R^+.

(A.3) For all h ∈ H and sufficiently small ε,

∫∫ m(t, z, β) e^{β^tr z} h(t) e^{−e^{β^tr z}(1−ε)H(t)} p(z) dz dt < ∞.  (2.4)

(A.4) P{T < C} > δ > 0.

(A.5) E_{β,λ} {‖m(T, Z, β)‖} < ∞, for any β ∈ Uβ0 and λ ∈ H.

Remark 2.0.1 Given assumption (A.5), assumption (A.3) is satisfied for all h ∈ H. This is because for any h ∈ H and ε ∈ (0, 1), since 1 − ε > 0, ∫(1 − ε)h(t) dt = (1 − ε) ∫ h(t) dt = ∞, and therefore hε(t) = (1 − ε)h(t) ∈ H. Then

E_{β,hε} ‖m(T, Z, β)‖ = ∫∫ ‖m(t, z, β)‖ e^{β^tr z} hε(t) e^{−e^{β^tr z} Hε(t)} p(z) dz dt
= (1 − ε) ∫∫ ‖m(t, z, β)‖ e^{β^tr z} h(t) e^{−e^{β^tr z}(1−ε)H(t)} p(z) dz dt.  (2.5)

Under assumption (A.5), the left-hand side of (2.5) is bounded, and therefore so is the right-hand side, which is what (A.3) states.

2.1 Non-Existence of Classic Estimating Equations of Cox Model

As proposed in [11], the Cox model assumes that the conditional hazard rate of the lifetime T given the covariate Z is proportional to an unknown baseline hazard rate function λ(t), i.e.,

λ_{T|Z}(t|z) = e^{β^tr z} λ(t),  (2.6)

where β ∈ R^p is the regression coefficient to be estimated, and Z is the covariate with density function pZ(z).
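Model (2.6) can be simulated by inverting the conditional cumulative hazard: since S_{T|Z}(t|z) = exp{−e^{β^tr z} Λ(t)}, the transform Λ(T) e^{β^tr Z} is a unit exponential. The following sketch uses an illustrative Weibull baseline Λ(t) = t^a; all numeric choices are assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta, a = 100_000, 0.7, 1.5          # a: Weibull baseline, Lambda(t) = t^a
Z = rng.uniform(-1.0, 1.0, size=n)

# Lambda(T) = E * exp(-beta * Z) with E ~ Exp(1), so T = Lambda^{-1}(E e^{-beta Z})
E = rng.exponential(size=n)
T = (E * np.exp(-beta * Z)) ** (1.0 / a)

# sanity check: Lambda(T) * exp(beta * Z) should again be unit exponential
U = T**a * np.exp(beta * Z)
```

The same inversion works for any baseline with an invertible cumulative hazard, which is what makes the nuisance parameter λ genuinely infinite-dimensional in the nonexistence argument.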
The Cox model (2.6) is semiparametric, with parameter

θ = (β, λ) ∈ Θ = Uβ0 × H,  (2.7)

where Uβ0 is an open and bounded subset of R^p that contains the true parameter value β0, and H is defined in (2.1). Unlike the Kaplan-Meier estimator, which primarily constructs an estimator for the nuisance parameter, the Cox model (2.6), through the incorporation of the regression-like factor e^{β^tr z}, gives a way of estimating the structural parameter under a semiparametric assumption. Cox introduced the notion of partial likelihood in [12], and β̂n, the maximum partial likelihood estimator of β0, is defined as the solution to

Σ_{i=1}^n ∆i { Zi − [Σ_{j=1}^n Zj e^{β^tr Zj} I{Vj ≥ Vi}] / [Σ_{j=1}^n e^{β^tr Zj} I{Vj ≥ Vi}] } = 0.  (2.8)

Andersen and Gill derived the large-sample theoretical properties of β̂n in [2]. It can be shown that β̂n constructed by solving equation (2.8) is a semiparametric efficient estimator; details of semiparametric efficiency can be found in [39]. However, equation (2.8) does not satisfy the definition of classic EE because the summands are not independently identically distributed (i.i.d.). In fact, in this section we show that under regularity conditions, a CEE does not exist for the Cox model. Under the Cox model assumption, the survival function of T given Z is

S_{T|Z}(t) = exp{ −e^{β^tr z} ∫_0^t λ(s) ds },

and the density function of T is

f(t) = ∫ e^{β^tr z} e^{−e^{β^tr z} Λ(t)} λ(t) pZ(z) dz,  (2.9)

where Λ(t) = ∫_0^t λ(s) ds for t ≥ 0. We would like to see whether there can exist a function m(t, z, β) : R^+ × 𝒵 × Uβ0 → R^p of lifetime, covariate and structural parameter alone which satisfies, for all θ ∈ Θ and all λ ∈ H,

E_{β,λ} {m(T, Z, β)} = ∫∫ m(t, z, β) e^{β^tr z} e^{−e^{β^tr z} Λ(t)} λ(t) pZ(z) dz dt ≡ 0.
(2.10)

By the dominated convergence theorem applied to difference quotients with respect to ε, for each h ∈ H, with g ranging freely over functions bounded by 1, we have the following theorem.

Theorem 2.1.1 For the Cox model given in (2.6), there does not exist a function m(t, z, β) of the data X = (T, Z) supported on R^+ × 𝒵 and the structural parameter β ∈ Uβ0 alone which simultaneously satisfies assumptions (A.1)-(A.5) and equations (2.2) and (2.3).

The proof of this theorem can be found in Section 2.3. In the next section, we discuss the non-existence of CEE for the AFT model.

2.2 Non-Existence of Classic Estimating Equations of Censored AFT

The AFT model assumes that conditionally given the covariate Z, the lifetime T is proportional to some baseline lifetime T0, i.e.,

T = e^{−β^tr z} T0,  (2.11)

where T0 is a lifetime with unknown hazard rate function λ(t), β ∈ R^p is the regression coefficient to be estimated, and Z is the covariate with density function pZ(z). Assume that Z is supported on a bounded subset 𝒵 ⊂ R^p; hence (2.11) forms a semiparametric model with parameter

θ = (β, λ) ∈ Θ = Uβ0 × H,  (2.12)

where Uβ0 is an open and bounded subset of R^p that contains the true parameter value β0, and

H ≡ { λ(t) : λ(t) ≥ 0 for all t, ∫_0^∞ λ(t) dt = ∞ }  (2.13)

is an infinite-dimensional space. The survival function of T given Z is given by

S_{T|Z}(t) = exp{ −Λ(t e^{β^tr z}) },

and the density function of T is given by

f(t) = ∫ e^{β^tr z − Λ(t e^{β^tr z})} λ(t e^{β^tr z}) pZ(z) dz.  (2.14)

There are two different constructions of the regression parameter β depending on whether β has an intercept term.

Case 1: β does not have an intercept term. In this case, λ is unrestricted. We do not know any existing CEE based on β and the data alone. There probably is no such estimating equation, even though we have not proved this.

Case 2: β has an intercept term.
Without loss of generality, assume that the expected value of T₀ is 1, that is,

E{T₀} = E{ e^{β^tr Z} T | Z } = ∫ t λ(t) e^{−Λ(t)} dt ≡ 1.  (2.15)

Therefore, for any p(z)dz-integrable functions a(z) : R^p → R^p and b(z) : R^p → R^p such that E{a(Z)} = E{b(Z)} ≠ 0,

m(T, Z, β) = T e^{β^tr Z} a(Z) − b(Z)  (2.16)

has mean zero and is an estimating function if there exists U_{β₀} such that ∇_β m(T, Z, β) is nonsingular for all β ∈ U_{β₀}. For example, a(z) can be a linear function of z, i.e., a(z) = Az, where A = E{b(Z)} · E{ T e^{β^tr Z} Z }^{−1} is a p × p matrix.

The discussion in the previous paragraphs shows that we reach different conclusions about the non-existence of CEE when the lifetime T is always observable. Next, we introduce the concept of censoring, a commonly encountered situation in practice. When lifetimes are subject to right censoring, instead of observing T and Z, we observe X = (V, ∆, Z), where

V = min(T, C),  ∆ = I{T ≤ C},  (2.17)

and C is the right censoring variable with hazard function Λ_C and hazard rate λ_C. Koul, Susarla and Van Ryzin proposed a classic estimating equation for β in [25] under the assumptions

(K1): C is independent of (T, Z);

(K2): The survival function of C is known and is denoted by S_C(c).

Actually, assumption (K2) is seldom reasonable as a modeling assumption, unless the function S_C is estimated from another source or censoring is purely "administrative", i.e., occurs when the study observation period ends and the pattern of times of entry into the survival study does not depend on covariates or survival time. When the distribution function of C is unknown, Koul, Susarla and Van Ryzin proposed to use the Kaplan-Meier estimator instead. However, the estimating function used to construct the estimator for the structural parameter then does not satisfy the definition of CEE, due to the presence of the Kaplan-Meier estimator, because the summands in it are no longer independent.
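To illustrate Case 2 numerically, the following sketch (not from the dissertation; all data and function choices are illustrative) solves the empirical version of the moment estimating equation (2.16) in the simplest uncensored setting, with p = 1 and a(z) = b(z) = 1, so the equation reduces to (1/n) ∑_i T_i e^{β Z_i} − 1 = 0. Data are simulated from the AFT model T = e^{−β₀ Z} T₀ with E{T₀} = 1.

```python
# Illustrative sketch (not from the dissertation): solving the Case 2
# moment estimating equation (2.16) with a(z) = b(z) = 1 and p = 1, so
# that (1/n) * sum_i T_i * exp(beta * Z_i) - 1 = 0.  Data are simulated
# from the uncensored AFT model T = exp(-beta0 * Z) * T0 with E{T0} = 1.
import math
import random

random.seed(0)
beta0, n = 1.0, 4000
Z = [random.uniform(0.5, 1.5) for _ in range(n)]
T = [math.exp(-beta0 * z) * random.expovariate(1.0) for z in Z]  # E{T0} = 1

def m_bar(beta):
    """Empirical mean of m(T, Z, beta) = T * exp(beta * Z) - 1."""
    return sum(t * math.exp(beta * z) for t, z in zip(T, Z)) / n - 1.0

# m_bar is strictly increasing in beta (all T_i, Z_i > 0), so the root
# can be bracketed and found by bisection.
lo, hi = -5.0, 5.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if m_bar(mid) < 0 else (lo, mid)
beta_hat = 0.5 * (lo + hi)
```

Since T₀ is independent of Z, E{T e^{β Z}} = E{e^{(β−β₀)Z}}, which equals 1 exactly at β = β₀ when Z > 0, so the root is consistent for β₀; with n = 4000 the estimate lands close to β₀ = 1.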
Similarly, neither the general "plug-in" method discussed by Hjort, McKeague and van Keilegom in [19], nor the method proposed by Zhou in Chapter 7 of [48], uses estimating functions that satisfy the definition of CEE given in Chapter 1. In fact, in this section we show that under regularity conditions, if the expectation of the estimating function is identically zero when the lifetime is arbitrarily right censored, for all possible model parameters, then a CEE does not exist. Let C be a right censoring variable that is independent of (T, Z), and let

m(t, z, β) = m₁(t, z, β) when ∆ = 1;  m₀(t, z, β) when ∆ = 0.  (2.18)

We show that when m₀(t, z, β) ≡ 0 for all t ∈ R⁺, z ∈ Z and β ∈ U_{β₀}, then a non-trivial CEE does not exist. Note that under most right-censoring CEE formulations, the equations through which an estimator for the structural parameter is defined are usually of the form ∑_{i=1}^n m_n(X_i, X, β), where X_i = (V_i, Z_i, ∆_i) and

m_n(X_i, X, β) = ∆_i · m*_n(X_i, X, β),  (2.19)

and the assumption m₀(t, z, β) ≡ 0 for all t ∈ R⁺, z ∈ Z is satisfied by (2.19). The estimating function m(T, Z, β) must have mean zero, which implies that for all hazard functions Λ_C(c) = ∫_0^c λ_C(s) ds with λ_C ∈ H, λ ∈ H and β ∈ U_{β₀},

0 ≡ ∫∫∫_{0<t<c} m₁(t, z, β) e^{β^tr z − Λ(t e^{β^tr z}) − Λ_C(c)} λ_C(c) λ(t e^{β^tr z}) p(z) dt dz dc
  + ∫∫∫_{0<c<t} m₀(c, z, β) e^{β^tr z − Λ(t e^{β^tr z}) − Λ_C(c)} λ_C(c) λ(t e^{β^tr z}) p(z) dc dz dt  (2.20)
 = ∫∫ m₁(t, z, β) λ(t e^{β^tr z}) e^{β^tr z − Λ(t e^{β^tr z}) − Λ_C(t)} p(z) dt dz
  + ∫∫ m₀(c, z, β) λ_C(c) e^{−Λ(c e^{β^tr z}) − Λ_C(c)} p(z) dc dz.

Now let us present the main theorem of this section.

Theorem 2.2.1 Under assumptions (A.1), (A.2), (A.4), and (A.5), for the AFT model given in (2.11) with arbitrarily right censored data (2.17),

(a) A CEE does not exist when assuming

m₀(c, z, β) ≡ 0 for all c ∈ R⁺, z ∈ Z and β ∈ U_{β₀}.  (2.21)

(b) A CEE does not exist when assuming

m₁(c, z, β) ≡ 0 for all c ∈ R⁺, z ∈ Z and β ∈ U_{β₀}.  (2.22)

The conclusion described in (a) is the main result we would like to present.
It can be shown by proving

β₀^tr E{ ∇_β m₁(T, Z, β₀) } = 0,

which violates the non-singularity assumption on the gradient described in (2.3).

2.3 Some Proofs

In this section, for simplicity, the ranges of integration for u and t are (0, ∞) unless otherwise specified. Lemma 2.3.1 gives an important identity that will be used to derive the non-existence of CEE under the Cox model assumption, and Lemma 2.3.2 shows how the non-existence conclusion can be drawn without imposing smoothness assumptions on m(·, z, β).

Lemma 2.3.1 Under assumptions (A.1)-(A.6), for a.e. t ∈ R⁺ and all β ∈ U_{β₀},

∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz = ∫_t^∞ [ ∫ m(u, z, β) e^{2β^tr z − e^{β^tr z} H(u)} h(u) p(z) dz ] du.  (2.23)

Proof To start, let us consider the following construction of a hazard function. Let h(t) ∈ H be a candidate baseline hazard function, and let H(t) = ∫_0^t h(s) ds. Suppose that the baseline hazard rate in (2.6) is of the form λ(t) = h(t) e^{ε g(t)}, where ε is a positive constant and g(t) ∈ G is a continuous and bounded function, i.e.,

G ≡ { g(t) : g(t) is continuous, and |g(t)| ≤ M_g < ∞ for all t ∈ R⁺ }.  (2.24)

The boundedness assumption on g guarantees that λ(t) integrates to ∞ and consequently is a hazard rate function. This is because if |g(t)| ≤ M for all t ∈ R⁺, then for any ε > 0, ε g(t) ≥ −εM, and therefore for any t ∈ R⁺ and h ∈ H,

∫_0^t h(x) e^{ε g(x)} dx ≥ e^{−εM} ∫_0^t h(x) dx.

Since ∫_0^∞ h(t) dt = ∞, we know that ∫_0^∞ h(t) e^{ε g(t)} dt = ∞. Let us consider

λ(t) = h(t) e^{ε g(t)}, where g ∈ G.  (2.25)

By (2.10), for all θ ∈ Θ and g ∈ G,

E_θ{m(T, Z, β)} = ∫∫ m(t, z, β) e^{β^tr z − e^{β^tr z} ∫_0^t h(s) e^{ε g(s)} ds} h(t) e^{ε g(t)} p(z) dt dz ≡ 0.  (2.26)

Consider the double integral in (2.26) as a function of ε; then for any fixed h ∈ H, g ∈ G, β ∈ U_{β₀} and ε ≥ 0,

r(ε, g, h) = ∫∫ m(t, z, β) e^{β^tr z − e^{β^tr z} ∫_0^t h(s) e^{ε g(s)} ds} h(t) e^{ε g(t)} p(z) dt dz = 0.  (2.27)

For fixed g ∈ G and h ∈ H, r(ε, g, h) = 0 for any ε ≥ 0. Therefore ∇₁ r(0, g, h) = 0.
By (2.26)-(2.27) applied with ε > 0 and ε = 0,

∇₁ r(0, g, h) = lim_{ε→0} ∫∫ m(t, z, β) e^{β^tr z} h(t) p(z) × (1/ε) { e^{−e^{β^tr z} ∫_0^t h(s) e^{ε g(s)} ds} e^{ε g(t)} − e^{−e^{β^tr z} H(t)} } dt dz = 0.  (2.28)

To apply the Dominated Convergence Theorem and pass the limit into the double integral of (2.28), we first rewrite the difference quotient in the second line as

e^{−e^{β^tr z} H(t)} { e^{ε g(t)} · (1/ε)[ e^{−e^{β^tr z} ∫_0^t h(s)(e^{ε g(s)} − 1) ds} − 1 ] + (1/ε)[ e^{ε g(t)} − 1 ] }.  (2.29)

Note that for any fixed g and t ∈ R⁺, (e^{ε g(t)} − 1)/ε = O(g(t)) is uniformly bounded. As for the first term in (2.29), it is equal to

e^{ε g(t)} · (1/ε)[ e^{−e^{β^tr z} ε ∫_0^t h(s)(g(s) + o(1)) ds} − 1 ].  (2.30)

Since g(·) is bounded, write ∫_0^t h(s)(g(s) + o(1)) ds = θ(t) H(t), where |θ(t)| ≤ M₀ ∈ R⁺ for all t ∈ R⁺. Therefore the integrand in (2.28) is bounded by

‖m(t, z, β)‖ h(t) e^{β^tr z − e^{β^tr z} H(t)} {1 + ε θ(t)} p(z),

and under assumption (A.3), ∫∫ ‖m(t, z, β)‖ h(t) e^{β^tr z − e^{β^tr z} H(t)} {1 + ε θ(t)} p(z) dz dt < ∞. Therefore, by the Dominated Convergence Theorem, the limit in (2.28) can be passed into the integral. Hence

∫∫ ∂/∂ε { m(t, z, β) e^{β^tr z} h(t) p(z) e^{−e^{β^tr z} ∫_0^t h(s) e^{ε g(s)} ds} e^{ε g(t)} } |_{ε=0} dt dz = 0,  (2.31)

for all θ ∈ Θ and g ∈ G, which implies that

∫∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} h(t) p(z) { g(t) − e^{β^tr z} ∫_0^t h(s) g(s) ds } dt dz ≡ 0,  (2.32)

for all θ ∈ Θ and g ∈ G. We can rewrite (2.32) by the Fubini Theorem and get, for all θ ∈ Θ and g ∈ G,

0 ≡ ∫ g(t) h(t) { ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz − ∫_t^∞ ∫ m(u, z, β) e^{2β^tr z − e^{β^tr z} H(u)} h(u) p(z) dz du } dt.  (2.33)

Equation (2.33) implies that for a.e. t ∈ R⁺ and all β ∈ U_{β₀},

∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz = ∫_t^∞ [ ∫ m(u, z, β) e^{2β^tr z − e^{β^tr z} H(u)} h(u) p(z) dz ] du,  (2.34)

which is the assertion of the Lemma. ∎

Lemma 2.3.2 Under assumptions (A.1)-(A.5),

∇_t ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz ≡ 0, for all s, t ∈ R⁺ and β ∈ U_{β₀}.  (2.35)

Proof For fixed β ∈ U_{β₀} and h ∈ H, denote

f(t) = ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz.
Then by (2.34), f(t) is differentiable for all t ∈ R⁺ and h ∈ H. Differentiating both sides of (2.34), for all β ∈ U_{β₀} and h ∈ H,

f′(t) = − ∫ m(t, z, β) e^{2β^tr z − e^{β^tr z} H(t)} h(t) p(z) dz.  (2.36)

By the definition of the left hand side of (2.36),

f′(t) = lim_{δ→0} (1/δ) { ∫ m(t+δ, z, β) e^{β^tr z − e^{β^tr z} H(t+δ)} p(z) dz − ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz }.  (2.37)

Rewrite the difference quotient in (2.37) as

(1/δ) ∫ m(t, z, β) { e^{β^tr z − e^{β^tr z} H(t+δ)} − e^{β^tr z − e^{β^tr z} H(t)} } p(z) dz
+ (1/δ) ∫ e^{β^tr z − e^{β^tr z} H(t+δ)} { m(t+δ, z, β) − m(t, z, β) } p(z) dz.  (2.38)

Next, we discuss the two lines of (2.38). The uniform boundedness of h leads to sup_t |H(t+δ) − H(t)| ≤ M_ε < ∞ for |δ| < ε < ∞. So by the Dominated Convergence Theorem, the first term of (2.38) converges to

∫ m(t, z, β) ∇_t { e^{β^tr z − e^{β^tr z} H(t)} } p(z) dz = − ∫ m(t, z, β) e^{2β^tr z − e^{β^tr z} H(t)} h(t) p(z) dz.  (2.39)

From (2.36), (2.38), and (2.39), we know that as δ → 0, the second line of (2.38) converges to zero for all β ∈ U_{β₀}, i.e.,

lim_{δ→0} (1/δ) ∫ { m(t+δ, z, β) − m(t, z, β) } e^{β^tr z − e^{β^tr z} H(t+δ)} p(z) dz ≡ 0.  (2.40)

Write the left hand side of (2.40) as the sum of

(1/δ) ∫ { m(t+δ, z, β) − m(t, z, β) } e^{β^tr z − e^{β^tr z} H(t)} p(z) dz  (2.40.a)

and

(1/δ) ∫ { m(t+δ, z, β) − m(t, z, β) } { e^{β^tr z − e^{β^tr z} H(t+δ)} − e^{β^tr z − e^{β^tr z} H(t)} } p(z) dz.  (2.40.b)

As δ → 0, (2.40.a) converges to ∇_t { ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz } |_{s=H(t)}, which is the left hand side of (2.35) when H(t) is regarded as a free variable s ∈ R⁺.
Therefore it suffices to prove that as δ → 0, (2.40.b) converges to 0 for all β ∈ U_{β₀} and h ∈ H*, where H* is a dense class of functions in L¹ defined as

H* = { h : h ∈ H; for each h there exists ε > 0 such that h(t) = 0 for all t ∈ (0, ε); h bounded above and below for all t ≥ ε; h′ bounded above for all t ≥ ε }.  (2.41)

Then consider t > δ and write (2.40.b) as

(1/δ) ∫ e^{β^tr z} p(z) { m(t+δ, z, β) e^{−e^{β^tr z} H(t+δ)} − m(t, z, β) e^{−e^{β^tr z} H(t)} } dz
− (1/δ) ∫ e^{β^tr z} p(z) m(t, z, β) { e^{−e^{β^tr z} H(t+δ)} − e^{−e^{β^tr z} H(t)} } dz  (2.42)
− (1/δ) ∫ e^{β^tr z} p(z) e^{−e^{β^tr z} H(t)} { m(t+δ, z, β) − m(t, z, β) } dz.

Then, using the identity (2.34) proved in Lemma 2.3.1 and the additivity property ∫_t^∞ ℓ(s) ds − ∫_{t+δ}^∞ ℓ(s) ds = ∫_t^{t+δ} ℓ(s) ds for any integrable function ℓ(·), we rewrite (2.42) as the sum of

C₁ = ∫_t^{t+δ} ∫ m(u, z, β) e^{2β^tr z} [ h(u−δ) e^{−e^{β^tr z} H(u−δ)} − h(u) e^{−e^{β^tr z} H(u)} ] p(z) dz du  (2.43)

and

C₂ = ∫_{t+δ}^∞ ∫ m(u, z, β) e^{2β^tr z} [ 2h(u) e^{−e^{β^tr z} H(u)} − h(u−δ) e^{−e^{β^tr z} H(u−δ)} − h(u+δ) e^{−e^{β^tr z} H(u+δ)} ] p(z) dz du.  (2.44)

First, by the Mean Value Theorem, the square-bracketed integrand in C₁ is O(δ), bounded by

M₂ δ [ |h′(u−θδ)| e^{−e^{β^tr z} H(u−θδ)} + h²(u−θδ) e^{−e^{β^tr z} H(u−θδ)} ],  (2.45)

where M₂ is a uniform upper bound for e^{β^tr z}, h ∈ H*, and θ ∈ (0, 1). Let k be a function and C₁* a constant such that for all δ ∈ (0, ε), C₁* k(u) is bounded below by

a constant M₁ ∈ R⁺ when u < ε;  sup_{θ∈[−1,1]} ( |h′(u−θδ)| + h²(u−θδ) ) when u > ε.  (2.46)

Let K(u) be such that K(u) ≤ H(u−ε) for all u ≥ ε. Then (2.45) is bounded by

M₂ δ C₁* k(u) e^{−e^{β^tr z} K(u)}, for all u > ε, |θ| ≤ 1, 0 < δ < ε.

So the fact that the outer integral in C₁ is taken over the shrinking interval (t, t+δ) allows us to conclude that C₁ is bounded by δ times the dz du-integrable function

‖m(u, z, β)‖ e^{2β^tr z} M₂ C₁* k(u) e^{−e^{β^tr z} K(u)} p(z),

hence C₁ = o(δ). Now we move on to C₂.
Using the Mean Value Theorem again, the square-bracketed integrand in C₂ is bounded by

δ² | ∂²/∂x² ( h(x) e^{−e^{β^tr z} H(x)} ) |_{x=u+θ*δ} |,  (2.47)

where h ∈ H* and |θ*| ≤ 1 but cannot be controlled further. Let k be a function and C₂* a constant such that C₂* k(u) is bounded below by

a constant M₂ ∈ R⁺ when u < ε;  sup_{θ∈[−1,1]} ( |h″(u+θδ)|, h(u+θδ)|h′(u+θδ)|, h³(u+θδ) ) when u > ε,  (2.48)

for all 0 < δ < ε. Let K(u) be such that K(u) ≤ H(u−ε) for all u ≥ ε. Then (2.47) is controlled by

δ² C₂* k(u) e^{−e^{β^tr z} K(u)}, for all u > ε, |θ*| ≤ 1, 0 < δ < ε.

Thus, even though the range of integration is now not small, C₂ can be shown (with h ∈ H*) to be O(δ²). ∎

Lemma 2.3.3 (Weierstrass) Suppose f is a continuous real-valued function defined on a real interval [a, b]. For every ε > 0, there exists a polynomial p_f(x) such that for all x in [a, b], |f(x) − p_f(x)| < ε, or equivalently, sup_{x∈[a,b]} |f(x) − p_f(x)| < ε.

2.3.1 Proof of Theorem 2.1.1

Multiply both sides of (2.35) by a(t) ∈ A, where

A ≡ { a(t) : R → R; a is compactly supported and continuously differentiable on R⁺ },  (2.49)

then integrate with respect to t, yielding for all a ∈ A, β ∈ U_{β₀} and s ∈ R⁺,

∫ a(t) ∂/∂t { ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz } dt ≡ 0.  (2.50)

Integrate (2.50) by parts, yielding for all a ∈ A, β ∈ U_{β₀} and s ∈ R⁺,

∫ a′(t) ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz dt ≡ 0.  (2.51)

Now multiply both sides of (2.34) by a′(t) and integrate with respect to t. Together with equation (2.51), we conclude that for all a ∈ A, β ∈ U_{β₀} and h ∈ H,

0 ≡ ∫ a′(t) { ∫_t^∞ ∫ m(u, z, β) e^{2β^tr z − e^{β^tr z} H(u)} h(u) p(z) dz du } dt.  (2.52)

Integrating (2.52) by parts, we then know that for all a ∈ A, β ∈ U_{β₀} and h ∈ H,

0 ≡ ∫_0^∞ a(t) ∫ m(t, z, β) e^{2β^tr z − e^{β^tr z} H(t)} h(t) p(z) dz dt.  (2.53)

Therefore, for all β ∈ U_{β₀} and s, t ∈ R⁺,

0 ≡ ∫ m(t, z, β) e^{2β^tr z − e^{β^tr z} s} p(z) dz,  (2.54)

where we replaced H(t) by s ∈ R⁺ since H is free to be any function in H.
Integrating both sides of (2.54) with respect to s from x to ∞ implies

0 ≡ ∫ m(t, z, β) e^{β^tr z − e^{β^tr z} x} p(z) dz.  (2.55)

After integrating again with respect to x on (s, ∞), or (separately) by differentiating equation (2.55) under the integral sign arbitrarily many times, the same formula (2.55) holds with the factor e^{β^tr z} replaced by e^{kβ^tr z} for any non-negative integer k. Let b(·) be any element of B, the continuous functions on the real line, and apply the Weierstrass theorem stated in Lemma 2.3.3 to approximate b by polynomials uniformly on the compact set of possible β^tr z values, yielding

0 ≡ ∫ m(t, z, β) b(β^tr z) p(z) dz, for all t ∈ R⁺, β ∈ U_{β₀}, b ∈ B.  (2.56)

Now let us demonstrate the singularity of the matrix E_{β₀,λ₀}{∇_β m(t, z, β₀)}, which is a violation of (2.3) with β = β₀ and thus proves the non-existence of the CEE. Differentiate both sides of (2.55) with respect to β^tr, i.e., take the Jacobian of the function m(t, z, ·), yielding for all t, s ∈ R⁺ and β ∈ U_{β₀},

0 ≡ ∫ ∇_β m(t, z, β) e^{β^tr z − e^{β^tr z} s} p(z) dz + ∫ m(t, z, β) c_s(β^tr z) z p(z) dz,  (2.57)

where c_s(β^tr z) z = ∇_β ( e^{β^tr z − e^{β^tr z} s} ). Multiply both sides of (2.57) by β^tr, use equation (2.56) with the choice c_s(β^tr z) β^tr z ∈ B, and then set s = H(t). Then for all t ∈ R⁺, h ∈ H and β ∈ U_{β₀},

0 ≡ β^tr ∫ ∇_β m(t, z, β) e^{β^tr z − e^{β^tr z} H(t)} p(z) dz.  (2.58)

Equation (2.58) holds for the specific choice β = β₀, which contradicts the non-singularity of the matrix ∫ ∇_β m(t, z, β₀) e^{β₀^tr z − e^{β₀^tr z} s} p(z) dz as described in (2.3). ∎

2.3.2 Proof of Theorem 2.2.1

Note that (2.20) is true for any hazard rate function λ_C; therefore it is satisfied when C puts its mass at a point c, i.e., for a constant c ∈ R⁺, Λ_C(s) = Λ_c(s), where

Λ_c(s) = 0 when s < c;  ∞ when s > c.  (2.59)
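The substitution of the point-mass hazard (2.59) into (2.20) uses two elementary facts, spelled out here since they carry the step from (2.59) to the next display:

```latex
% With \Lambda_C = \Lambda_c as in (2.59), the censoring survival
% function and the censoring measure degenerate:
S_C(s) = e^{-\Lambda_c(s)} = I\{s < c\},
\qquad
\lambda_C(c')\, e^{-\Lambda_c(c')}\, dc' = \delta_c(dc')
\quad \text{(a unit point mass at } c' = c\text{)}.
% Substituting into (2.20): in the first term the dc'-integral
% contributes the factor I\{t < c\} = e^{-\Lambda_c(t)}, and the
% second term is simply evaluated at c' = c.
```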
Then (2.20) implies that for any c ∈ R⁺, λ ∈ H and β,

0 ≡ ∫_0^c ∫ m₁(t, z, β) λ(t e^{β^tr z}) e^{β^tr z − Λ(t e^{β^tr z})} p(z) dt dz + ∫ m₀(c, z, β) e^{−Λ(c e^{β^tr z})} p(z) dz
 = ∫ ∫_0^{c e^{β^tr z}} m₁(s e^{−β^tr z}, z, β) λ(s) e^{−Λ(s)} p(z) ds dz + ∫ e^{−Λ(c e^{β^tr z})} m₀(c, z, β) p(z) dz.  (2.60)

Since (2.60) is satisfied by any λ ∈ H, let Λ(s) = Λ_t(s), where Λ_t(s) is defined as in (2.59). Then (2.60) implies

0 ≡ ∫ I{t < c e^{β^tr z}} m₁(t e^{−β^tr z}, z, β) p(z) dz + ∫ I{t > c e^{β^tr z}} m₀(c, z, β) p(z) dz, for all c, t ∈ R⁺ and β.  (2.61)

Next, we prove statement (a); then (b) can be shown with the same strategy.

Proof of (a) In this part, we assume m₀(c, z, β) ≡ 0 for all c ∈ R⁺, z ∈ Z and β ∈ U_{β₀}. Hence (2.61) becomes

0 ≡ ∫ I{t < c e^{β^tr z}} m₁(t e^{−β^tr z}, z, β) p(z) dz, for all t ∈ R⁺ and β.  (2.62)

Since (2.62) holds for all t, c ∈ R⁺, consider the class of functions

A ≡ { a(t) : a(t) is essentially bounded, t ∈ R⁺ }.

Then for any a(t) ∈ A, β, and c ∈ R⁺, integrating a(t) multiplied by (2.62) against Lebesgue measure on (0, ∞) gives, for all c ∈ R⁺, a ∈ A and β ∈ U_{β₀},

0 ≡ ∫ ∫_0^{c e^{β^tr z}} a(t) m₁(t e^{−β^tr z}, z, β) p(z) dt dz.  (2.63)

Differentiate (2.63) with respect to c, yielding for all a ∈ A, β ∈ U_{β₀} and a.e. c,

0 ≡ ∫ e^{β^tr z} a(c e^{β^tr z}) m₁(c, z, β) p(z) dz.  (2.64)

By a specific series of choices a(c e^{β^tr z}) = e^{kβ^tr z} I{c e^{β^tr z} ≥ τ} for fixed τ, c ∈ R⁺ and k ∈ Z⁺, we know that (2.64) implies, for all k ∈ Z⁺ and τ ∈ R⁺,

0 = ∫ e^{kβ^tr z} I{c e^{β^tr z} ≥ τ} m₁(c, z, β) p(z) dz.  (2.65)

Letting τ → 0, for all β ∈ U_{β₀}, k ∈ Z⁺ and c ∈ R⁺,

0 = ∫ e^{kβ^tr z} m₁(c, z, β) p(z) dz.  (2.66)

By the Weierstrass theorem described in Lemma 2.3.3, for b ∈ B, β ∈ U_{β₀} and c ∈ R⁺,

0 = ∫ b(β^tr z) m₁(c, z, β) p(z) dz,  (2.67)

where B is the family of differentiable functions with bounded support.
To show the non-existence of CEE by deriving the singularity of

E_{β₀,λ₀}{ ∇_β m₁(T, Z, β₀) },  (2.68)

let us differentiate both sides of (2.67) with respect to β; then for b ∈ B, β ∈ U_{β₀} and c ∈ R⁺,

β^tr ∫ ∇_β m₁(c, z, β) b(β^tr z) p(z) dz = − ∫ β^tr z b′(β^tr z) m₁(c, z, β) p(z) dz = 0,  (2.69)

where the right-hand side vanishes by (2.67), because β^tr z b′(β^tr z) is again a function of β^tr z. Choose b(β^tr z) = e^{β^tr z} h(e^{β^tr z}) e^{−H(e^{β^tr z} c)} and integrate against dc. Then (2.69) implies

β₀^tr E_{β₀,λ₀}{ ∇_β m₁(T, Z, β₀) } ≡ 0.  (2.70)

This contradiction of non-singularity completes the proof of (a).

Proof of (b) In this part, we assume that m₁(t, z, β) ≡ 0 for all t ∈ R⁺, z ∈ Z and β ∈ U_{β₀}. Hence (2.61) becomes

0 ≡ ∫ I{t > c e^{β^tr z}} m₀(c, z, β) p(z) dz, for all c ∈ R⁺ and β ∈ U_{β₀}.  (2.71)

In fact, we can show the contradiction of non-singularity described in (2.70) by following the same steps as in the proof of (a). ∎

Chapter 3: ϕ-Transformation Model

3.1 Introduction

In this chapter, we discuss the ϕ-transformation model, which serves as the most general worked-out example for the Extended Estimating Equations (EEE). Let T_i be the lifetime of the i-th individual for i = 1, ..., n, and consider the model

T_i = ϕ(ε_i, β₀, Z_i), i = 1, ..., n,  (3.1)

where Z_i = (Z_{i1}, ..., Z_{ip})^tr is a covariate, the ε_i are errors that are independently identically distributed (i.i.d.) with common differentiable distribution F(x) and hazard rate function λ(x), β₀ ∈ R^p is the unknown coefficient to be estimated, and ϕ(x, y, z) : R × R^p × R^p → R⁺ is a given function such that ϕ(·, y, z) is strictly increasing and continuously differentiable, and ϕ(x, ·, z) is differentiable. We also assume that the support of T and the support of ε do not depend on the parameter (β, F). In practice, we are sometimes not able to observe the complete lifetime T_i. Let us use C_i to denote the right censoring variable, assumed to be identically distributed with a common distribution function denoted by H. Assume that conditionally given Z_i, T_i and C_i are independent.
When the T_i are subject to right censoring, the data collected are random vectors

(V_i, ∆_i, Z_i), i = 1, ..., n,  (3.2)

where V_i = min(T_i, C_i), and ∆_i = I{T_i ≤ C_i} is equal to 1 when T_i does not exceed the right censoring variable C_i, and 0 otherwise. The ϕ-transformation model relates to many other important models in statistics, and we mention the following examples. First, when ϕ(x, y, z) depends on the structural parameter y only through y^tr z and is linear in y^tr z and x, i.e.,

T_i = β₀^tr Z_i + ε_i,  (3.3)

then the ϕ-transformation model becomes the usual linear model. Nothing needs to be changed when the lifetime T_i is replaced by h(T_i), where h(·) is a known monotone function. When h(T_i) = ln T_i, (3.3) becomes the Accelerated Failure Time (AFT) model; therefore the AFT model is an example of the ϕ-transformation model. A review of the linear model and the AFT can be found on page 8 of Chapter 1. Another important class of models related to the ϕ-transformation model is the nonlinear regression model with additive independent errors ([39], Chapter 5), namely,

T_i = µ(Z_i, β₀) + ε_i,  (3.4)

where µ(·, β₀) is given, T_i is continuous, and ε_i is independent of Z_i. This model assumes that there is a basic underlying distribution for the lifetime, but the location shifts according to the covariate Z. In (3.4), ϕ(ε, Z, β) = µ(Z, β) + ε, and one way to generalize it is to allow a shape change in T_i, i.e.,

h(T_i) = b₀(β₁^tr Z_i) + ε_i b₁(β₂^tr Z_i),

where h : R⁺ → R is a known monotone function, b₀ and b₁ are both known functions, and b₁ is strictly positive. This way, both the location and the shape change according to covariates. It is not necessary that the transformation function ϕ(x, y, z) have an analytical form. A possible choice is to let ϕ(x, y, z) be an integral of a known positive rate function: let b(s, β, z) : R × R^d × R^d → R⁺ be a given function; then

h(T_i) = ϕ(ε_i, Z_i, β₀) = ∫_{−∞}^{ε_i} b(s, β₀, Z_i) ds.
(3.5)

The time T_i is known as "operational time", and model (3.5) may have different interpretations in practice depending on the choice of b(s, β₀, z). For example, b(s, β₀, z) can describe the wearing-out rate of a device; this rate can depend on a structural parameter β₀ and a covariate Z_i that differs from device to device. This type of model is also discussed by Nikulin in [3]. This chapter is organized as follows. In Section 3.2, we establish an EEE for β₀ using martingale theory. In Section 3.3, we prove the consistency and asymptotic normality of the estimator of β₀. Technical lemmas not given in detail in Section 3.3 are postponed to Section 3.4.

3.2 Extended Estimating Equations and Martingales

To construct the EEE that yields an estimator of β₀ for the ϕ-transformation model (3.1), we follow Tsiatis [38] by building a martingale on the residual scale. Let N_i^T(v) be the counting process for the i-th individual's lifetime T_i, and let Y_i^T(v) be the at-risk indicator for lifetime T_i, i.e.,

N_i^T(v) = ∆_i · I{V_i ≤ v};  Y_i^T(v) = I{V_i ≥ v}.  (3.6)

Since ϕ(·, y, z) is strictly increasing, we can define its inverse function ϕ^{−1}(t, y, z) such that

ϕ(u, y, z) = t ⇔ ϕ^{−1}(t, y, z) = u.  (3.7)

Let ζ_i be the residual when T_i is censored, i.e., ζ_i = ϕ^{−1}(C_i, β₀, Z_i). Then we can write the counting process for the residual ε_i as follows:

N_i^ε(u) = I{min(ε_i, ζ_i) ≤ u, ε_i ≤ ζ_i}.  (3.8)

Since ϕ^{−1}(·, y, z) is also strictly increasing,

I{min(ε_i, ζ_i) ≤ u} = I{min(ϕ(ε_i, β₀, Z_i), ϕ(ζ_i, β₀, Z_i)) ≤ ϕ(u, β₀, Z_i)} = I{min(T_i, C_i) ≤ ϕ(u, β₀, Z_i)},  (3.9)

and

I{ε_i ≤ ζ_i} = I{ϕ(ε_i, β₀, Z_i) ≤ ϕ(ζ_i, β₀, Z_i)} = I{T_i ≤ C_i} = ∆_i.  (3.10)

From (3.6)-(3.10), we know that

N_i^ε(u) = I{min(T_i, C_i) ≤ ϕ(u, β₀, Z_i), ∆_i = 1} = N_i^T(ϕ(u, β₀, Z_i)).  (3.11)

As for the at-risk indicator Y_i^ε(u),

Y_i^ε(u) = I{min(ε_i, ζ_i) ≥ u} = I{min(ϕ(ε_i, β₀, Z_i), ϕ(ζ_i, β₀, Z_i)) ≥ ϕ(u, β₀, Z_i)} = I{min(T_i, C_i) ≥ ϕ(u, β₀, Z_i)} = Y_i^T(ϕ(u, β₀, Z_i)).  (3.12)
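The identities (3.11)-(3.12) can be checked numerically. The sketch below is illustrative only: the transformation ϕ(u, β, z) = exp(u + βz), its inverse, and all simulation parameters are hypothetical choices, not taken from the dissertation.

```python
# Numeric check (illustrative, not from the dissertation) of the
# residual-scale identities (3.11)-(3.12) for the hypothetical choice
# phi(u, beta, z) = exp(u + beta*z), whose inverse is log(t) - beta*z.
import math
import random

def phi(u, beta, z):      return math.exp(u + beta * z)
def phi_inv(t, beta, z):  return math.log(t) - beta * z

random.seed(1)
beta0 = 0.7
# Simulate (V_i, Delta_i, Z_i) from T_i = phi(eps_i, beta0, Z_i) with censoring,
# keeping the residuals eps_i and zeta_i = phi_inv(C_i, beta0, Z_i) on the side.
data = []
for _ in range(200):
    z = random.uniform(-1.0, 1.0)
    T = phi(random.gauss(0.0, 1.0), beta0, z)
    C = random.expovariate(0.5)
    data.append((min(T, C), 1 if T <= C else 0, z,
                 phi_inv(T, beta0, z), phi_inv(C, beta0, z)))

ok = True
for u in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    for V, Delta, z, eps, zeta in data:
        # Residual-scale processes built directly from (eps_i, zeta_i), as in (3.8):
        N_eps = 1 if (min(eps, zeta) <= u and eps <= zeta) else 0
        Y_eps = 1 if min(eps, zeta) >= u else 0
        # Lifetime-scale processes evaluated at phi(u, beta0, z), per (3.11)-(3.12):
        N_T = Delta * (1 if V <= phi(u, beta0, z) else 0)
        Y_T = 1 if V >= phi(u, beta0, z) else 0
        ok = ok and (N_eps == N_T) and (Y_eps == Y_T)
```

Because ϕ(·, β, z) is strictly increasing, each residual-scale indicator agrees with the corresponding lifetime-scale indicator at the transformed time point, which is exactly what (3.11) and (3.12) assert.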
Since the counting process and at-risk indicator for ε_i and T_i are related as in (3.11) and (3.12), from now on we will mainly use N_i^T(·) and Y_i^T(·) to construct estimators, and write N_i(·) = N_i^T(·), Y_i(·) = Y_i^T(·) for the sake of simplicity. Assume that the V_i are nondegenerate; then there exists a constant T* such that for some ξ > 0,

P{V_i ≥ T* + ξ} ≥ ψ > 0, for all i.  (3.13)

Let us introduce additional notation to be used throughout this chapter. For a function g(x₁, x₂, x₃), let ∇_i g(x₁, x₂, x₃) = ∂g(x₁, x₂, x₃)/∂x_i. Define

J(u, β₁, β₂, z) ≡ ϕ^{−1}(ϕ(u, β₁, z), β₂, z).  (3.14)

Note that by the chain rule, as shown in Lemma 3.4.1,

γ(u, β₀, z) = ∇₂ J(u, β, β₀, z)|_{β=β₀} = ∇₂ϕ(u, β₀, z) / ∇₁ϕ(u, β₀, z).  (3.15)

In addition, assume the following regularity conditions:

(A.1) The density function of ε, f(x) = dF(x)/dx, exists and is bounded by K₁ on (−∞, T* + ξ], where T*, ξ are as in (3.13). The hazard rate λ(x) is twice differentiable.

(A.2) The density function of C_i exists, and h(x) = −dH(x)/dx ≤ K₂ for all x ≤ T* + ξ.

(A.3) There exists θ(u, β₀, Z) with E|θ(u, β₀, Z)| < ∞ such that

|λ(J(u, β, β₀, Z_i)) − λ(u) − γ(u, β₀, Z_i)^tr (β − β₀) λ′(u)| ≤ ‖β − β₀‖² θ(u, β₀, Z_i),  (3.16)

for β ∈ U_{β₀}(n^{−1/2}) and u ∈ R, almost surely in Z_i, where U_{β₀}(n^{−1/2}) is an n^{−1/2}-neighbourhood of β₀.

(A.4) There exists a constant c such that P{‖Z_i‖ < c} = 1.

(A.5) Let µ_{β₀}(u, β) = E{Z₁ I{V₁ > ϕ(u, β, Z₁)}} / E{I{V₁ > ϕ(u, β, Z₁)}}. Then

sup_{β ∈ U_{β₀}(n^{−1/2}), u ≤ T*+ξ} ‖Z̄(u, β) − µ_{β₀}(u, β)‖ → 0 in probability, as n → ∞,

where

Z̄(u, β) = [∑_{j=1}^n Z_j Y_j(ϕ(u, β, Z_j))] / [∑_{j=1}^n Y_j(ϕ(u, β, Z_j))].  (3.17)

(A.6) Let A_{β₀}(u, β) ≡ E{ I(V₁ > ϕ(u, β, Z₁)) (Z₁ − Z̄(u, β)) γ(u, β₀, Z₁)^tr }. Then

sup_{β ∈ U_{β₀}(n^{−1/2}), u ≤ T*+ξ} | (1/n) ∑_{i=1}^n Y_i(ϕ(u, β, Z_i)) { Z_i − Z̄(u, β) } γ(u, β₀, Z_i)^tr − A_{β₀}(u, β) |

approaches zero in probability, and for all β ∈ U_{β₀}(n^{−1/2}),

∫_{−∞}^{T*} λ′(u) A_{β₀}(u, β) du is nonsingular.

(A.7) For β ∈ U_{β₀}(n^{−1/2}), E{Λ(J(T*, β₀, β, Z))} = M < ∞.
(A.8) ϕ(·, y, z), ϕ(x, ·, z) and ϕ^{−1}(·, y, z) are all Lipschitz continuous; ϕ(·, y, z) is continuously differentiable, and ∇₂ϕ(x, ·, z) is Lipschitz continuous.

(A.9) ϕ(·, y, z) is strictly increasing.

(A.10) E‖ ∫_{−∞}^{T*} γ(u, β₀, Z_i) λ′(u) du ‖ < ∞.

3.2.1 Remarks and Sufficient Conditions for the Assumptions

First, let us discuss assumption (A.3). Note that the left hand side of (3.16) comes from the Taylor expansion of λ(J(u, ·, β₀, Z_i)) at β = β₀, whose remainder is (β − β₀)^tr ∂²λ(J(u, β, β₀, Z_i))/∂β² |_{β=β*} (β − β₀) for some intermediate point β*. Also recall that J(u, β₁, β₂, z) = ϕ^{−1}(ϕ(u, β₁, z), β₂, z). Therefore one way to guarantee (3.16) in assumption (A.3) is to assume that

(i) ∂²ϕ(u, β, z)/∂β² and ∂²ϕ^{−1}(u, β, z)/∂β² exist;

(ii) λ′(u) is Lipschitz continuous.

As for the boundedness of E|θ(u, β₀, Z_i)|, it is guaranteed if the derivatives in (i) are continuous in z. In assumptions (A.5) and (A.6), the pointwise convergence for fixed β and u can be obtained using the Law of Large Numbers (LLN). We can show that this convergence is uniform in β ∈ U_{β₀}(n^{−1/2}) and u ≤ T* + ξ using empirical process theory. For example, we can show that

F ≡ { f(v, z; u, β) : I{v ≥ ϕ(u, β, z)} (z − Z̄(u, β)) }

is a Glivenko-Cantelli class, which can be established, under the assumption that the function ϕ(x, y, z) is Lipschitz continuous in x and y, using Examples 19.11 and 19.20 in [41]. The boundedness of the expected values mentioned in (A.7) and (A.10) is guaranteed by the smoothness of Λ(·) and J(u, β₀, ·, z) and the bounded support of Z. Under assumption (A.8), J(u, β₁, ·, z) is Lipschitz. Since J(u, β₀, β₀, z) = u, we know that |J(u, β₀, β, z)| ≤ |u| + c‖β − β₀‖, hence (A.7) is satisfied. (A.10) is guaranteed if we assume that γ(u, β₀, z) is bounded for any u and z. Define

S_n(β) = ∑_{i=1}^n ∫_{−∞}^{T*} { Z_i − Z̄(u, β) } dN_i(ϕ(u, β, Z_i)),  (3.18)

where Z̄(u, β) is defined in (3.17). From now on, let P denote the probability measure under the true nuisance and structural parameters.
We now construct the martingale in the following proposition.

Proposition 3.2.1 M_i{ϕ(u, β, Z_i)} is a martingale with respect to the filtration

F_n(u, β) = σ( Z_i, I{V_i ≤ ϕ(s, β, Z_i)}, ∆_i I{V_i ≤ ϕ(s, β, Z_i)}, i = 1, ..., n, s ≤ u ),  (3.19)

under P, where

M_i(ϕ(u, β, Z_i)) = N_i(ϕ(u, β, Z_i)) − ∫_{−∞}^{J(u,β,β₀,Z_i)} λ(x) Y_i(ϕ(x, β₀, Z_i)) (ρ(x, β₀, β, Z_i))^{−1} dx,  (3.20)

and

ρ(u, β, β₀, Z_i) = ∇₁ J(u, β, β₀, Z_i) = ∇₁ϕ(u, β, Z_i) / ∇₁ϕ(J(u, β, β₀, Z_i), β₀, Z_i).  (3.21)

Proposition 3.2.1 is a direct consequence of the compensated martingale associated with the counting process N_i(ϕ(u, β, Z_i)); the details of the calculation can be found in Section 3.4. As a special case of Proposition 3.2.1, when β = β₀,

dM_i(ϕ(u, β₀, Z_i)) = dN_i(ϕ(u, β₀, Z_i)) − λ(u) Y_i(ϕ(u, β₀, Z_i)) du  (3.22)

is a martingale differential with respect to the measure P and the filtration F_n(u, β₀) defined in (3.19). Since S_n(β₀) = ∑_{i=1}^n ∫_{−∞}^{T*} { Z_i − Z̄(u, β₀) } dM_i(ϕ(u, β₀, Z_i)), it follows that

E_{β₀}{S_n(β₀)} = 0.  (3.23)

The martingale Central Limit Theorem (CLT) implies that n^{−1/2} S_n(β₀) is asymptotically normal with mean zero and variance σ²(β₀), where

σ²(β₀) = ∫_0^{T*} (Z₁ − Z̄(u, β₀))^{⊗2} λ(u) P{V₁ ≥ ϕ(u, β₀, Z₁)} du.  (3.24)

Now, (3.18) is in the form of the extended estimating equations defined in the previous chapter. We can rewrite (3.18) as

S_n(β) = ∑_{i=1}^n ∆_i { Z_i − [∑_{j=1}^n Z_j I{ϕ^{−1}(V_j, β, Z_j) ≥ ϕ^{−1}(V_i, β, Z_i)}] / [∑_{j=1}^n I{ϕ^{−1}(V_j, β, Z_j) ≥ ϕ^{−1}(V_i, β, Z_i)}] }.  (3.25)

With the choices X_i = (V_i, Z_i, ∆_i),

Q(X_i, β) = ∆_i,  C(X_i) = Z_i,  k(X_i, X_j, β) = I{ ϕ^{−1}(V_j, β, Z_j) ≥ ϕ^{−1}(V_i, β, Z_i) },

equation (3.18) is exactly in the form of ∑_{i=1}^n m_n(X_i, X, β) mentioned in (1.10), where

m_n(X_i, X, β) = Q(X_i, β) { C(X_i) − [∑_{j=1}^n C(X_j) k(X_i, X_j, β)] / [∑_{j=1}^n k(X_i, X_j, β)] },

with mean zero under the true parameter value, as shown in (3.23).

Corollary 3.2.1 Under the ϕ-transformation model assumption, S_n(β) = 0 is an extended estimating equation, where S_n(β) is defined in (3.25).
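The form (3.25) is directly computable from data. The following minimal sketch (illustrative, not from the dissertation) evaluates S_n(β) for p = 1 under the hypothetical transformation ϕ(u, β, z) = exp(u + βz), so ϕ^{−1}(t, β, z) = log t − βz; the small data set is made up for demonstration.

```python
# A minimal numeric sketch (illustrative, not from the dissertation) of the
# EEE criterion (3.25) for p = 1, under the hypothetical transformation
# phi(u, beta, z) = exp(u + beta*z), so phi_inv(t, beta, z) = log(t) - beta*z.
import math

def S_n(beta, V, Delta, Z):
    """Evaluate (3.25): sum over uncensored i of Z_i minus the average of
    Z_j over the residual-scale risk set {j : r_j >= r_i}."""
    r = [math.log(v) - beta * z for v, z in zip(V, Z)]  # residuals phi_inv(V_i, beta, Z_i)
    total = 0.0
    for i in range(len(V)):
        if Delta[i] == 0:
            continue
        risk = [j for j in range(len(V)) if r[j] >= r[i]]
        total += Z[i] - sum(Z[j] for j in risk) / len(risk)
    return total

# Tiny illustrative data set.
V = [0.8, 1.5, 2.3, 0.6, 3.1]
Delta = [1, 0, 1, 1, 1]
Z = [0.2, -0.4, 1.0, 0.7, -0.1]

# S_n is a step function of beta (note the indicators inside (3.25)), so
# instead of root finding we minimize S_n(beta)^2 over a grid, in the
# spirit of the minimization definition used in the next section.
grid = [i / 100.0 for i in range(-300, 301)]
beta_hat = min(grid, key=lambda b: S_n(b, V, Delta, Z) ** 2)
```

Note that when all covariates Z_i coincide, every term Z_i − Z̄ vanishes and S_n(β) ≡ 0, which is a quick sanity check on the implementation.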
3.3 Estimator for the Structural Parameter

In the previous section, we constructed an extended estimating equation S_n(β) in equation (3.18). The goal of this section is to establish the √n-consistency of β̂_n, the estimator for β₀ obtained via the extended estimating equation S_n(β). In Theorem 3.3.3, we prove that √n(β̂_n − β₀) is asymptotically normal with mean zero. Note that S_n(β) is a step function, which raises two problems. First, there may not exist a root of the equation S_n(β) = 0. Second, the usual Taylor expansion method does not apply to S_n(β). Consequently, we adopt the definition of β̂_n given by Jurečková in Section 4 of [24]. Namely, for S_n(β) = (S_{n,1}(β), ..., S_{n,p}(β))^tr,

β̂_n = arg min_{β ∈ U_{β₀}} ∑_{j=1}^p {S_{n,j}(β)}²,  (3.26)

where U_{β₀} is a neighbourhood of β₀ in which there is a unique solution to the minimization problem (3.26). In the case when the minimization problem (3.26) has more than one solution, the estimator β̂_n is defined as the one with the smaller lexicographic norm, i.e., if both vectors b̂_i = (b̂_{i,1}, ..., b̂_{i,p}) satisfy (3.26) for i = 1, 2, and b̂_{1,j} = b̂_{2,j} for j = 1, ..., k with k < p, and if b̂_{1,k+1} < b̂_{2,k+1}, then β̂_n = b̂₁. Next, we show that S_n(β) is asymptotically linear in a neighbourhood of β₀. Let

g(β) = ∫_{−∞}^{T*} λ′(u) A_{β₀}(u, β) du,  (3.27)

where A_{β₀}(u, β) is the function defined in assumption (A.6) of the previous section. Define a linear function of β as follows:

S̃_n(β) = S_n(β₀) + n g(β₀)(β − β₀).  (3.28)

Let β*_n be the root of S̃_n(β) = 0. Consider β ∈ U_{β₀}(n^{−1/2}), an n^{−1/2}-neighbourhood of β₀. If we can show that S_n(β) is "asymptotically equivalent" to S̃_n(β), i.e., the l^∞ norm of S_n(β) − S̃_n(β) converges to zero in probability, then β̂_n is also "asymptotically equivalent" to β*_n; namely, they are both asymptotically normal with mean zero and identical asymptotic variance.
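The definition (3.26) with its lexicographic tie-breaking rule can be illustrated with a small grid search. The sketch below is purely illustrative: `objective` stands in for β ↦ ∑_j {S_{n,j}(β)}², and the grid, tolerance, and example objectives are hypothetical.

```python
# Illustration (not the dissertation's algorithm) of definition (3.26):
# minimize the summed squared EEE coordinates over a candidate grid, and
# break exact ties lexicographically.  `objective` is a hypothetical
# stand-in for beta -> sum_j S_{n,j}(beta)^2.

def argmin_lex(candidates, objective, tol=1e-12):
    """Return the candidate tuple (playing the role of beta) minimizing
    `objective`; among near-ties, return the lexicographically smallest,
    matching the tie-breaking rule under (3.26)."""
    best = min(objective(c) for c in candidates)
    ties = [c for c in candidates if objective(c) <= best + tol]
    return min(ties)  # Python tuples already compare lexicographically

# Example with p = 2: an objective that is flat in the second coordinate
# forces a tie along a whole line of minimizers.
grid = [(x / 10.0, y / 10.0) for x in range(-10, 11) for y in range(-10, 11)]
obj = lambda b: (b[0] - 0.5) ** 2     # any b with b[0] = 0.5 is a minimizer
beta_hat = argmin_lex(grid, obj)      # lexicographically smallest minimizer
```

Here every grid point with first coordinate 0.5 attains the minimum, and the rule selects the one with the smallest second coordinate, exactly as the lexicographic convention prescribes.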
From (3.28), we know that if g(β₀) is nonsingular, then for β ∈ U_{β₀}(n^{−1/2}),

√n (β*_n − β₀) = −n^{−1/2} {g(β₀)}^{−1} S_n(β₀)

is asymptotically normal with mean zero and variance {g(β₀)}^{−1} σ²(β₀) {g(β₀)}^{−1}. If we can show that √n(β̂_n − β*_n) → 0 in probability, then this would imply that √n(β̂_n − β₀) has the same asymptotic distribution as √n(β*_n − β₀). As argued by Jurečková in [23] and [24], it suffices to show that

sup_{β ∈ U_{β₀}(n^{−1/2})} n^{−1/2} |S_n(β) − S̃_n(β)| → 0 in probability.  (3.29)

We will show (3.29) in two steps: in Theorem 3.3.1 we show the pointwise convergence in (3.29), and then we obtain the uniformity in β in Theorem 3.3.2. Let us start with the first step, i.e., the pointwise convergence, by writing S_n(β) as the sum of

S_{n1}(β) = ∑_{i=1}^n ∫_{−∞}^{T*} { Z_i − Z̄(u, β) } dM_i{ϕ(u, β, Z_i)}  (3.30)

and

S_{n2}(β) = ∑_{i=1}^n ∫_{−∞}^{T*} { λ(J(u, β, β₀, Z_i)) ρ(u, β, β₀, Z_i) − λ(u) } Y_i{ϕ(u, β, Z_i)} { Z_i − Z̄(u, β) } du.  (3.31)

We show that n^{−1/2} |S_{n1}(β) − S_n(β₀)| → 0 in probability for fixed β ∈ U_{β₀}(n^{−1/2}), using Lenglart's inequality and the predictable variation process in Lemmas 3.4.2 and 3.4.3, and that n^{−1/2} |S_{n2}(β) − n g(β₀)(β − β₀)| → 0 in probability in Lemmas 3.4.4 and 3.4.5, using the Lipschitz assumption in (A.8). Careful statements and proofs of Lemmas 3.4.2-3.4.5 can be found in Section 3.4. We are now ready to state the pointwise asymptotic linearity of S_n(β) for any fixed β ∈ U_{β₀}(n^{−1/2}).

Theorem 3.3.1 Under assumptions (A.1), (A.3), and (A.5)-(A.8), for any fixed β_n that belongs to U_{β₀}(n^{−1/2}),

n^{−1/2} |S_n(β_n) − S̃_n(β_n)| → 0 in probability.  (3.32)

Proof: Note that we can write S_n(β) = S_{n1}(β) + S_{n2}(β), and that S̃_n(β) = S_n(β₀) + n g(β₀)(β − β₀). Therefore the proof is complete since

n^{−1/2} |S_{n1}(β_n) − S_n(β₀)| → 0 in probability,  (3.33)

n^{−1/2} |S_{n2}(β_n) − n g(β₀)(β_n − β₀)| = n^{−1/2} · n · ‖β_n − β₀‖ o_p(1) → 0 in probability,  (3.34)

where (3.33) and (3.34) follow from Lemmas 3.4.3 and 3.4.5, respectively. In Theorem 3.3.1 we have shown the pointwise linearity of S_n(β).
The next theorem guarantees that this linearity is uniform for $\beta$ in a small neighbourhood of $\beta_0$.

Theorem 3.3.2 Under assumptions (A.1)-(A.8), for any $\varepsilon > 0$ there exists $\delta > 0$ such that
$$\lim_{n \to \infty} P\left\{ \sup_{0 \le \|\beta^* - \beta_n\| \le \delta n^{-1/2}} n^{-1/2} \|S_n(\beta^*) - S_n(\beta_n)\| \ge \varepsilon \right\} = 0, \qquad (3.35)$$
for any $|d| < C$.

The proof of the theorem uses the same technique as Tsiatis in [38]: for a sufficiently fine partition of the interval $[-C, C]$, the function $S_n(\beta)$ does not fluctuate too much within the sub-intervals. Details of the proof are presented in Section 3.4. We can now conclude the $\sqrt{n}$-consistency of $\hat\beta_n$ announced at the beginning of this section.

Theorem 3.3.3 Let $\hat\beta_n$ be the solution to
$$\hat\beta_n = \arg\min_{\beta \in U_{\beta_0}} \sum_{j=1}^{p} \{S_{n,j}(\beta)\}^2, \qquad (3.36)$$
where $U_{\beta_0}$ is a neighbourhood of $\beta_0$ on which $\hat\beta_n$ is unique. Then under (A.1)-(A.8),
$$\sqrt{n}(\hat\beta_n - \beta_0) \xrightarrow{D} N\big(0,\ \{g(\beta_0)\}^{-1} \sigma^2(\beta_0) \{g(\beta_0)\}^{-1}\big).$$
Proof: Let $\beta^*_n$ be the solution to $\tilde S_n(\beta) = 0$, where
$$\tilde S_n(\beta) = S_n(\beta_0) + n g(\beta_0)(\beta - \beta_0). \qquad (3.37)$$
If $\sigma^2(\beta_0)$ denotes the asymptotic variance of $n^{-1/2} S_n(\beta_0)$ mentioned below (3.23), then
$$\sqrt{n}(\beta^*_n - \beta_0) = n^{-1/2} \{g(\beta_0)\}^{-1} S_n(\beta_0) \qquad (3.38)$$
is asymptotically normal with mean zero and variance $\{g(\beta_0)\}^{-1} \sigma^2(\beta_0) \{g(\beta_0)\}^{-1}$. On the other hand, by Theorem 3.3.2,
$$\sup_{\beta \in U_{\beta_0}(n^{-1/2})} n^{-1/2} |S_n(\beta) - \tilde S_n(\beta)| \xrightarrow{P} 0, \qquad (3.39)$$
which, as proved by Jurečková in [23], implies $\sqrt{n}(\hat\beta_n - \beta^*_n) \xrightarrow{P} 0$. As a result, $\sqrt{n}(\hat\beta_n - \beta_0)$ follows the same asymptotic distribution as $\sqrt{n}(\beta^*_n - \beta_0)$, and the proof is complete. $\square$

3.4 Some Proofs

Lemma 3.4.1 Let $J(u, \beta_1, \beta_2, z) = \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z)$. Then:

(J1) $\nabla_1 J(u, \beta_1, \beta_2, z) = \rho(u, \beta_1, \beta_2, z)$ and $\nabla_2 J(u, \beta_1, \beta_2, z) = \gamma(u, \beta_1, \beta_2, z)$, where
$$\rho(u, \beta_1, \beta_2, z) = \frac{\nabla_1 \phi(u, \beta_1, z)}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)}, \qquad \gamma(u, \beta_1, \beta_2, z) = \frac{\nabla_2 \phi(u, \beta_1, z)}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)}. \qquad (3.40)$$

(J2) Let $x = J(u, \beta_1, \beta_2, z)$; then $u = J(x, \beta_2, \beta_1, z)$, and
$$\rho(u, \beta_1, \beta_2, z) = \frac{1}{\rho(x, \beta_2, \beta_1, z)}. \qquad (3.41)$$

(J3) $J(u, \cdot, \beta_2, z)$ is Lipschitz.
(J4) There exists a constant $c$ such that $|\rho(u, \beta_1, \beta_2, z) - 1| \le c \|\beta_1 - \beta_2\|$.

Proof: (J1) By the chain rule,
$$\nabla_1 J(u, \beta_1, \beta_2, z) = \nabla_1 \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z) \cdot \nabla_1 \phi(u, \beta_1, z), \qquad \nabla_2 J(u, \beta_1, \beta_2, z) = \nabla_1 \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z) \cdot \nabla_2 \phi(u, \beta_1, z). \qquad (3.42)$$
Let $w = \phi(u, \beta_1, z)$. Since $\phi(\phi^{-1}(w, \beta_2, z), \beta_2, z) = w$, differentiating both sides with respect to $w$ yields $\nabla_1 \phi(\phi^{-1}(w, \beta_2, z), \beta_2, z) \cdot \nabla_1 \phi^{-1}(w, \beta_2, z) = 1$. Therefore
$$\nabla_1 \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z) = \nabla_1 \phi^{-1}(w, \beta_2, z) = \frac{1}{\nabla_1 \phi(\phi^{-1}(w, \beta_2, z), \beta_2, z)} = \frac{1}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)}. \qquad (3.43)$$
Then (3.40) is obtained by plugging (3.43) into (3.42).

(J2) If $x = J(u, \beta_1, \beta_2, z) = \phi^{-1}(\phi(u, \beta_1, z), \beta_2, z)$, then $u = \phi^{-1}(\phi(x, \beta_2, z), \beta_1, z) = J(x, \beta_2, \beta_1, z)$. As for (3.41), it is true since
$$\rho(u, \beta_1, \beta_2, z) = \frac{\nabla_1 \phi(u, \beta_1, z)}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)} = \frac{\nabla_1 \phi(J(x, \beta_2, \beta_1, z), \beta_1, z)}{\nabla_1 \phi(x, \beta_2, z)} = \frac{1}{\rho(x, \beta_2, \beta_1, z)}.$$

(J3) Let $b_1$ and $b_2$ be two distinct points in $U_{\beta_0}$. Under assumption (A.8), $\phi^{-1}(\cdot, \beta_2, z)$ and $\phi(u, \cdot, z)$ are both Lipschitz, so there exist constants $c_1$ and $c_2$ such that
$$|J(u, b_1, \beta_2, z) - J(u, b_2, \beta_2, z)| = |\phi^{-1}(\phi(u, b_1, z), \beta_2, z) - \phi^{-1}(\phi(u, b_2, z), \beta_2, z)| \le c_1 |\phi(u, b_1, z) - \phi(u, b_2, z)| \le c_1 c_2 \|b_1 - b_2\|.$$

(J4) By the definition of $\rho(u, \beta_1, \beta_2, z)$ in (3.21),
$$|\rho(u, \beta_1, \beta_2, z) - 1| = \left| \frac{\nabla_1 \phi(u, \beta_1, z)}{\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)} - 1 \right| = \frac{|\nabla_1 \phi(u, \beta_1, z) - \nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)|}{|\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)|}. \qquad (3.44)$$
Since $\phi(\cdot, x, y)$ is assumed strictly increasing, the denominator of (3.44) is strictly greater than a positive constant $m$. Let $D$ denote the numerator of (3.44); then (3.44) is bounded by $D/m$. Since both $\nabla_1 \phi(\cdot, y, z)$ and $\nabla_2 \phi(x, \cdot, z)$ are Lipschitz continuous under assumption (A.8),
$$D \le |\nabla_1 \phi(u, \beta_1, z) - \nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_1, z)| + |\nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_1, z) - \nabla_1 \phi(J(u, \beta_1, \beta_2, z), \beta_2, z)| \le c_1 \|u - J(u, \beta_1, \beta_2, z)\| + c_2 \|\beta_1 - \beta_2\|. \qquad (3.45)$$
Note that we can write $u = J(u, \beta_2, \beta_2, z)$; therefore, by (J3), the first term in (3.45) is bounded by $c_1 c_3 \|\beta_1 - \beta_2\|$.
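Properties (J1) and (J2) can be sanity-checked numerically for one concrete transformation. The sketch below assumes $\phi(u, \beta, z) = u\, e^{\beta z}$ (an illustrative choice satisfying the monotonicity and smoothness requirements of (A.8), not the dissertation's general $\phi$), for which $J(u, \beta_1, \beta_2, z) = u\, e^{(\beta_1 - \beta_2) z}$ and $\rho(u, \beta_1, \beta_2, z) = e^{(\beta_1 - \beta_2) z}$ in closed form.

```python
import math

# Hedged numeric check of Lemma 3.4.1 for the assumed example
# phi(u, beta, z) = u * exp(beta * z).
def phi(u, beta, z):      return u * math.exp(beta * z)
def phi_inv(w, beta, z):  return w * math.exp(-beta * z)
def J(u, b1, b2, z):      return phi_inv(phi(u, b1, z), b2, z)

def rho(u, b1, b2, z):
    # ratio of partial derivatives in the first argument, eq. (3.40)
    return math.exp(b1 * z) / math.exp(b2 * z)

u, b1, b2, z = 0.7, 0.4, 0.1, 1.3
x = J(u, b1, b2, z)

# (J2): J inverts itself, and rho(u,b1,b2,z) * rho(x,b2,b1,z) = 1
inv_ok = abs(J(x, b2, b1, z) - u) < 1e-12
rho_ok = abs(rho(u, b1, b2, z) * rho(x, b2, b1, z) - 1.0) < 1e-12

# (J1): finite-difference derivative of J in u matches rho
h = 1e-6
fd = (J(u + h, b1, b2, z) - J(u - h, b1, b2, z)) / (2 * h)
j1_ok = abs(fd - rho(u, b1, b2, z)) < 1e-6
```

For this $\phi$, $\rho$ is constant in $u$, so (J4) reduces to the Lipschitz continuity of $e^{(\beta_1 - \beta_2) z}$ in $\beta_1 - \beta_2$ on a bounded set.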
Then the conclusion holds with the choice $c = (c_1 c_3 + c_2)/m$. $\square$

3.4.1 Proof of Proposition 3.2.1

Proof: Recall that we would like to prove that $dM^T_i(\phi(u, \beta, Z_i))$ is a martingale with respect to the filtration $\mathcal{F}_n(u, \beta)$ in (3.19), where
$$dM^T_i(\phi(u, \beta, Z_i)) = dN^T_i(\phi(u, \beta, Z_i)) - \lambda(J(u, \beta, \beta_0, Z_i))\, Y^T_i(\phi(u, \beta, Z_i))\, \rho(u, \beta, \beta_0, Z_i)\, du, \qquad (3.46)$$
$N^T_i(u)$ and $Y^T_i(u)$ are defined in (3.6), $\lambda(u)$ is the hazard rate function of $\varepsilon$, and $\rho(u, \beta, \beta_0, Z_i)$ is defined in Lemma 3.4.1. Change the variable in (3.46) by setting
$$u = J(v, \beta_0, \beta, Z_i) = \phi^{-1}(\phi(v, \beta_0, Z_i), \beta, Z_i); \qquad (3.47)$$
then $\phi(u, \beta, Z_i) = \phi(v, \beta_0, Z_i)$, and
$$N^T_i(\phi(u, \beta, Z_i)) = N^T_i(\phi(v, \beta_0, Z_i)) = N^\varepsilon_i(v), \qquad (3.48)$$
where the second equality in (3.48) holds since $N^\varepsilon_i(v) = N^T_i(\phi(v, \beta_0, Z_i))$ as shown in (3.11). By the compensated counting process martingale established in [21], $M^\varepsilon_i(v)$ is a martingale, where
$$dM^\varepsilon_i(v) = dN^\varepsilon_i(v) - \lambda(v)\, Y^\varepsilon_i(v)\, dv. \qquad (3.49)$$
By (3.47), $v = J(u, \beta, \beta_0, Z_i) = \phi^{-1}(\phi(u, \beta, Z_i), \beta_0, Z_i)$, which implies that
$$dv = \nabla_1 J(u, \beta, \beta_0, Z_i)\, du = \rho(u, \beta, \beta_0, Z_i)\, du, \qquad (3.50)$$
where the last equality in (3.50) is guaranteed by (J1) of Lemma 3.4.1. The conclusion then follows from (3.48), (3.49) and (3.50). $\square$

3.4.2 Proof of Lemmas 3.4.2-3.4.5

In Lemmas 3.4.2 and 3.4.3 we show that, for any fixed $\beta$ in an $n^{-1/2}$-neighbourhood of $\beta_0$, the $L^2$ distance between (3.30) and $S_n(\beta_0)$ converges to 0 in probability; Lemmas 3.4.4 and 3.4.5 show the same conclusion for (3.31) and the term $n g(\beta_0)(\beta - \beta_0)$.

Lemma 3.4.2 Let $\beta_n$ be a sequence of nonrandom vectors converging to $\beta_0$. Then under assumptions (A.5) and (A.7),
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \left[ \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \left\{ Z_i - \bar Z(u, \beta_n) \right\} - \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \left\{ Z_i - \mu_{\beta_0}(u, \beta_n) \right\} \right] \qquad (3.51)$$
converges to 0 in probability.

Proof: Let us first consider the scalar case $\beta \in \mathbb{R}$.
The expression in (3.51) is equal to $R(T^*)$, where by Proposition 3.2.1,
$$R(u) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \int_{-\infty}^{u} dM_i(\phi(x, \beta_n, Z_i)) \left\{ \bar Z(x, \beta_n) - \mu_{\beta_0}(x, \beta_n) \right\}$$
is a martingale under $P$ with respect to
$$\mathcal{F}_n(u, \beta_n) = \sigma\left\{ Z_i,\ I[V_i \le \phi(u, \beta_n, Z_i)],\ \Delta_i I[V_i \le \phi(u, \beta_n, Z_i)],\ i = 1, \ldots, n \right\}.$$
Therefore, using the vector form of Lenglart's inequality (see Appendix I, I.2 of [2]),
$$P\{|R(T^*)| > \varepsilon\} \le \frac{\delta}{\varepsilon^2} + P\left\{ \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{T^*} (\bar Z(u, \beta_n) - \mu_{\beta_0}(u, \beta_n))^2\, \lambda(J(u, \beta_n, \beta_0, Z_i))\, \rho(u, \beta_n, \beta_0, Z_i)\, Y_i(\phi(u, \beta_n, Z_i))\, du > \delta \right\}. \qquad (3.52)$$
By assumption (A.5), we can find $N(\varepsilon, K)$ such that for any $n > N(\varepsilon, K)$,
$$P\left\{ \sup_{u \le T^*} |\bar Z(u, \beta_n) - \mu_{\beta_0}(u, \beta_n)| > K \right\} < \varepsilon,$$
hence with probability exceeding $1 - \varepsilon$, each integral in (3.52) is bounded by
$$K^2 \int_{-\infty}^{T^*} \lambda(J(u, \beta_n, \beta_0, Z_i))\, \rho(u, \beta_n, \beta_0, Z_i)\, Y_i(\phi(u, \beta_n, Z_i))\, du. \qquad (3.53)$$
Let $x = J(u, \beta_n, \beta_0, Z_i)$; then by properties (J1) and (J2) of Lemma 3.4.1, $u = J(x, \beta_0, \beta_n, Z_i)$, $du = \rho(x, \beta_0, \beta_n, Z_i)\, dx$, and
$$\rho(u, \beta_n, \beta_0, Z_i) = \frac{1}{\rho(x, \beta_0, \beta_n, Z_i)}.$$
Therefore (3.53) is bounded by
$$K^2 \int_{-\infty}^{J(T^*, \beta_n, \beta_0, Z_i)} \lambda(x)\, dx = K^2 \Lambda(J(T^*, \beta_n, \beta_0, Z_i)).$$
Consequently, the average over $n$ of the integral in (3.52) is bounded by
$$K^2 n^{-1} \sum_{i=1}^{n} \Lambda(J(T^*, \beta_n, \beta_0, Z_i))$$
with probability greater than $1 - \varepsilon$. By assumption (A.7), if we choose $K \le (\delta/M)^{1/2}$ and $\delta = \varepsilon^3$, then the probability in (3.52) is smaller than $\varepsilon$ for $n > N(\varepsilon, K)$.

Now consider the vector-valued parameter case, $\beta \in \mathbb{R}^p$. Let $t = (t_1, \ldots, t_p)^{tr}$ be a unit vector, and write
$$R(T^*) = \sum_{i=1}^{n} \eta_i, \qquad \eta_i = (\eta_{i1}, \ldots, \eta_{ip})^{tr}. \qquad (3.54)$$
Then
$$|t^{tr} R(T^*)| = \Big| \sum_{i=1}^{n} t^{tr} \eta_i \Big| = \Big| \sum_{j=1}^{p} t_j \sum_{i=1}^{n} \eta_{ij} \Big| \le \sum_{j=1}^{p} |t_j| \Big| \sum_{i=1}^{n} \eta_{ij} \Big|. \qquad (3.55)$$
Since $\|t\| = 1$ implies $|t_j| \le 1$ for $j = 1, \ldots, p$, therefore
$$|t^{tr} R(T^*)| \le \sum_{j=1}^{p} \Big| \sum_{i=1}^{n} \eta_{ij} \Big|. \qquad (3.56)$$
Therefore
$$P\left\{ \sup_{t \in \mathbb{R}^p, \|t\| = 1} |t^{tr} R(T^*)| \ge \varepsilon \right\} \le P\left\{ \sum_{j=1}^{p} \Big| \sum_{i=1}^{n} \eta_{ij} \Big| \ge \varepsilon \right\} \le p \max_j P\left\{ \Big| \sum_{i=1}^{n} \eta_{ij} \Big| \ge \frac{\varepsilon}{p} \right\}. \qquad (3.57)$$
The probability in (3.57) converges to zero by applying the univariate Lenglart inequality to $P\{|\sum_{i=1}^{n} \eta_{ij}| \ge \varepsilon/p\}$. Therefore the conclusion holds for $\beta \in \mathbb{R}^p$. $\square$

Lemma 3.4.3 Let $\beta_n$ be a sequence of nonrandom vectors converging to $\beta_0$. Then under assumptions (A.4), (A.5), (A.7) and (A.8),
$$\frac{1}{\sqrt{n}} \left\{ \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \{Z_i - \bar Z(u, \beta_n)\} - S_n(\beta_0) \right\} \xrightarrow{P} 0. \qquad (3.58)$$
Proof: The left-hand side of (3.58) can be written as the sum of the following three terms:
$$\frac{1}{\sqrt{n}} \left[ \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \{Z_i - \bar Z(u, \beta_n)\} - \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \{Z_i - \mu_{\beta_0}(u, \beta_n)\} \right], \qquad (3.59)$$
$$\frac{1}{\sqrt{n}} \left[ \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_n, Z_i)) \{Z_i - \mu_{\beta_0}(u, \beta_n)\} - \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_0, Z_i)) \{Z_i - \mu_{\beta_0}(u, \beta_0)\} \right], \qquad (3.60)$$
and
$$\frac{1}{\sqrt{n}} \left[ \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(u, \beta_0, Z_i)) \{Z_i - \mu_{\beta_0}(u, \beta_0)\} - S_n(\beta_0) \right]. \qquad (3.61)$$
By Lemma 3.4.2, (3.59) and (3.61) converge to zero in probability. We focus on the asymptotic behavior of (3.60). Let $u = J(x, \beta_0, \beta_n, Z_i)$; then the first integral in equation (3.60) is
$$\int_{-\infty}^{J(T^*, \beta_n, \beta_0, Z_i)} dM_i(\phi(x, \beta_0, Z_i)) \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}. \qquad (3.62)$$
Then (3.60) is equal to the sum of $A$, $B$ and $C$, where
$$A = -\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \int_{-\infty}^{T^*} dM_i(\phi(x, \beta_0, Z_i)) \{\mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n) - \mu_{\beta_0}(x, \beta_0)\}, \qquad (3.63)$$
$$B = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \int_{T^*}^{J(T^*, \beta_n, \beta_0, Z_i)} dM_i(\phi(x, \beta_0, Z_i))\, I\{T^* > J(T^*, \beta_0, \beta_n, Z_i)\} \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}, \qquad (3.64)$$
and
$$C = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \int_{J(T^*, \beta_n, \beta_0, Z_i)}^{T^*} dM_i(\phi(x, \beta_0, Z_i))\, I\{T^* < J(T^*, \beta_0, \beta_n, Z_i)\} \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}. \qquad (3.65)$$
Since $A$, $B$ and $C$, integrated up to $u$, are all $\mathcal{F}_n(u, \beta_0)$ martingales, conditionally on the $Z_i$,
$$Var(A) = \frac{1}{n} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \{\mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n) - \mu_{\beta_0}(x, \beta_0)\}^{\otimes 2}\, \lambda(x)\, P\{V_i \ge \phi(x, \beta_0, Z_i)\}\, dx,$$
$$Var(B) = \frac{1}{n} \sum_{i \in \kappa_1} \int_{T^*}^{J(T^*, \beta_n, \beta_0, Z_i)} \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}^{\otimes 2}\, \lambda(x)\, P\{V_i \ge \phi(x, \beta_0, Z_i)\}\, dx,$$
and
$$Var(C) = \frac{1}{n} \sum_{i \in \kappa_2} \int_{J(T^*, \beta_n, \beta_0, Z_i)}^{T^*} \{Z_i - \mu_{\beta_0}(J(x, \beta_0, \beta_n, Z_i), \beta_n)\}^{\otimes 2}\, \lambda(x)\, P\{V_i \ge \phi(x, \beta_0, Z_i)\}\, dx,$$
where the variances are computed using the predictable variation process $\langle dM_i, dM_i \rangle = \lambda(u)\, Y_i(\phi(u, \beta_0, Z_i))$. By the definition of $\mu_{\beta_0}(x, \beta)$ in assumption (A.5) and the boundedness of $Z_i$ assumed in (A.4), $\mu_{\beta_0}(x, \beta)$ is bounded for all $x$ and $\beta$, which implies
$$\|Var(A)\| \le \frac{4}{n} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda(x) S(x)\, dx \le 4.$$
Therefore, by the continuity of $\mu_{\beta_0}(u, \beta)$ and $\phi(x, y)$ together with the dominated convergence theorem, $\|Var(A)\|$ converges to 0, so (3.63) converges to 0 in probability. As for (3.64), since $V_i = \min(T_i, C_i)$,
$$P\{V_i \ge \phi(x, \beta_0, Z_i)\} \le P\{T_i \ge \phi(x, \beta_0, Z_i)\} = P\{\phi^{-1}(T_i, \beta_0, Z_i) \ge x\} = P\{\varepsilon_i \ge x\} = S(x),$$
and since $\lambda(x) S(x) = f(x)$, we know that
$$Var(B) \le \frac{1}{n} \sum_{i \in \kappa_1} \int_{T^*}^{J(T^*, \beta_n, \beta_0, Z_i)} \lambda(x) S(x)\, dx. \qquad (3.66)$$
Since $\lambda(x) S(x) = f(x)$ is bounded, (3.66) implies that
$$Var(B) \le 4 K_1 n^{-1} \sum_{i=1}^{n} \{J(T^*, \beta_n, \beta_0, Z_i) - T^*\}. \qquad (3.67)$$
Note that $T^* = J(T^*, \beta_0, \beta_0, Z_i)$. Since $J(u, \cdot, \beta_2, z)$ is Lipschitz as shown in (J3) of Lemma 3.4.1, there exists a constant $c$ such that each summand on the right-hand side of (3.67) is bounded by $c \|\beta_n - \beta_0\|$, which implies that $Var(B)$ goes to 0 in probability. Using the same approach, we conclude that $Var(C)$ also approaches 0 in probability. Hence the proof is complete. $\square$
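The compensated counting-process martingale underlying these variance calculations can be illustrated by simulation. The sketch below assumes a unit-exponential residual with hazard $\lambda(u) \equiv 1$ (a tractable special case, not the dissertation's general hazard), for which $M(t) = I(T \le t) - \Lambda(\min(T, t)) = I(T \le t) - \min(T, t)$ has mean zero at every $t$.

```python
import numpy as np

# Hedged illustration: for a unit-exponential lifetime T (hazard = 1),
# the compensated process M(t) = I(T <= t) - min(T, t) is a martingale,
# so its marginal mean is zero; the LLN makes the empirical mean small.
rng = np.random.default_rng(1)
T = rng.exponential(size=200_000)

for t in (0.5, 1.0, 2.0):
    M = (T <= t).astype(float) - np.minimum(T, t)
    assert abs(M.mean()) < 0.01   # E[M(t)] = 0 at every fixed t
```

Analytically, $E\,I(T \le t) = 1 - e^{-t}$ and $E\,\min(T, t) = \int_0^t e^{-u}\,du = 1 - e^{-t}$, so the two pieces cancel in expectation.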
Lemma 3.4.4 Let $S_{n2}$ be defined as in (3.31), and let
$$S_{n3} = \sum_{i=1}^{n} \int_{-\infty}^{T^*} \{\lambda(J(u, \beta_n, \beta_0, Z_i)) - \lambda(u)\}\, Y_i(\phi(u, \beta_n, Z_i)) \{Z_i - \bar Z(u, \beta_n)\}\, du. \qquad (3.68)$$
Then under assumptions (A.1) and (A.8),
$$n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)| \xrightarrow{P} 0. \qquad (3.69)$$
Proof: By (3.31) and (3.68),
$$n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)| \le n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda(J(u, \beta_n, \beta_0, Z_i))\, |\rho(u, \beta_n, \beta_0, Z_i) - 1|\, Y_i(\phi(u, \beta_n, Z_i))\, \|Z_i - \bar Z(u, \beta_n)\|\, du$$
$$\le 2 n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda(J(u, \beta_n, \beta_0, Z_i))\, |\rho(u, \beta_n, \beta_0, Z_i) - 1|\, Y_i(\phi(u, \beta_n, Z_i))\, du. \qquad (3.70)$$
By (J4), $|\rho(u, \beta_n, \beta_0, Z_i) - 1| \le c \|\beta_n - \beta_0\|$; therefore, by (3.70),
$$n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)| \le 2 c \|\beta_n - \beta_0\|\, n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda(J(u, \beta_n, \beta_0, Z_i))\, Y_i(\phi(u, \beta_n, Z_i))\, du. \qquad (3.71)$$
Since
$$E\{Y_i(\phi(u, \beta_n, Z_i))\} \le P\{T_i \ge \phi(u, \beta_n, Z_i)\} = P\{\varepsilon \ge J(u, \beta_n, \beta_0, Z_i)\} = S(J(u, \beta_n, \beta_0, Z_i)), \qquad (3.72)$$
we know that
$$n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)| \le 2 c \|\beta_n - \beta_0\|\, n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} f(J(u, \beta_n, \beta_0, Z_i))\, du = 2 c \|\beta_n - \beta_0\|\, n^{-1} \sum_{i=1}^{n} F(J(T^*, \beta_n, \beta_0, Z_i)). \qquad (3.73)$$
By assumption (A.1), $F(J(T^*, \beta_n, \beta_0, Z_i))$ is bounded by $\{c_1 c_2 \|\beta_n - \beta_0\| + 1\}$; therefore, by (3.73), $n^{-1} |S_{n2}(\beta) - S_{n3}(\beta)|$ approaches 0 in probability. $\square$

Lemma 3.4.5 Let $g(\beta_0) = \int_{-\infty}^{T^*} \lambda'(u) A_{\beta_0}(u, \beta)\, du$, and let $S_{n2}(\beta)$ be as in (3.31). Then under assumptions (A.1), (A.3), (A.6) and (A.8),
$$n^{-1} S_{n2}(\beta) = g(\beta_0)(\beta - \beta_0) + o_P(\|\beta_n - \beta_0\|). \qquad (3.74)$$
Proof: By Lemma 3.4.4, it suffices to show that
$$n^{-1} S_{n3} = g(\beta_0)(\beta - \beta_0) + o_P(\|\beta - \beta_0\|). \qquad (3.75)$$
Note that the left-hand side of (3.75) is the sum of
$$n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} \lambda'(u)\, Y_i(\phi(u, \beta_0, Z_i)) \{Z_i - \bar Z(u, \beta_n)\}\, \gamma(u, \beta_0, Z_i)^{tr}\, du\, (\beta_n - \beta_0) \qquad (3.76)$$
and
$$n^{-1} \sum_{i=1}^{n} \int_{-\infty}^{T^*} Y_i(\phi(u, \beta_0, Z_i)) \{Z_i - \bar Z(u, \beta_n)\} \{\lambda(J(u, \beta_n, \beta_0, Z_i)) - \lambda(u) - \lambda'(u)\, \gamma(u, \beta_0, Z_i)^{tr} (\beta_n - \beta_0)\}\, du, \qquad (3.77)$$
where $\gamma(u, \beta_0, Z_i)$ is defined in (3.15). By assumption (A.6), (3.76) converges to
$$\int_{-\infty}^{T^*} \lambda'(u) A_{\beta_0}(u, \beta_0)\, du\, (\beta_n - \beta_0) = g(\beta_0)(\beta_n - \beta_0). \qquad (3.78)$$
As for (3.77), it is bounded by
$$2 n^{-1} \|\beta_n - \beta_0\|^2 \sum_{i=1}^{n} \int_{-\infty}^{T^*} |\theta(u, \beta_n, Z_i)|\, du. \qquad (3.79)$$
Since the average over $n$ in (3.79) is bounded in probability by assumption (A.3), the proof is completed by (3.78) and (3.79). $\square$

3.4.3 Proof of Theorem 3.3.2

First, consider a univariate $\beta \in \mathbb{R}$. We form a mesh with spacing approaching zero from $-C$ to $C$ using a finite number of points $d_0, \ldots, d_m$. By Theorem 3.3.1, for $\beta_{n,i} = \beta_0 + d_i/\sqrt{n}$,
$$\max_{i \le m} n^{-1/2} \|S_n(\beta_{n,i}) - \tilde S_n(\beta_{n,i})\| \xrightarrow{P} 0. \qquad (3.80)$$
In order to prove (3.35), we must show that $n^{-1/2} S_n(\beta)$ does not fluctuate too much from $\beta_{n,i}$ to $\beta_{n,i+1}$ for $i = 1, \ldots, m$, for any choice of partition points $d_0, \ldots, d_m$. More specifically, for any $\varepsilon > 0$ there exists a positive $\delta$ such that, for $\beta_n = \beta_0 + \delta n^{-1/2}$,
$$\lim_{n \to \infty} P\left\{ \sup_{\beta_n \le \beta^* \le \beta_n + \delta n^{-1/2}} n^{-1/2} \|S_n(\beta^*) - S_n(\beta_n)\| \ge \varepsilon \right\} = 0, \qquad (3.81)$$
for any $|d| < C$. As for the case $\beta \in \mathbb{R}^p$, we consider a $p$-dimensional mesh by allowing $\beta$ to change one coordinate at a time. More specifically, for a fixed $\beta^*$, define $\check\beta_{n,j} = (\beta^{*(1)}, \ldots, \beta^{*(j)}, \beta_n^{(j+1)}, \ldots, \beta_n^{(p)})^{tr}$; then
$$S_n(\beta^*) - S_n(\beta_n) = \sum_{j=0}^{p-1} \left\{ S_n(\check\beta_{n,j+1}) - S_n(\check\beta_{n,j}) \right\}. \qquad (3.82)$$
Therefore
$$\|S_n(\beta^*) - S_n(\beta_n)\| \le p \cdot \max_j \|S_n(\check\beta_{n,j+1}) - S_n(\check\beta_{n,j})\|. \qquad (3.83)$$
On the right-hand side of (3.83), the change occurs only in the $(j+1)$th coordinate, with the other coordinates fixed. Consequently, the uniform convergence for $\beta \in \mathbb{R}^p$ follows once (3.81) is shown for $\beta \in \mathbb{R}$. For the rest of this subsection, assume without loss of generality that $\beta \in \mathbb{R}$. Recall that $V_i = \min(T_i, C_i)$, and define the residuals
$$r_i \equiv r(V_i, \beta, \beta_0, Z_i) = \phi^{-1}(V_i, \beta, Z_i) = J(\varepsilon_i, \beta_0, \beta, Z_i).$$
We can complete the proof of (3.81) by putting a probabilistic bound on the maximum change of $S_n(\beta^*)$ as $\beta^*$ varies from $\beta_n$ to $\beta_n + \delta n^{-1/2}$.
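The rank form of the score, recalled in (3.84) below, can be evaluated directly from data. The following hedged sketch assumes a scalar covariate and the location-shift choice $\phi(u, \beta, z) = u + \beta z$ (illustrative only; the dissertation's $\phi$ is general), under which the at-risk condition $V_j \ge \phi(r_i, Z_j, \beta)$ reduces to a comparison of residuals, and the step-valued score changes sign near $\beta_0$.

```python
import numpy as np

def rank_score(beta, V, Z, delta):
    """Rank form of S_n(beta) as in (3.84), for the assumed example
    phi(u, beta, z) = u + beta*z (so V_j >= phi(r_i, Z_j, beta)
    is equivalent to r_j >= r_i, since phi is increasing in u)."""
    r = V - beta * Z                       # residuals phi^{-1}(V_i, beta, Z_i)
    s = 0.0
    for i in range(len(V)):
        at_risk = r >= r[i]
        s += delta[i] * (Z[i] - Z[at_risk].mean())
    return float(s)

rng = np.random.default_rng(2)
n = 300
Z = rng.normal(size=n)
T = 1.5 * Z + rng.normal(size=n)           # assumed true beta0 = 1.5
C = 1.5 * Z + rng.normal(loc=3.0, size=n)  # random right censoring
V = np.minimum(T, C)
delta = (T <= C).astype(int)

# The score is a step function of beta that changes sign near beta0.
assert rank_score(0.8, V, Z, delta) * rank_score(2.2, V, Z, delta) < 0
```

Because the score depends on the data only through the ranks of the residuals, it is constant between the finitely many values of $\beta$ at which two residuals interchange, which is exactly the structure exploited by Lemma 3.4.6.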
Recall that
$$S_n(\beta) = \sum_{i=1}^{n} \Delta_i \left\{ Z_i - \frac{\sum_{j=1}^{n} Z_j\, I\{V_j \ge \phi(\varepsilon_i, Z_j, \beta)\}}{\sum_{j=1}^{n} I\{V_j \ge \phi(\varepsilon_i, Z_j, \beta)\}} \right\}, \qquad (3.84)$$
so $S_n(\beta^*)$ is a function of the ranks of the residuals $r_i$; hence a change in $S_n(\beta^*)$ occurs whenever the change of $\beta^*$ from $\beta_n$ to $\beta_n + \delta n^{-1/2}$ leads to a change in the ranks of the $r_i$, $i = 1, \ldots, n$. Therefore the maximum change of $S_n(\beta^*)$ can be bounded by computing $L_1 \times L_2$, where
$$L_1 = \#\{\text{pairs of interchanged ranks}\}, \qquad L_2 = \text{the maximum change of } S_n(\beta^*) \text{ for each such interchange}. \qquad (3.85)$$
In Lemma 3.4.6 we investigate the two factors of (3.85); then, using Lemma 3.4.6, we prove the uniform linearity stated in Theorem 3.3.2.

Lemma 3.4.6 Let $T^*$ be a value such that $P\{V_i \ge T^* + \xi\} \ge \psi > 0$. Define
$$B_n \equiv \left\{ X_\infty : \frac{1}{n} \sum_{i=1}^{n} I\{V_i \ge T^* + \xi\} \ge \frac{\psi}{2} \right\}, \qquad (3.86)$$
where $X_\infty = \{X_i\}_{i=1}^n$. Under assumptions (A.4) and (A.8), for any $\varepsilon > 0$ there exists $N_\varepsilon$ such that for any $n > N_\varepsilon$:

(i) $P\{B_n\} > 1 - \varepsilon$;

(ii) for $n > N_\varepsilon$ and $X_\infty \in B_n$, $P\{n L_2 \le 6c/\psi\} \ge 1 - \varepsilon$;

(iii) $L_1 = \sum_{i=1}^{n} \sum_{j \ne i} I(A_{ij})$, where $A_{ij}$ is the event $|V_i - V_j| \le c \cdot \delta n^{-1/2}$.

Proof: Conclusion (i) is guaranteed by the Law of Large Numbers. Next, consider (ii), which shows how to bound $L_2$ in (3.85). Whenever the change of $\beta^*$ from $\beta_n$ to $\beta_n + \delta n^{-1/2}$ causes an interchange in the ranks of the residuals, the interchange must happen between two adjacent order statistics of the $r_i$, $i = 1, \ldots, n$. Let $\{r_{(i)}\}_{i=1}^n$ be the order statistics of $\{r_i\}_{i=1}^n$, and denote the corresponding covariates and failure indicators by $Z_{(i)}(\beta^*)$ and $\Delta_{(i)}(\beta^*)$. Then $S_n(\beta^*)$ can be written as
$$\sum_{i=1}^{n} \Delta_{(i)}(\beta^*) \left\{ Z_{(i)}(\beta^*) - \bar Z_{(i)}(\beta^*) \right\}, \quad \text{where } \bar Z_{(i)}(\beta^*) = \frac{\sum_{k=i}^{n} Z_{(k)}(\beta^*)}{n - i + 1}. \qquad (3.87)$$
Now assume that the change of $\beta^*$ from $\beta_n$ to $\beta_n + \delta n^{-1/2}$ causes an interchange in ranks between two adjacent order statistics $r_{(j)}$ and $r_{(j+1)}$. The new value $S_n^+(\beta^*)$ is
$$\sum_{i=1}^{j-1} \Delta_{(i)}(\beta^*) \{Z_{(i)}(\beta^*) - \bar Z_{(i)}(\beta^*)\} + \Delta_{(j+1)}(\beta^*) \{Z_{(j+1)}(\beta^*) - \bar Z_{(j)}(\beta^*)\}$$
$$+\ \Delta_{(j)}(\beta^*) \left\{ Z_{(j)}(\beta^*) - \frac{\bar Z_{(j+2)}(\beta^*)(n - j - 1) + Z_{(j)}(\beta^*)}{n - j} \right\} + \sum_{i=j+2}^{n} \Delta_{(i)}(\beta^*) \{Z_{(i)}(\beta^*) - \bar Z_{(i)}(\beta^*)\}. \qquad (3.88)$$
Hence the difference of $S_n(\beta^*)$ before and after the interchange of $r_{(j)}$ and $r_{(j+1)}$ is (3.87) minus (3.88), which equals
$$\left\{ \Delta_{(j+1)}(\beta^*) - \Delta_{(j)}(\beta^*) \right\} \left\{ \frac{\bar Z_{(j+2)}(\beta^*)(n - j - 1)}{n - j} - \bar Z_{(j)}(\beta^*) \right\} + \frac{\Delta_{(j+1)}(\beta^*) Z_{(j+1)}(\beta^*)}{n - j} - \frac{\Delta_{(j)}(\beta^*) Z_{(j)}(\beta^*)}{n - j}, \qquad (3.89)$$
where we use the fact that
$$\bar Z_{(j+1)}(\beta^*) = \frac{\sum_{k=j+1}^{n} Z_{(k)}(\beta^*)}{n - j} = \frac{\bar Z_{(j+2)}(\beta^*)(n - j - 1) + Z_{(j+1)}(\beta^*)}{n - j}.$$
Then (3.89) is equal to
$$\begin{cases} \dfrac{Z_{(j+1)} - Z_{(j)}}{n - j} & \text{if } \Delta_{(j)} = \Delta_{(j+1)} = 1, \\[6pt] -\dfrac{Z_{(j)}}{(n - j)(n - j + 1)} + \dfrac{Z_{(j+1)}}{n - j + 1} + \dfrac{\bar Z_{(j+2)}(n - j - 1)}{(n - j)(n - j + 1)} & \text{if } \Delta_{(j)} = 1 \text{ and } \Delta_{(j+1)} = 0, \\[6pt] \dfrac{\bar Z_{(j+1)} - Z_{(j)}}{n - j + 1} & \text{if } \Delta_{(j)} = 0 \text{ and } \Delta_{(j+1)} = 1. \end{cases} \qquad (3.90)$$
In each of the three cases in (3.90), since the $\|Z_i\|$ are bounded by a constant $c$ with probability 1 under assumption (A.4), the change in $S_n(\beta^*)$ is bounded by a fixed multiple of $c/(n - j)$, where $n - j$ is the number of $r_i$'s at risk at the point where the interchange occurs. Consider $X_\infty \in B_n$, on which
$$\frac{1}{n} \sum_{i=1}^{n} I\{V_i \ge T^* + \xi\} \ge \frac{\psi}{2}.$$
Since $S_n(\beta)$ is computed from the $r_i$'s that are less than $\phi^{-1}(T^*, \beta_n, Z_i)$, for $X_\infty \in B_n$ the number of $r_i$'s at risk exceeds $n\psi/2$ whenever an interchange occurs, i.e., $n - j \ge n\psi/2$. Consequently, the change in $S_n(\beta)$ is bounded by $(6c/\psi) n^{-1}$.

Next we find $L_1$ in (3.85), i.e., the number of interchanges as $\beta^*$ varies from $\beta_n$ to $\beta_n + \delta n^{-1/2}$. An interchange between the pair $(i, j)$ occurs at the value $\beta_{ij}$ for which $r_i = r_j$.
Therefore, by assumptions (A.4) and (A.8), an interchange occurring for values of $\beta^*$ between $\beta_n$ and $\beta_n + \delta n^{-1/2}$ implies
$$|V_i - V_j| = \left| \phi(\varepsilon_i, \beta_n, Z_i) - \phi\Big(\varepsilon_i, \beta_n + \frac{\delta}{\sqrt{n}}, Z_j\Big) \right| \le C_2 \cdot \frac{\delta}{\sqrt{n}}. \qquad (3.91)$$
The total number of interchanges equals
$$L_1 = \sum_{i=1}^{n} \sum_{j \ne i} I(A_{ij}),$$
where $A_{ij}$ denotes the event in (3.91). $\square$

Proof of Theorem 3.3.2: In Lemma 3.4.6, we have shown that the maximum change of $S_n(\beta)$ after each interchange in ranks is bounded by $(6c/\psi) n^{-1}$, and that the number of interchanges as $\beta^*$ varies from $\beta_n$ to $\beta_n + \delta n^{-1/2}$ is $L_1$. If we can show that
$$\lim_{n \to \infty} P\{n^{-3/2} L_1 \ge \varepsilon\} = 0 \qquad (3.92)$$
for some properly chosen $\delta > 0$, then the proof is complete, i.e.,
$$\lim_{n \to \infty} P\left\{ \sup_{0 \le \|\beta^* - \beta_n\| \le \delta n^{-1/2}} n^{-1/2} \|S_n(\beta^*) - S_n(\beta_n)\| \ge \varepsilon \right\} = 0.$$
For $1 \le i < j \le n$, let $W_{ij} = I(A_{ij}) + I(A_{ji})$,
$$U_i = \sum_{j \ne i} \{E(W_{ij} \mid V_i) - E(W_{ij})\}, \qquad U_{ij} = W_{ij} - E(W_{ij} \mid V_i) - E(W_{ij} \mid V_j) + E(W_{ij}).$$
Then
$$L_1 - E\{L_1\} = \sum_{i=1}^{n} U_i + \sum_{i < j} U_{ij}.$$

$\ldots > 0$; for positive $\gamma$ that is close to 1, $E\{\bar k(X_i, \beta_0)^{-\gamma}\} < \infty$.

(A.5) For each $x$ and $\boldsymbol{x}$, $\nabla_\beta m_n(x, \boldsymbol{x}, \beta)$ exists for $\beta$ in $U_{\beta_0}$ and is continuous at $\beta_0$.

(A.6) $E\{\nabla_\beta m_n(x, \boldsymbol{x}, \beta_0)\}$ is of full rank $p$.

(A.7) $E[m_n(X, \boldsymbol{X}, \beta_0)\, m_n^{tr}(X, \boldsymbol{X}, \beta_0)]$ is positive definite.

(A.8) There exists a constant $M$ such that the $j$th component $\eta_j$ of $\eta = \eta(x, \boldsymbol{X}, \beta_0)$ in (4.10) satisfies $|\eta_j| \le M$, and
$$E\left\{ e^{|\eta_j|/M} - 1 - \frac{|\eta_j|}{M} \right\} M^2 \le \frac{1}{2}\, var(\eta_j).$$

(A.9) There exists $M$ such that for any $\beta$, $\|k(x, y, \beta)\| < M$.

(A.10) Let $\alpha = (x, \beta) \in \mathbb{R}^d \times U_{\beta_0}$ with the Euclidean norm, and write $k(x, y, \beta) = k_\alpha(y)$. There exists a measurable function $b(y) : \mathbb{R}^d \mapsto \mathbb{R}$ such that for any $\alpha_1 \ne \alpha_2$, $|k_{\alpha_1}(y) - k_{\alpha_2}(y)| \le b(y) \|\alpha_1 - \alpha_2\|$, and $E\{|b(X_1)|\} < \infty$.

(A.11) Let $\gamma > 0$ be close to 1; then $E\{\bar k(X_1, \beta_0)^{-\gamma}\} < \infty$.
4.2 Lemmas Parallel to Owen, and Qin and Lawless

In this section, we provide lemmas parallel to Lemmas 11.2 through 11.4 in Chapter 11 of [32], which Owen used to establish the Nonparametric Maximum Empirical Likelihood Estimator (NPMELE) and the Wilks-type theorem when the dimension $r$ of the estimating function equals the dimension $p$ of the structural parameter. Qin and Lawless also applied these lemmas when generalizing Owen's work to the case $r > p$. We will show that, under regularity conditions, even though the estimating functions of Extended Estimating Equations (EEE), denoted $m_n(X_i, \boldsymbol{X}, \beta)$, are no longer i.i.d., parallel versions of these lemmas continue to hold. The way to overcome the loss of independence in EEE is to draw on tools from empirical process theory. Compare the following two expressions:
$$m_n(X_i, \boldsymbol{X}, \beta) = Q(X_i, \beta) \left\{ C(X_i) - \frac{\sum_{j=1}^{n} C(X_j)\, k(X_i, X_j, \beta)}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} \right\} \qquad (4.12)$$
versus
$$V(X_i, \beta) = Q(X_i, \beta) \left\{ C(X_i) - \frac{E\{C(Y)\, k(X_i, Y, \beta) \mid X_i\}}{E\{k(X_i, Y, \beta) \mid X_i\}} \right\}. \qquad (4.13)$$
Applying the LLN to the numerator and denominator in equation (4.12), then applying Slutsky's lemma, we see that (4.12) and (4.13) are close when $n$ is large. In order to exploit the similarity of (4.12) and (4.13) uniformly over the index $i$, we need to show that the higher-order summations in the former converge to the corresponding terms in the latter in probability, uniformly in $X_i$ and $\beta \in U_{\beta_0}$. More precisely:

Lemma 4.2.1 Suppose $k(x, y, \beta) : \mathbb{R}^d \times \mathbb{R}^d \times U_{\beta_0} \mapsto \mathbb{R}^+$ and $C(x) : \mathbb{R}^d \mapsto \mathbb{R}^p$ satisfy assumptions (A.1), (A.2), (A.9) and (A.10), where $U_{\beta_0}$ is an open and bounded set in $\Theta$ that contains the true parameter value $\beta_0$.
Then
$$\sup_{(x, \beta) \in \mathcal{X} \times U_{\beta_0}} \left| \frac{1}{n} \sum_{j=1}^{n} k(x, X_j, \beta) - E\{k(x, X, \beta)\} \right| \xrightarrow{P} 0 \qquad (4.14)$$
and
$$\sup_{(x, \beta) \in \mathcal{X} \times U_{\beta_0}} \left| \frac{1}{n} \sum_{j=1}^{n} C(X_j)\, k(x, X_j, \beta) - E\{C(X)\, k(x, X, \beta)\} \right| \xrightarrow{P} 0. \qquad (4.15)$$
The proof of the lemma involves constructing Glivenko-Cantelli classes and applying the parametric classes discussed by van der Vaart in Chapter 19 of [41]. The following proposition is parallel to Lemma 11.2 in Chapter 11 of [32].

Proposition 4.2.1 Let $Z^*_n(\boldsymbol{X}, \beta) = \max_{1 \le i \le n} \|m_n(X_i, \boldsymbol{X}, \beta)\|$. Under assumptions (A.3), (A.4) and (A.6), for any fixed $\beta \in \Theta$,
$$Z^*_n(\boldsymbol{X}, \beta) = o(n^{1/2}). \qquad (4.16)$$
To prove Proposition 4.2.1, we decompose $m_n(X_i, \boldsymbol{X}, \beta)$ into an i.i.d. part and a non-i.i.d. quotient part; the conclusion follows by applying Lemma 11.2 of [32] to the two parts separately. The following two lemmas play roles equivalent to the Law of the Iterated Logarithm (LIL) in [32]. They will be applied to prove the EEE version of Lemma 11.4 in [32].

Lemma 4.2.2 Let $0 < \delta < 1/2$. Then under assumptions (A.2), (A.4), (A.5), (A.8), (A.9) and (A.10),
$$n^{-1/2} \sum_{i=1}^{n} m_n(X_i, \boldsymbol{X}, \beta_0) = O_P(n^\delta). \qquad (4.17)$$

Lemma 4.2.3 Let $0 < \delta < 1/2$. Then under assumptions (A.2), (A.4), (A.5), (A.8), (A.9) and (A.10),
$$\frac{1}{n} \sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta_0)\|^2 = O(n^\delta). \qquad (4.18)$$
The proofs of Lemmas 4.2.2 and 4.2.3 involve two steps. First, $m_n(X_i, \boldsymbol{X}, \beta_0)$ is split into the sum of an i.i.d. part and a higher-order summation quotient part. The magnitude of the i.i.d. term is given by the LIL, and the quotient is bounded by Bernstein's inequality, listed as a proposition in Section 4.3. We also prove both lemmas first for univariate $\beta$, then generalize the conclusion to the multivariate case. With the conclusion of Lemma 4.2.3, we present Proposition 4.2.2, which is parallel to Lemma 11.3 in Chapter 11 of [32].

Proposition 4.2.2 Under the assumptions of Lemma 4.2.3,
$$n^{-1} \sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta_0)\|^3 = o(n^{1/2}). \qquad (4.19)$$
The proof of Proposition 4.2.2 is a direct application of Proposition 4.2.1 and Lemma 4.2.3.
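The uniform convergence asserted in Lemma 4.2.1 can be visualized numerically. The sketch below uses the Lipschitz-in-parameter kernel $k_\alpha(y) = e^{-\beta (y - x)^2}$ with $Y \sim N(0, 1)$ (an illustrative class, not the dissertation's $k$), for which $E\, k_\alpha(Y) = (1 + 2\beta)^{-1/2} \exp\{-\beta x^2 / (1 + 2\beta)\}$ in closed form, and checks that the supremum over a grid of $\alpha = (x, \beta)$ of the empirical-versus-true gap shrinks as $n$ grows.

```python
import numpy as np

# Hedged illustration of the Glivenko-Cantelli behaviour in Lemma 4.2.1,
# for the assumed parametric class k_alpha(y) = exp(-beta*(y-x)^2).
rng = np.random.default_rng(3)

def sup_gap(n):
    Y = rng.normal(size=n)
    gap = 0.0
    for x in np.linspace(-1.0, 1.0, 21):
        for b in np.linspace(0.5, 1.5, 11):
            emp = np.exp(-b * (Y - x) ** 2).mean()
            tru = np.exp(-b * x * x / (1 + 2 * b)) / np.sqrt(1 + 2 * b)
            gap = max(gap, abs(emp - tru))
    return gap

# the supremum of the empirical error over the parameter grid shrinks with n
assert sup_gap(50_000) < sup_gap(500)
```

The class is uniformly bounded and Lipschitz in $(x, \beta)$ on the bounded grid, matching the structure of assumption (A.10).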
4.3 Some Proofs

Proof of Lemma 4.2.1: The convergence of (4.14) and (4.15) for fixed $(x, \beta)$ is guaranteed by the LLN, so the main concern is to prove uniformity in the parameter $(x, \beta)$. Let $\alpha = (x, \beta)$, $k_\alpha(y) = k(x, y, \beta)$ and $\mathcal{K} = \{k_\alpha(y), \alpha \in \mathcal{X} \times U_{\beta_0}\}$, where $\mathcal{X}$ and $U_{\beta_0}$ are bounded subsets of $\mathbb{R}^d$ and $\mathbb{R}^p$ as in Definition 1.2.1. Under assumption (A.10), $\mathcal{K}$ forms a parametric class in the sense of Example 19.7 on page 271 of [41], which refers to a class of functions Lipschitz in a finite-dimensional parameter on a bounded region, while the function domain may be unbounded. Therefore $\mathcal{K}$ is a Donsker class and, as in Theorem A.0.1, also a Glivenko-Cantelli class. Hence the convergence in (4.14) is uniform in $(x, \beta)$. As for (4.15), note that for $\mathcal{K}_C = \{C(y) k_\alpha(y), \alpha \in \mathcal{X} \times U_{\beta_0}\}$ and any $\alpha_1, \alpha_2 \in \mathcal{X} \times U_{\beta_0}$,
$$\|C(y) k_{\alpha_1}(y) - C(y) k_{\alpha_2}(y)\| \le \|C(y)\| \cdot \|k_{\alpha_1}(y) - k_{\alpha_2}(y)\|. \qquad (4.20)$$
Under assumption (A.10), the right-hand side of (4.20) is bounded by $\|C(y)\| \cdot b(y) \|\alpha_1 - \alpha_2\|$, and since $\|C(y)\|$ is bounded under assumption (A.2),
$$E\big|\, b(Y) \cdot \|C(Y)\| \,\big| < \infty. \qquad (4.21)$$
Therefore $\mathcal{K}_C$ also forms a parametric class in the sense of Theorem A.0.2 of Appendix A, and by the same reasoning used for (4.14), the convergence in (4.15) is also uniform in $(x, \beta) \in \mathcal{X} \times U_{\beta_0}$. $\square$

Proof of Proposition 4.2.1: Let $Z^*_n(\boldsymbol{X}, \beta) = \max_{1 \le i \le n} \|m_n(X_i, \boldsymbol{X}, \beta)\|$. Since
$$m_n(X, \boldsymbol{X}, \beta) = Q(X, \beta) \left\{ C(X) - \frac{\sum_{j=1}^{n} C(X_j) k(X, X_j, \beta)}{\sum_{j=1}^{n} k(X, X_j, \beta)} \right\},$$
for a fixed $\beta$, $Z^*_n(\boldsymbol{X}, \beta)$ is bounded by the sum of $A$ and $B$, where
$$A = \max_i \|Q(X_i, \beta) C(X_i)\|, \qquad B = \max_i \left\| Q(X_i, \beta)\, \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta)}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} \right\|.$$
By Lemma 11.2 in [32], under assumption (A.6), $A = o(n^{1/2})$. As for $B$, it is bounded by
$$\max_i |Q(X_i, \beta)| \cdot \max_i \left\| \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta)}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} \right\|.$$
Since the $k(\cdot, \cdot, \beta)$'s are assumed nonnegative in assumption (A.4),
$$\max_i \left\| \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta)}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} \right\| \le \max_i \frac{\sum_{j=1}^{n} k(X_i, X_j, \beta) \cdot \max_j \|C(X_j)\|}{\sum_{j=1}^{n} k(X_i, X_j, \beta)} = \max_j \|C(X_j)\|.$$
Then, under the boundedness assumption in (A.2),
$$B \le \max_i |Q(X_i, \beta)| \max_i \|C(X_i)\| = o(n^{1/2}).$$
Therefore for any fixed $\beta \in U_{\beta_0}$, $Z^*_n(\boldsymbol{X}, \beta) = o(n^{1/2})$. $\square$

Before giving the proof of Lemma 4.2.2, we state Bernstein's inequality. This is a well-known theorem and can be found in references such as [42].

Proposition 4.3.1 (Bernstein's inequality) Let $X_1, \ldots, X_n$ be independent variables with zero mean such that $E|X_i|^m \le m!\, M^{m-2} v_i / 2$ for every $m \ge 2$, all $i$, and some constants $M$ and $v_i$. Then
$$P(|X_1 + \cdots + X_n| > x) \le 2 \exp\left\{ -\frac{1}{2} \cdot \frac{x^2}{v + Mx} \right\},$$
for $v \ge v_1 + \cdots + v_n$.

Proof of Lemma 4.2.2: Let us start with the univariate case $\beta_0 \in \Theta \subset \mathbb{R}$ and $C(x) : \mathbb{R}^d \mapsto \mathbb{R}$. We can split $\sqrt{n}\, \bar m_n(\boldsymbol{X}, \beta_0)$ as the difference of the two terms
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Q(X_i, \beta_0) \left\{ C(X_i) - \frac{k_c(X_i, \beta_0)}{\bar k(X_i, \beta_0)} \right\} \qquad (4.22)$$
and
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} Q(X_i, \beta_0) \left\{ \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0)}{\sum_{j=1}^{n} k(X_i, X_j, \beta_0)} - \frac{k_c(X_i, \beta_0)}{\bar k(X_i, \beta_0)} \right\}. \qquad (4.23)$$
Note that (4.22) is a sum of i.i.d. terms; therefore, under assumption (A.5), it is $O(\sqrt{\ln\ln n})$ by the LIL, and asymptotically normal by the CLT. Next we evaluate the order of magnitude of (4.23). To begin with, rewrite (4.23) as
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)}{\bar k(X_i, \beta_0) \sum_{j=1}^{n} k(X_i, X_j, \beta_0)} \left\{ \bar k(X_i, \beta_0) \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \sum_{j=1}^{n} k(X_i, X_j, \beta_0) \right\}. \qquad (4.24)$$
Then we can split (4.24) into the difference of the following two terms:
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)}{\frac{1}{\sqrt{n}} \sum_{j=1}^{n} k(X_i, X_j, \beta_0)} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \big( C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big) \tag{4.24.a}$$
and
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)\, k_c(X_i, \beta_0)}{\bar k(X_i, \beta_0) \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{n} k(X_i, X_j, \beta_0)} \cdot \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \big( k(X_i, X_j, \beta_0) - \bar k(X_i, \beta_0) \big). \tag{4.24.b}$$
Let us consider the following term in the numerator of (4.24.a):
$$\frac{1}{\sqrt{n}} \sum_{j=1}^{n} \big\{ C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big\} = \frac{1}{\sqrt{n}} \sum_{j : j \ne i} \big\{ C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big\} + \frac{1}{\sqrt{n}} \big\{ C(X_i) k(X_i, X_i, \beta_0) - k_c(X_i, \beta_0) \big\}. \tag{4.24.a.1}$$
Note that by Proposition 4.2.1, for any $i = 1, \ldots, n$, the norm of the second term in equation (4.24.a.1) is bounded by
$$\frac{1}{\sqrt{n}} \max_i \|C(X_i) k(X_i, X_i, \beta_0) - k_c(X_i, \beta_0)\| = \frac{1}{\sqrt{n}} \cdot o(n^{1/2}) = o(1).$$
Therefore, for $i = 1, \ldots, n$, we can rewrite (4.24.a.1) as
$$\frac{1}{\sqrt{n}} \sum_{j : j \ne i} \big\{ C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big\} + o(1). \tag{4.24.a.2}$$
Then, substituting (4.24.a.2) for (4.24.a.1), we know that (4.24.a) is equal to
$$\frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)}{\frac{1}{\sqrt{n}} \sum_{j=1}^{n} k(X_i, X_j, \beta_0)} \left\{ \frac{1}{\sqrt{n}} \sum_{j : j \ne i} \big( C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \big) + o(1) \right\}. \tag{4.24.a.3}$$
By (4.8),
$$E\{k_c(X_i, \beta_0)\} = E\{C(X_j) k(X_i, X_j, \beta_0)\},$$
so the terms inside the summation of (4.24.a.2) have expectation zero. Moreover, conditioned on $X_i$, (4.24.a.2) is a sum of i.i.d. terms with zero mean; under assumption (A.8) we can apply Bernstein's inequality in Proposition 4.3.1 with the choice $x = k_n = c n^{(\delta+1)/2}$. Since $\delta$ is between 0 and 1/2,
$$\frac{k_n^2}{n\sigma + M k_n} = \frac{n^{1+\delta}}{n\sigma + M n^{(1+\delta)/2}} \sim \frac{n^\delta}{\sigma} \quad \text{as } n \to \infty,$$
where $\sigma = \sup_{x \in \mathcal{X}} var\{C(X_1) k(x, X_1, \beta_0) - \bar k_c(x, \beta_0)\}$, and $\sim$ means that the ratio of the two expressions converges to 1. Hence by Proposition 4.3.1, for $i = 1, \ldots, n$,
$$P\left\{ \Big| \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \Big| \ge k_n \,\Big|\, X_i \right\} \le 2 e^{-c n^\delta}.$$
Putting together all such sets for different $i$, we get
$$P\left\{ \max_i \Big| \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \Big| \ge k_n \right\} \le 2 n e^{-c n^\delta}.$$
Therefore
$$\sum_{n=1}^{\infty} P\left\{ \max_i \frac{1}{\sqrt{n}} \Big| \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \Big| \ge \frac{k_n}{\sqrt{n}} \right\} \le \sum_{n=1}^{\infty} 2 n e^{-c n^\delta} < \infty. \qquad (4.25)$$
By the Borel-Cantelli lemma, the inequality above implies that for all $n$ sufficiently large,
$$\max_i \frac{1}{\sqrt{n}} \Big| \sum_{j \ne i} C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0) \Big| \le \frac{k_n}{\sqrt{n}} = c n^\delta, \quad a.s. \qquad (4.26)$$
Now, let us discuss the denominator in (4.24.a), namely
$$\frac{1}{\sqrt{n}} \sum_{j=1}^{n} k(X_i, X_j, \beta_0). \qquad (4.27)$$
In the following we show that, under the assumptions made so far together with assumption (A.11), (4.27) is $O(n^\delta)$. Split (4.27) into
$$\frac{1}{\sqrt{n}} \sum_{j : i \in J} k(X_i, X_j, \beta_0) + \frac{1}{\sqrt{n}} \sum_{j : i \notin J} k(X_i, X_j, \beta_0), \qquad (4.28)$$
where $J \equiv \{\bar k(X_i, \beta_0) < c' n^{(\delta-1)/2}\}$.
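Bernstein's inequality (Proposition 4.3.1), used repeatedly above, can be checked numerically. The sketch below assumes centered Uniform$(-1, 1)$ summands, for which the moment condition holds with $M = 1$ and $v_i = 1/3$, and verifies that the empirical tail probability of $|X_1 + \cdots + X_n|$ never exceeds the Bernstein bound.

```python
import numpy as np

# Hedged numeric check of Bernstein's inequality (Proposition 4.3.1).
# For X_i ~ Uniform(-1,1): E|X|^m = 1/(m+1) <= m! * 1^(m-2) * (1/3) / 2,
# so the moment condition holds with M = 1 and v_i = 1/3.
rng = np.random.default_rng(4)
n, reps = 200, 20_000
S = rng.uniform(-1, 1, size=(reps, n)).sum(axis=1)
v, M = n / 3.0, 1.0

for x in (10.0, 15.0, 20.0):
    bound = 2 * np.exp(-0.5 * x * x / (v + M * x))
    emp = (np.abs(S) > x).mean()       # Monte Carlo tail probability
    assert emp <= bound
```

The bound is loose for moderate deviations and sharpens in the exponent for large $x$, which is exactly the regime exploited in the choice of $k_n$ above.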
With the same argument used to deduce (4.25), by Bernstein's inequality, for $\delta^* = \delta/3$,
$$\max_i \Big| \frac{1}{\sqrt{n}} \sum_{j : j \ne i} k(X_i, X_j, \beta) - \bar k(X_i, \beta_0) \Big| \le c n^{\delta^*/2}, \quad a.s., \qquad (4.29)$$
which implies that we can rewrite (4.28) as
$$\frac{1}{\sqrt{n}} \sum_{J} k(X_i, X_j, \beta_0) + I\{\bar k(X_i, \beta_0) \ge c' n^{(\delta-1)/2}\} \cdot \{\bar k(X_i, \beta_0) + O(n^{(\delta^*-1)/2})\}. \qquad (4.30)$$
Next, we discuss the order of magnitude of the two terms in (4.30). Since $k(\cdot, \cdot, \cdot)$ is assumed bounded by $M$ in assumption (A.9), and the first term in (4.30) is symmetric with respect to $i$, the first term in (4.30) is bounded in expectation by
$$\frac{2M}{\sqrt{n}}\, E\{\#\{i \in J\}\} = 2M \sqrt{n}\, P\{\bar k(X_i, \beta_0) < c' n^{(\delta-1)/2}\} = 2M \sqrt{n}\, P\{\bar k(X_i, \beta_0)^{-\gamma} > c'^{-\gamma} n^{\gamma(1-\delta)/2}\} \le 2M c'^{\gamma} E\{\bar k(X_i, \beta_0)^{-\gamma}\}\, n^{\frac{1}{2} - \frac{\gamma(1-\delta)}{2}} \le n^\delta, \qquad (4.31)$$
where $\gamma$ is the positive constant close to 1 defined in (A.11) and the inequality is attained by Chebyshev's inequality. So in (4.30), the first term is $O_P(n^\delta)$. Combining (4.26)-(4.31), we see that (4.24.a.3) is bounded in probability by
$$\frac{2}{n} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)}{O_P(n^\delta) + I\{\bar k(X_i, \beta_0) \ge c' n^{(\delta-1)/2}\}\, \bar k(X_i, \beta_0) + O(n^{(\delta^*-1)/2})} \cdot n^\delta. \qquad (4.32)$$
With the choice $\delta^* = \delta/3$, by the LIL for i.i.d. summands, the order of the average in (4.32) is $O(\sqrt{\ln\ln n / n})$. So far we have shown that (4.24.a.3) is $O(n^\delta)$; since (4.24.a.3) equals (4.24.a), the latter is also $O(n^\delta)$. Using the same strategy, we can also prove that (4.24.b) is $O(n^\delta)$. Therefore $\sqrt{n}\, \bar m_n(\boldsymbol{X}, \beta_0) = O(n^\delta)$.

Next, we generalize our conclusion to vector-valued $\beta_0$ and $C(x)$. It suffices to show that (4.26) holds for $\beta_0 \in \mathbb{R}^p$ and $C(x) : \mathbb{R}^d \mapsto \mathbb{R}^p$. Define the $p$-dimensional vectors $\eta_j = (\eta_{j1}, \ldots, \eta_{jp})^{tr}$, where
$$\eta_j = C(X_j) k(X_i, X_j, \beta_0) - k_c(X_i, \beta_0), \quad j = 1, \ldots, n,\ i \ne j.$$
Then by (4.26), for each component $\eta_{jk}$ of $\eta_j$,
$$P\left\{ \limsup_{n \to \infty} \frac{1}{\sqrt{n}} \sum_{j : j \ne i} \eta_{jk} \ge \frac{k_n}{\sqrt{n}} \right\} = 0, \quad k = 1, \ldots, p. \qquad (4.33)$$
Let $t = (t_1, \ldots, t_p)^{tr}$ be a unit vector in $\mathbb{R}^p$; then
$$\Big| \sum_{j : j \ne i} t^{tr} \eta_j \Big| = \Big| \sum_{k=1}^{p} t_k \sum_{j : j \ne i} \eta_{jk} \Big| \le \sum_{k=1}^{p} |t_k| \Big| \sum_{j : j \ne i} \eta_{jk} \Big|, \qquad (4.34)$$
where we attain the inequality using the triangle inequality. Since $\|t\| = 1$, $|t_k| \le 1$ for $k = 1, \ldots, p$, so together with (4.34),
$$\Big| \sum_{j : j \ne i} t^{tr} \eta_j \Big| \le \sum_{k=1}^{p} \Big| \sum_{j : j \ne i} \eta_{jk} \Big|. \qquad (4.35)$$
Note that (4.35) holds for any unit vector in $\mathbb{R}^p$, which indicates that
$$P\left\{ \sup_{t \in \mathbb{R}^p, \|t\| = 1} \Big| \sum_{j : j \ne i} t^{tr} \eta_j \Big| \ge k_n \right\} \le P\left\{ \sum_{k=1}^{p} \Big| \sum_{j : j \ne i} \eta_{jk} \Big| \ge k_n \right\} \le p \max_k P\left\{ \Big| \sum_{j : j \ne i} \eta_{jk} \Big| \ge \frac{k_n}{p} \right\}. \qquad (4.36)$$
Combining (4.33) and (4.36), we know that
$$P\left\{ \limsup_{n \to \infty} \sup_{t \in \mathbb{R}^p, \|t\| = 1} \frac{1}{\sqrt{n}} \Big| \sum_{j : j \ne i} t^{tr} \eta_j \Big| \ge \frac{k_n}{\sqrt{n}} \right\} = 0. \qquad (4.37)$$
$\square$

Proof of Lemma 4.2.3: Note that we can split the left-hand side of (4.18) as the sum of the following three expressions:
$$\frac{1}{n} \sum_{i=1}^{n} Q(X_i, \beta_0)^2\, C(X_i)^{tr} C(X_i) = O\left( \sqrt{\frac{\ln\ln n}{n}} \right), \tag{4.18.a}$$
$$-\frac{2}{n} \sum_{i=1}^{n} Q(X_i, \beta_0)^2\, C(X_i)^{tr}\, \frac{\sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0)}{\sum_{j=1}^{n} k(X_i, X_j, \beta_0)} = O(n^\delta), \tag{4.18.b}$$
and
$$\frac{1}{n} \sum_{i=1}^{n} Q(X_i, \beta_0)^2\, \frac{\left[ \frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) \right]^{tr} \left[ \frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) \right]}{\left[ \frac{1}{n} \sum_{j=1}^{n} k(X_i, X_j, \beta_0) \right]^2}, \tag{4.18.c}$$
where we get the orders of (4.18.a) and (4.18.b) by applying the LIL and the same strategy used in the proof of (4.26), respectively. Next, we investigate the order of (4.18.c). By the proof of Lemma 4.2.2, an almost sure upper bound for (4.18.c) is
$$\frac{4}{n} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)^2}{\bar k(X_i, \beta_0)^2} \left[ \frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) \right]^{tr} \left[ \frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) \right]. \qquad (4.38)$$
Similar to the proof of Lemma 4.2.2, let us first assume that $C(X)$ and $\beta_0$ are scalar-valued, then generalize the conclusion to the vector-valued case. Applying the same method by which equation (4.24.a.2) was obtained, we can show that for $i = 1, \ldots, n$,
$$\frac{1}{n} \sum_{j=1}^{n} C(X_j) k(X_i, X_j, \beta_0) = \frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) + o(1).$$
Next, we show that for $k_n = O(n^{\frac{\delta}{2} + \frac{5}{4}})$,
$$\frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \le \frac{k_n}{n}, \quad a.s., \ \text{for } i = 1, \ldots, n.$$

Case 1: If $C(X) k(x, X, \beta_0)$ is centered at 0 under $\beta_0$, let $k_n = O(n^{\frac{\delta}{2} + \frac{5}{4}})$ be the constant mentioned in Bernstein's inequality. Since $\delta$ is between 0 and 1/2,
$$O\left( \frac{x^2}{n\sigma + Mx} \right) = O\left( \frac{n^{\delta + \frac{5}{2}}}{n\sigma + M n^{\frac{\delta}{2} + \frac{5}{4}}} \right) = O\left( n^{\frac{\delta}{2} + \frac{5}{4}} \right).$$
Hence by Proposition 4.3.1, for $i = 1, \ldots, n$,
$$P\left\{ \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \ge k_n \,\Big|\, X_i \right\} \le e^{-k_n^2/(n\sigma + M k_n)} = e^{-c n^{\frac{\delta}{2} + \frac{5}{4}}}.$$
Putting together all such sets for $i = 1, \ldots, n$, we get
$$P\left\{ \max_i \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \ge k_n \right\} \le n\, e^{-c n^{\frac{\delta}{2} + \frac{5}{4}}}.$$
Therefore
$$\sum_{n=1}^{\infty} P\left\{ \max_i \frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \ge \frac{k_n}{n} \right\} \le \sum_{n=1}^{\infty} n\, e^{-c n^{\frac{\delta}{2} + \frac{5}{4}}} < \infty.$$
By the Borel-Cantelli lemma, the inequality above indicates that
$$P\left\{ \limsup_{n \to \infty} \frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \ge \frac{k_n}{n} \right\} = 0, \qquad (4.39)$$
which shows that
$$\frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) = O\left( \frac{k_n}{n} \right) = O\left( n^{\frac{\delta}{2} + \frac{1}{4}} \right),$$
and that
$$\frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) \le \frac{k_n}{n}, \quad a.s. \qquad (4.40)$$

Case 2: If $C(X) k(x, X, \beta_0)$ is centered at $E\{C(X) k(x, X, \beta_0)\} = \mu(x, \beta_0) \ne 0$, then by (4.39),
$$\frac{1}{n} \sum_{j : j \ne i} \{C(X_j) k(X_i, X_j, \beta_0) - \mu(X_i, \beta_0)\} = O\left( \frac{k_n}{n} \right) = O\left( n^{\frac{\delta}{2} + \frac{1}{4}} \right), \qquad (4.41)$$
and
$$\frac{1}{n} \sum_{j : j \ne i} \{C(X_j) k(X_i, X_j, \beta_0) - \mu(X_i, \beta_0)\} \le \frac{k_n}{n}, \quad a.s. \qquad (4.42)$$
From (4.41) and (4.42), we know that
$$\frac{1}{n} \sum_{j : j \ne i} C(X_j) k(X_i, X_j, \beta_0) = \mu(X_i, \beta_0) + O\left( n^{\frac{\delta}{2} + \frac{1}{4}} \right) = O\left( n^{\frac{\delta}{2} + \frac{1}{4}} \right) \qquad (4.43)$$
and
$$\frac{1}{n} \sum_{j \ne i} C(X_j) k(X_i, X_j, \beta_0) \le \frac{k_n + n \mu(X_i, \beta_0)}{n} = O\left( \frac{k_n}{n} \right), \quad a.s. \qquad (4.44)$$
Plugging (4.44) into (4.38) yields an almost sure upper bound for (4.18.c):
$$\frac{4}{n} \sum_{i=1}^{n} \frac{Q(X_i, \beta_0)^2}{\bar k(X_i, \beta_0)^2} \cdot \frac{k_n}{n} \cdot \frac{k_n}{n} = O\left( \sqrt{\frac{\ln\ln n}{n}} \right) \cdot O\left( \frac{n^{\frac{\delta}{2} + \frac{5}{4}}}{n} \right) \cdot O\left( \frac{n^{\frac{\delta}{2} + \frac{5}{4}}}{n} \right) = O(n^\delta). \qquad (4.45)$$
Therefore $\sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta)\|^2 / n$ is also $O(n^\delta)$. This conclusion can be extended to $\beta_0 \in \mathbb{R}^p$ using the same strategy as in the proof of Lemma 4.2.2. $\square$

Proof of Proposition 4.2.2: Since $Z^*_n(\boldsymbol{X}, \beta_0) = \max_{1 \le i \le n} \|m_n(X_i, \boldsymbol{X}, \beta_0)\|$, write
$$\frac{1}{n} \sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta)\|^3 \le Z^*_n(\boldsymbol{X}, \beta_0) \cdot \frac{1}{n} \sum_{i=1}^{n} \|m_n(X_i, \boldsymbol{X}, \beta_0)\|^2. \qquad (4.46)$$
By Proposition 4.2.1, $Z^*_n(\boldsymbol{X}, \beta_0) = o(n^{1/2})$. By Lemma 4.2.3, the second factor on the right-hand side of (4.46) is $O(n^\delta)$ with $\delta$ between 0 and 1/2.
Therefore

    (1/n) Σ_{i=1}^n ‖m_n(X_i, X, β0)‖^3 = o(n^{1/2}).  □

Chapter 5: Technical Results II: Zero in the Convex Hull Theorems

In this chapter, we continue to discuss the technical results for Classic Estimating Equations (CEE) and Extended Estimating Equations (EEE), with the same notation and assumptions as in Chapter 4, pages 65–68. We define a neighbourhood of β0 such that for all β in this neighbourhood, the Empirical Likelihood (EL) method has a unique maximizer with probability approaching 1.

Under the CEE setting, consider random samples X_1, …, X_n from the distribution family P_θ, where θ = (β, λ) ∈ Θ × H, Θ ⊂ R^p and H is infinite-dimensional. Using the Lagrange multiplier method, it can be shown that for fixed β, if there exists a unique solution to

    max_p Π_{i=1}^n p_i,  where p = (p_1, …, p_n),
    subject to Σ_{i=1}^n p_i = 1, p_i ∈ (0, 1), Σ_{i=1}^n p_i m(X_i, β) = 0,  (5.1)

then the solution is given by

    p̂_i(β, X_i, X) = (1/n) · 1/(1 + t^tr m(X_i, β)),  where t = t(β, X) solves Σ_{i=1}^n m(X_i, β)/(1 + t^tr m(X_i, β)) = 0.  (5.2)

A typical interpretation of p̂_i(β, X_i, X) is that the distribution of X_1 is approximated by the modified empirical measure Σ_{i=1}^n p_i δ_{X_i}. This is why the condition

    E_{θ0}{ m(X_1, β0) } = 0

is rendered through the approximating distribution of X_1 as Σ_{i=1}^n p_i m(X_i, β0) = 0, connecting the p_i's and β.

Owen in [32], and Qin and Lawless in [33], claimed that for a fixed β, a sufficient condition for problem (5.2) to have a locally unique solution is "zero in the convex hull", i.e., for fixed β ∈ U_{β0}, 0 ∈ Conv(β) ⊂ R^p, where

    Conv(β) ≡ { Σ_{i=1}^n p_i m(X_i, β) : Σ_{i=1}^n p_i = 1, p_i ∈ (0, 1) },  for β ∈ U_{β0}.  (5.3)

However, they did not state explicitly how the set U_{β0} in (5.3) is constructed, nor whether (5.3) is a deterministic fact or an asymptotic result. In the following section, we answer these two questions. Then we generalize the conclusions to discontinuous estimating functions m(x, ·).
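The "zero in the convex hull" condition can be probed numerically through the separating-hyperplane characterization: 0 lies in the interior of the convex hull of m(X_1, β), …, m(X_n, β) only if no unit vector t satisfies max_i t^tr m(X_i, β) < 0. The sketch below (not part of the dissertation's development) samples random directions as a necessary-condition check; a negative return value is a certificate that 0 is outside the hull, while a positive value over many sampled directions is only heuristic evidence of containment. The toy centered sample stands in for the m(X_i, β).

```python
import numpy as np

def zero_in_hull_check(m, n_dirs=2000, seed=None):
    """Randomized separating-hyperplane check: returns
    min over sampled unit directions t of max_i t'm_i.
    A negative value certifies that a separating hyperplane exists
    (so 0 is NOT in the convex hull of the rows of m)."""
    rng = np.random.default_rng(seed)
    m = np.asarray(m, dtype=float)                 # shape (n, p)
    t = rng.normal(size=(n_dirs, m.shape[1]))
    t /= np.linalg.norm(t, axis=1, keepdims=True)  # random unit directions
    return float((t @ m.T).max(axis=1).min())

rng = np.random.default_rng(0)
# centered sample: E m(X, beta0) = 0, so 0 should be well inside the hull
centered = rng.normal(size=(200, 2))
# shifted sample: mean far from 0, so a separating hyperplane exists
shifted = centered + 5.0

assert zero_in_hull_check(centered, seed=1) > 0   # no separator found
assert zero_in_hull_check(shifted, seed=1) < 0    # separator found near -mean/|mean|
```

The check is one-sided by design: certifying interior membership exactly would require a linear program over the weights p_i, but the directional form above mirrors the separating-hyperplane argument used later in the proofs.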
In Section 5.2, we prove the existence and uniqueness of solutions to (5.1) under the EEE setting with an estimating function m_n(x, x, β) that is continuous with respect to β. For the discontinuous case, we prove the conclusion for the ϕ-transformation model, which to our knowledge is the broadest class of semiparametric models satisfying the EEE definitions.

5.1 Classic Estimating Equation

In this section, under the CEE setting, we establish the uniqueness of solutions to the maximization problem (5.1), initially for continuous estimating functions, then for discontinuous ones.

5.1.1 Continuous Criterion Function

We prove the existence and uniqueness of solutions to (5.1) in the following steps. First, we demonstrate that (5.3) is true when β = β0 and that p̂_i(β0, X_i, X) is the calculus maximizer of (5.1). Then, using the continuity of m(x, ·), we apply Rolle's theorem to the gradient of the Lagrangian of the negative logarithm of the Profile Empirical Likelihood (pEL) function to prove the uniqueness of the solution to the second equation in (5.2). Finally, combining the results of the previous two steps with a continuation method, we prove that there exists a neighbourhood of β0 on which (5.1) has a calculus maximum with probability approaching 1.

Let us start with the first step described in the previous paragraph, namely showing that 0 ∈ Conv(β0) with probability approaching 1.

Lemma 5.1.1 For i.i.d. random variables X_1, …, X_n, assume that m(x, ·) is continuous, and

    E{ m(X_1, β0) } = 0;  (5.4)
    E{ m(X_1, β0)^{⊗2} } = Σ is positive definite;  (5.5)
    E ‖m(X_1, β0)‖^3 < ∞.  (5.6)

For any constant K > 0, let X = {X_i}_{i=1}^∞, and let t0 be a unit vector in R^p.
(a) Let λ0 be the smallest eigenvalue of Σ, and let I ∈ R^{p×p} be the identity matrix. Define

    A^{(1)}_{K,n} = { X : inf_{‖t0‖=1} (1/√n) Σ_{i=1}^n t0^tr m(X_i, β0) ≥ −K },  (5.7)
    A^{(2)}_{ε,n} = { X : ‖ (1/n) Σ_{i=1}^n m(X_i, β0)^{⊗2} − Σ ‖_2 ≤ ε },  (5.8)
    A^{(3)}_n = { X : (1/n) Σ_{i=1}^n ‖m(X_i, β0)‖^3 ≤ 2 · E‖m(X_1, β0)‖^3 },  (5.9)
    A^{(4)}_n = { X : (1/n) Σ_{i=1}^n m(X_i, β0)^{⊗2} ≥ (1/2) λ0 I }.  (5.10)

Then for any ε ∈ (0, λ0), there exist K_ε and N_ε such that

(i) for any n ≥ N_ε, P{ A^{(1)}_{K_ε,n} } ≥ 1 − ε;

(ii) the following limits are all identical to 1:

    lim_{n1→∞} P{ ∩_{n≥n1} A^{(2)}_{ε,n} },  lim_{n1→∞} P{ ∩_{n≥n1} A^{(3)}_n },  lim_{n1→∞} P{ ∩_{n≥n1} A^{(4)}_n }.

(b) Let Conv_n(β0) be the convex hull of the m(X_i, β0). For any ε > 0, there exist K_ε and N_ε such that P{A*_ε} ≥ 1 − ε, where

    A*_ε = ∩_{n≥N_ε} { A^{(1)}_{K_ε,n} ∩ A^{(2)}_{ε,n} ∩ A^{(3)}_n ∩ A^{(4)}_n },  (5.11)

and for n > N_ε and X ∈ A*_ε, P{ 0 ∈ int(Conv_n(β0)) | X ∈ A*_ε } ≥ 1 − ε, where in general int(·) denotes the interior of a set.

Now, let us go back to the EL problem, which is essentially solving

    max_p Π_{i=1}^n p_i,  where p = (p_1, …, p_n),
    subject to p_i > 0, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i m(X_i, β) = 0,  (5.12)

for any fixed β. Given β, by the concavity of Σ_{i=1}^n ln(p_i) in p, a unique maximum exists provided that 0 is in the interior of Conv_n(β), the convex hull of m(X_1, β), …, m(X_n, β). By Lemma 5.1.1, for any ε there exists N_ε such that for any n > N_ε and X ∈ A*_ε, P{ 0 ∈ int(Conv_n(β0)) | X ∈ A*_ε } > 1 − ε. Thus for n > N_ε, with β = β0, there exists a unique solution p̂_0 to the maximization problem in (5.12) with probability greater than 1 − ε.

To conclude that the unique maximizer over the p's for fixed β = β0 is a calculus maximizer, we need to show that there is a ball of dimension n − 1 − p for the p's within which to take derivatives, where p is the dimension of m(X_1, β0). Let 1 be a 1 × n vector with all entries equal to 1, let 0 be a p × 1 vector with all entries equal to 0, let M_0 = (m(X_1, β0), …, m(X_n, β0)) be a p × n matrix, and let M = (1^tr, M_0^tr)^tr be a (p + 1) × n matrix.
Lemma 5.1.1 guarantees the existence and uniqueness of the solution to maximization problem (5.12). Use p̂_0 to denote that solution; then p̂_0 belongs to

    P ≡ { p : M p = (1, 0^tr)^tr, all p_i > 0, Σ_{i=1}^n p_i = 1 }.  (5.13)

For any v in the null space of M and α ∈ R, define

    p = (1/(1 + α v · 1)) · (p̂_0 + α v);

we know that M p = (1, 0^tr)^tr, hence for small α such that the entries of p are all positive, p belongs to P. Furthermore, v is in the null space of M, and rank(M) = p + 1, so dim(P) = n − p − 1. Observing that X ∈ A^{(2)}_{ε,n} for ε < λ0, there is a relatively open set P of dimension n − p − 1 within which we can take derivatives.

Next, we give the form of the unique solution to (5.12) when β = β0 and n > N_ε, using Lagrange multipliers λ ∈ R and t ∈ R^p. Define G(p, λ, t) as

    G = Σ_{i=1}^n ln p_i − λ ( Σ_{i=1}^n p_i − 1 ) − n t^tr Σ_{i=1}^n p_i m(X_i, β0).

To maximize the concave function G, differentiate G with respect to p_i, i = 1, …, n, then set the derivatives to zero:

    ∂G/∂p_i = 1/p_i − λ − n t^tr m(X_i, β0) = 0,  i = 1, …, n.  (5.14)

Multiply the equations above by p_i, then add them together:

    n − λ Σ_{i=1}^n p_i − n t^tr Σ_{i=1}^n p_i m(X_i, β0) = 0.  (5.15)

By the constraints in (5.12), (5.15) implies that n − λ = 0, so λ = n. Therefore, from equation (5.14),

    p̂_i = 1 / ( n { 1 + t0^tr m(X_i, β0) } ).  (5.16)

Since Σ_{i=1}^n p̂_i m(X_i, β0) = 0, the vector t must solve the following equation:

    (1/n) Σ_{i=1}^n m(X_i, β0) / (1 + t^tr m(X_i, β0)) = 0.  (5.17)

Important Event  In Lemma 5.1.1, we proved that for any ε, there exists N_ε such that P{A*_ε} ≥ 1 − ε, where

    A*_ε = ∩_{n≥N_ε} { A^{(1)}_{K_ε,n} ∩ A^{(2)}_{ε,n} ∩ A^{(3)}_n ∩ A^{(4)}_n }.

Furthermore, let Σ(β, β0) = E_{β0}{ m(X_1, β)^{⊗2} }. Since Σ(β, β0) is continuous, and

    Σ = E_{β0}{ m(X_1, β0)^{⊗2} } = Σ(β0, β0)

is positive definite, there exists U_{β0}, a neighbourhood of β0, such that for all β in U_{β0} the matrix Σ(β, β0) is positive definite, i.e.,

    U_{β0} ≡ { β : Σ(β, β0) is positive definite }.  (5.18)

Then, consider the following class of functions:

    G ≡ { m(x, β) : β ∈ U_{β0} }.

In Chapter 4, we assumed

(A.11) Let α = (x, β) ∈ R^d × Θ with the Euclidean norm, and k(x, y, β) = k_α(y). There exists a measurable function b(y): R^d → R such that for any α1 ≠ α2,

    |k_{α1}(y) − k_{α2}(y)| ≤ b(y) ‖α1 − α2‖.

Since m(x, β) is Lipschitz continuous with respect to β, by Example 19.7 on page 271 of [41], we know that G is a Donsker class, which implies that

    sup_{β∈U_{β0}} ‖ (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} − Σ(β, β0) ‖_2 → 0,  a.s.,

therefore P{ A^{(5)}_{ε,n} } ≥ 1 − ε, where

    A^{(5)}_{ε,n} = { X : sup_{β∈U_{β0}} ‖ (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} − Σ ‖_2 ≤ ε/2 }.

Let λ0 be the smallest eigenvalue of Σ, and 0 < ε < 2λ0/3. For Lemma 5.1.2 through Theorem 5.1.1, we restrict our discussion to X ∈ A+_ε, where A+_ε = A*_ε ∩ A^{(5)}_{ε,n} and P{A+_ε} > 1 − 2ε.

The following lemma states the uniqueness of t = t(β) for β ∈ U_{β0} and X ∈ A+_ε. The proof proceeds by contradiction using Rolle's theorem.

Lemma 5.1.2 Under the assumptions of Lemma 5.1.1 and (A.11), for X ∈ A+_ε and any fixed β in the set U_{β0} defined in (5.18), if there exists a solution t to the equation g(t, X, β) = 0, then the solution must be unique a.s., where

    g(t, X, β) = (1/n) Σ_{i=1}^n m(X_i, β) / (1 + t^tr m(X_i, β)).

Note that the unique solution of (5.12), denoted by p̂_i in (5.16), must lie in (0, 1); therefore 1 + t^tr m(X_i, β) > 1/n for all i. Define

    D_β(X) = { t : for all i, 1 + t^tr m(X_i, β) > 1/n }.  (5.19)

Our goal is to show that for any fixed β ∈ U_{β0}, there is a unique solution t(β) to the maximization problem (5.12).

Theorem 5.1.1 Under the assumptions of Lemma 5.1.1, for any β ∈ U_{β0} and any X ∈ A+_ε, there exists a unique solution t(β) ∈ D_β(X) to the maximization problem (5.12).

5.1.2 Discontinuous Criterion Function

When the criterion function m(x, β) is no longer continuous with respect to β, many of the methods we used in the continuous case no longer apply, for example, Rolle's theorem in Lemma 5.1.2 and the continuation method in Theorem 5.1.1.
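Equations (5.16)–(5.17) translate directly into a small numerical routine: a damped Newton iteration for the Lagrange vector t, kept inside the region D_β of (5.19), followed by the weights p̂_i. The sketch below is illustrative only, assuming a generic array m of estimating-function values rather than any specific model.

```python
import numpy as np

def el_lagrange_t(m, max_iter=50, tol=1e-10):
    """Solve (1/n) sum_i m_i / (1 + t'm_i) = 0 for the Lagrange vector t
    by damped Newton, backtracking to stay in D_beta = {t : 1 + t'm_i > 1/n},
    then return (t, p_hat) with p_hat_i = 1 / (n (1 + t'm_i))."""
    m = np.asarray(m, float)
    n, p = m.shape
    t = np.zeros(p)                                # t = 0 is always feasible
    for _ in range(max_iter):
        w = 1.0 + m @ t                            # 1 + t'm_i
        g = (m / w[:, None]).mean(axis=0)          # the criterion g(t, X, beta)
        if np.linalg.norm(g) < tol:
            break
        # Jacobian of g: -(1/n) sum_i m_i m_i' / (1 + t'm_i)^2, negative definite
        H = -(m[:, :, None] * m[:, None, :] / (w**2)[:, None, None]).mean(axis=0)
        step = np.linalg.solve(H, g)
        t_new, a = t - step, 1.0
        while np.min(1.0 + m @ t_new) <= 1.0 / n:  # backtrack into D_beta
            a /= 2.0
            t_new = t - a * step
        t = t_new
    p_hat = 1.0 / (n * (1.0 + m @ t))
    return t, p_hat

rng = np.random.default_rng(0)
m = rng.normal(size=(100, 2)) + 0.1                # slightly off-center toy sample
t, p_hat = el_lagrange_t(m)
assert abs(p_hat.sum() - 1.0) < 1e-6               # weights form a probability vector
assert np.linalg.norm(p_hat @ m) < 1e-6            # EL constraint sum_i p_i m_i = 0
```

That the weights automatically sum to 1 at the root mirrors the identity used to eliminate λ in (5.15): at a solution of (5.17), t^tr g(t) = 0 forces Σ_i 1/(1 + t^tr m_i) = n.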
Therefore we seek different ways to obtain the existence and uniqueness of solutions to maximization problem (5.1) for β in some neighbourhood of β0. We prove a more general version of the "zero in the convex hull" theorem of the previous section; namely, we demonstrate that the statement is true not only for the convex hull Conv(β0), but also for Conv(β) with probability approaching 1, when β belongs to some neighbourhood of β0 to be specified, i.e.,

    0 ∈ Conv(β),  for β ∈ U*_{β0}.  (5.20)

If we can prove (5.20), then the maximization problem (5.1) is guaranteed to have a unique solution for any β ∈ U*_{β0}.

To reach this goal, instead of assigning each m(X_i, β) the constant probability mass 1/n as in Lemma 5.1.1, we construct a random probability vector as follows. Let W_i, i = 1, …, n, be i.i.d. random variables that follow an exponential distribution with E(W_1) = 1, and assume that {W_i}_{i=1}^n is independent of {X_i}_{i=1}^n. Let

    q(x, β) = dP_β(x)/dP_{β0}(x)  (5.21)

be the Radon-Nikodym derivative of P_β with respect to P_{β0}, and define

    W*_i = W_i q(X_i, β),  i = 1, …, n.  (5.22)

Then

    E{W*_1} = ∫∫ w (dP_β(x)/dP_{β0}(x)) dP_{β0}(x) dP_W(w) = ∫ { ∫ dP_β(x) } w dP_W(w) = ∫ w dP_W(w) = 1,

and

    E{W*_1 | X_1} = E{W_1 q(X_1, β) | X_1} = q(X_1, β).  (5.23)

Define V = (V_1, …, V_n) to be a vector in the simplex ∆_n defined in (5.72), where

    V_i = W*_i / Σ_{i=1}^n W*_i.  (5.24)

We would like to show that 0 ∈ int(Conv_n(β)) with the weights in (5.24). Assume that

(A.12) Σ2(β, β0) and Σ3(β, β0) are two continuous functions given by

    Σ2(β, β0) = E_{β0}{ q^2(X_1, β) m(X_1, β)^{⊗2} },  (5.25)
    Σ3(β, β0) = E_{β0}{ q^3(X_1, β) ‖m(X_1, β)‖^3 },  (5.26)

with Σ2(β0, β0) = E_{β0}{ m(X_1, β0)^{⊗2} } and Σ3(β0, β0) = E_{β0}{ ‖m(X_1, β0)‖^3 }.

Since Σ2(β, β0) and Σ3(β, β0) are both continuous with respect to β, we define

    U*_{β0} ≡ { β : Σ2(β, β0) is positive definite, and Σ3(β, β0) is bounded }.  (5.27)

Now we are ready to establish the "zero in the convex hull" theorem for discontinuous estimating functions.
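The construction (5.21)–(5.24) is straightforward to simulate. The sketch below assumes, purely for illustration, the location family P_β = N(β, 1) with β0 = 0, so that the Radon-Nikodym derivative is q(x, β) = exp(βx − β²/2); the tilted weights W*_i = W_i q(X_i, β) then have mean 1, and the normalized vector V lies in the simplex ∆_n.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 5000, 0.3
X = rng.normal(size=n)                     # X_i ~ P_beta0 = N(0, 1)
W = rng.exponential(scale=1.0, size=n)     # W_i ~ Exp(1), E W_1 = 1, W independent of X
q = np.exp(beta * X - beta**2 / 2.0)       # q(X_i, beta) for the assumed N(beta,1) family
W_star = W * q                             # W*_i = W_i q(X_i, beta), so E W*_1 = 1
V = W_star / W_star.sum()                  # random probability vector (5.24)

assert abs(V.sum() - 1.0) < 1e-9           # V lies in the simplex
assert (V > 0).all()
assert abs(W_star.mean() - 1.0) < 0.1      # Monte Carlo check of E W*_1 = 1
```

Because E{W*_1 | X_1} = q(X_1, β), averaging m(X_i, β) with these random weights effectively re-centers the sample under P_β, which is exactly why the convex-hull argument carries over to a whole neighbourhood of β0.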
The following lemma is a generalization of Lemma 5.1.1, which essentially states that 0 ∈ Conv(β) for all β ∈ U*_{β0}.

Lemma 5.1.3 Assume that (5.4)–(5.6) in Lemma 5.1.1 and (A.12) are satisfied. For β ∈ U*_{β0} defined in (5.27) and any constant K > 0, let X = {X_i}_{i=1}^∞, and let t0 be a unit vector in R^p.

(a) Let λ2(β, β0) be the smallest eigenvalue of Σ2(β, β0). Define

    B^{(1)}_{K,n} = { X : inf_{‖t0‖=1} (1/√n) Σ_{i=1}^n t0^tr q(X_i, β) m(X_i, β) ≥ −K },  (5.28)
    B^{(2)}_{ε,n} = { X : ‖ (1/n) Σ_{i=1}^n (q(X_i, β) m(X_i, β))^{⊗2} − Σ2(β, β0) ‖_2 ≤ ε },  (5.29)
    B^{(3)}_n = { X : (1/n) Σ_{i=1}^n ‖q(X_i, β) m(X_i, β)‖^3 ≤ 2 · Σ3(β, β0) },  (5.30)
    B^{(4)}_n = { X : (1/n) Σ_{i=1}^n (q(X_i, β) m(X_i, β))^{⊗2} ≥ (1/2) λ2(β, β0) I }.  (5.31)

Then for any ε, there exist K_ε and N_ε such that

(i) for any n ≥ N_ε, P{ B^{(1)}_{K_ε,n} } ≥ 1 − ε;

(ii) the following limits are all identical to 1:

    lim_{n1→∞} P{ ∩_{n≥n1} B^{(2)}_{ε,n} } = lim_{n1→∞} P{ ∩_{n≥n1} B^{(3)}_n } = lim_{n1→∞} P{ ∩_{n≥n1} B^{(4)}_n }.

(b) Let Conv_n(β) be the convex hull of the m(X_i, β). For any ε > 0, there exist K_ε and N_ε such that P{B*_ε} ≥ 1 − ε, where

    B*_ε = ∩_{n≥N_ε} { B^{(1)}_{K_ε,n} ∩ B^{(2)}_{ε,n} ∩ B^{(3)}_n ∩ B^{(4)}_n },  (5.32)

and for n > N_ε and X ∈ B*_ε, P{ 0 ∈ int(Conv_n(β)) | X ∈ B*_ε } ≥ 1 − ε.

5.2 Extended Estimating Equation

In this section, we extend the conclusions of the previous section to the case where the constraint of the EL maximization is an EEE. In Section 5.2.1, we prove the uniqueness of solutions to the EL maximization in (5.1) for m_n(x, x, β) that is continuous with respect to β. In Section 5.2.2, we discuss the case when m_n(x, x, β) is no longer continuous with respect to β.

5.2.1 Continuous Criterion Function

We can make the same conclusion for m_n(X_i, X, β0) if we can show that (a) in Lemma 5.1.1 holds. Note that the conclusions for (5.7) and (5.9) are guaranteed by the asymptotic normality of n^{−1/2} Σ_{i=1}^n m_n(X_i, X, β0) and the order of Σ_{i=1}^n ‖m_n(X_i, X, β0)‖^3, respectively, which have been proved in Lemma 4.2.2 and Proposition 4.2.2.
As for the conclusions regarding (5.8) and (5.10), it suffices to show that

    (1/n) Σ_{i=1}^n m_n(X_i, X, β0)^{⊗2} →_P Σ.  (5.33)

We discuss the conditions for (5.33) in the following proposition, then state and prove a result for continuous m_n(x, x, β) that is parallel to Lemma 5.1.1.

Proposition 5.2.1 Let Y_i, i = 1, 2, be random variables i.i.d. as X_1 and independent of X = (X_1, …, X_n). Under assumptions (A.2), (A.4), (A.9), and (A.10),

    (1/n) Σ_{i=1}^n m_n(X_i, X, β0)^{⊗2} →_P Σ,  (5.34)

where

    Σ = E { Q(Y_1, β0)^2 [ C(Y_1) − E{C(Y_2) k(Y_1, Y_2, β0) | Y_1} / E{k(Y_1, Y_2, β0) | Y_1} ]^{⊗2} }.  (5.35)

Next, we present a result parallel to Lemma 5.1.1 for extended estimating equations.

Lemma 5.2.1 For i.i.d. random variables X_n = (X_1, …, X_n), assume that the estimating function m_n(X_i, X, β0) satisfies (A.2), (A.9) and (A.10). For any constant K > 0, let X∞ = {X_i}_{i=1}^∞, and let t0 be a unit vector in R^p.

(a) Let λ0 be the smallest eigenvalue of Σ, and let I ∈ R^{p×p} be the identity matrix. Define

    C^{(1)}_{K,n} = { X∞ : inf_{‖t0‖=1} (1/√n) Σ_{i=1}^n t0^tr m_n(X_i, X, β0) ≥ −K },  (5.36)
    C^{(2)}_{ε,n} = { X∞ : ‖ (1/n) Σ_{i=1}^n m_n(X_i, X, β0)^{⊗2} − Σ ‖_2 ≤ ε },  (5.37)
    C^{(3)}_n = { X∞ : (1/n) Σ_{i=1}^n ‖m_n(X_i, X, β0)‖^3 ≤ C },  (5.38)
    C^{(4)}_n = { X∞ : (1/n) Σ_{i=1}^n m_n(X_i, X, β0)^{⊗2} ≥ (1/2) λ0 I },  (5.39)

where C in (5.38) is a constant in R greater than E‖m_n(X_1, X, β0)‖^3. Then for any ε, there exist K_ε and N_ε such that

(i) for any n ≥ N_ε, P{ C^{(1)}_{K_ε,n} } ≥ 1 − ε;

(ii) the following limits are all identical to 1:

    lim_{n1→∞} P{ ∩_{n≥n1} C^{(2)}_{ε,n} } = lim_{n1→∞} P{ ∩_{n≥n1} C^{(3)}_n } = lim_{n1→∞} P{ ∩_{n≥n1} C^{(4)}_n }.

(b) Let Conv_n(β0) be the convex hull of the m_n(X_i, X, β0). For any ε > 0, there exist K_ε and N_ε such that P{C*_ε} ≥ 1 − ε, where

    C*_ε = ∩_{n≥N_ε} { C^{(1)}_{K_ε,n} ∩ C^{(2)}_{ε,n} ∩ C^{(3)}_n ∩ C^{(4)}_n },  (5.40)

and for n > N_ε and X∞ ∈ C*_ε, P{ 0 ∈ int(Conv_n(β0)) } ≥ 1 − ε.

Now, let us go back to the EL problem, which for any fixed β is essentially the following maximization problem:

    max_p Π_{i=1}^n p_i,  where p = (p_1, …, p_n),
    subject to p_i ≥ 0, Σ_{i=1}^n p_i = 1, Σ_{i=1}^n p_i m_n(X_i, X, β) = 0.  (5.41)

Note that for a given β, a unique maximum exists provided that 0 is in the interior of Conv(β), the convex hull of m_n(X_1, X, β), …, m_n(X_n, X, β). Since by Lemma 5.2.1, for any ε, there exists N_ε such that for any n > N_ε and X∞ ∈ C*_ε, P{ 0 ∈ int(Conv(β0)) } > 1 − ε, we know that for n > N_ε there exists a unique solution, denoted by p̂_0, to the maximization problem in (5.41) with β = β0, with probability greater than 1 − ε.

To know that the unique maximizer is a calculus maximizer, we need to know that there is a ball of dimension n − 1 − p for the p's within which to take derivatives, where p is the dimension of m_n(X_1, X, β). Let 1 be a 1 × n vector with all entries equal to 1, let 0 be a p × 1 vector with all entries equal to 0, let M_0 = (m_n(X_1, X, β), …, m_n(X_n, X, β)) be a p × n matrix, and let M = (1^tr, M_0^tr)^tr be a (p + 1) × n matrix. Suppose there exists one solution to the maximization problem (5.41), denoted by p̂_0. Then p̂_0 belongs to

    P ≡ { p : M p = (1, 0^tr)^tr, all p_i > 0, Σ_{i=1}^n p_i = 1 }.  (5.42)

For any v in the null space of M and α ∈ R, define

    p = (1/(1 + α v · 1)) · (p̂_0 + α v);

we know that M p = (1, 0^tr)^tr, hence for small α such that the entries of p are all positive, p belongs to P. Furthermore, v is in the null space of M, and rank(M) = p + 1, so dim(P) = n − p − 1. Therefore there is a relatively open set P of dimension n − p − 1 within which we can take derivatives.

Next, we give the form of the unique solution to (5.41) when β = β0 and n > N_ε, using Lagrange multipliers λ ∈ R and t ∈ R^p. Define G(p, λ, t) as

    G = Σ_{i=1}^n ln p_i − λ ( Σ_{i=1}^n p_i − 1 ) − n t^tr Σ_{i=1}^n p_i m_n(X_i, X, β0).

To maximize the concave function G, differentiate G with respect to p_i, i = 1, …, n, then set the derivatives to zero:

    ∂G/∂p_i = 1/p_i − λ − n t^tr m_n(X_i, X, β0) = 0,  i = 1, …, n.  (5.43)

Multiply the equations above by p_i, then add them together:

    n − λ Σ_{i=1}^n p_i − n t^tr Σ_{i=1}^n p_i m_n(X_i, X, β0) = 0.  (5.44)

By the constraints in (5.41), (5.44) indicates that n − λ = 0, so λ = n. Therefore, from equation (5.43),

    p̂_i = 1 / ( n { 1 + t0^tr m_n(X_i, X, β0) } ).  (5.45)

Since Σ_{i=1}^n p̂_i m_n(X_i, X, β0) = 0, the vector t must solve the following equation:

    (1/n) Σ_{i=1}^n m_n(X_i, X, β0) / (1 + t^tr m_n(X_i, X, β0)) = 0.  (5.46)

Important Event  In Lemma 5.2.1, we proved that for any ε, there exists N_ε such that P{C*_ε} ≥ 1 − ε, where

    C*_ε = ∩_{n≥N_ε} { C^{(1)}_{K_ε,n} ∩ C^{(2)}_{ε,n} ∩ C^{(3)}_n ∩ C^{(4)}_n }.

Furthermore, let

    Σ(β, β0) = E { Q(Y_1, β)^2 [ C(Y_1) − E{C(Y_2) k(Y_1, Y_2, β) | Y_1} / E{k(Y_1, Y_2, β) | Y_1} ]^{⊗2} }.  (5.47)

Since Σ(β, β0) is continuous, and by (5.35),

    Σ = E { Q(Y_1, β0)^2 [ C(Y_1) − E{C(Y_2) k(Y_1, Y_2, β0) | Y_1} / E{k(Y_1, Y_2, β0) | Y_1} ]^{⊗2} } = Σ(β0, β0)

is positive definite, there exists U_{β0}, a neighbourhood of β0, such that for all β in U_{β0} the matrix Σ(β, β0) is positive definite, i.e.,

    U_{β0} ≡ { β : Σ(β, β0) is positive definite }.  (5.48)

Then, consider the following class of functions:

    G ≡ { m_n(X_i, X, β) : β ∈ U_{β0} }.

By Example 19.7 on page 271 of [41] and Lemma 4.2.1, we know that the family G is a Donsker class, which implies that

    sup_{β∈U_{β0}} ‖ (1/n) Σ_{i=1}^n m_n(X_i, X, β)^{⊗2} − Σ(β, β0) ‖_2 → 0,  a.s.,

therefore P{ C^{(5)}_{ε,n} } ≥ 1 − ε, where

    C^{(5)}_{ε,n} = { X : sup_{β∈U_{β0}} ‖ (1/n) Σ_{i=1}^n m_n(X_i, X, β)^{⊗2} − Σ ‖_2 ≤ ε/2 }.

Let λ0 be the smallest eigenvalue of Σ, and 0 < ε < λ0. For Lemma 5.2.2 through Theorem 5.2.1, we restrict our discussion to X∞ ∈ C+_ε, where

    C+_ε = C*_ε ∩ C^{(5)}_{ε,n}  (5.49)

and P{C+_ε} > 1 − 2ε.
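For concreteness, the EL constraint Σ_i p_i m_n(X_i, X, β) = 0 in (5.41) uses summands of the ratio-of-sums form m_n(X_i, X, β) = Q(X_i, β){C(X_i) − Σ_j C(X_j)k(X_i, X_j, β)/Σ_j k(X_i, X_j, β)}. A generic sketch of this construction follows, with placeholder choices of Q, C and k (an at-risk style indicator) standing in for the model-specific choices such as (5.61).

```python
import numpy as np

def eee_criterion(X, beta, Q, C, k):
    """Evaluate the EEE summands
        m_n(X_i, X, beta) = Q(X_i, beta) * ( C(X_i)
            - sum_j C(X_j) k(X_i, X_j, beta) / sum_j k(X_i, X_j, beta) )
    for all i; returns an (n, p) array.  Q, C, k are user-supplied."""
    n = len(X)
    K = np.array([[k(X[i], X[j], beta) for j in range(n)] for i in range(n)])
    Cx = np.array([C(x) for x in X], float)          # shape (n, p)
    num = K @ Cx                                     # sum_j C(X_j) k(X_i, X_j, beta)
    den = K.sum(axis=1, keepdims=True)               # sum_j k(X_i, X_j, beta)
    Qx = np.array([Q(x, beta) for x in X], float)
    return Qx[:, None] * (Cx - num / den)

# toy illustration: indicator kernel (so den >= 1 since j = i is included)
rng = np.random.default_rng(0)
X = rng.normal(size=50)
m = eee_criterion(X, 0.0,
                  Q=lambda x, b: 1.0,
                  C=lambda x: np.array([x]),
                  k=lambda x, y, b: float(y >= x))   # at-risk style indicator
assert m.shape == (50, 1)
assert np.isfinite(m).all()
```

The resulting array can be fed directly to the Lagrange-multiplier routine for (5.46); the only structural difference from the CEE case is that each summand depends on the whole sample through the ratio of sums.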
Lemma 5.2.2 Under the assumptions of Lemma 4.2.1 and Lemma 5.1.1, for X∞ ∈ C+_ε and any β in the set U_{β0} defined in (5.48), if there exists a solution t to the equation g(t, X, β) = 0, then the solution must be unique, where

    g(t, X, β) = (1/n) Σ_{i=1}^n m_n(X_i, X, β) / (1 + t^tr m_n(X_i, X, β)).  (5.50)

Note that the unique solution of (5.41), denoted by p̂_i in (5.45), must lie in (0, 1); therefore 1 + t^tr m_n(X_i, X, β0) > 1/n for all i. Define

    D_β(X) = { t : for all i, 1 + t^tr m_n(X_i, X, β) > 1/n }.  (5.51)

Our goal is to show that for any fixed β ∈ U_{β0}, there is a unique solution t(β) to the maximization problem (5.41).

Theorem 5.2.1 Under the assumptions of Lemma 5.2.1, for any β ∈ U_{β0} and X ∈ C+_ε, there exists a unique solution t(β) to the maximization problem (5.41).

5.2.2 Discontinuous Criterion Function

Similar to the strategy used in the proof under the CEE setting, when the criterion function m_n(X_i, X, β) is no longer continuous with respect to β, we use random variables instead of the fixed constant 1/n as the probability mass assigned to each m_n(X_i, X, β), i = 1, …, n. More specifically, let P_{β0}(x) be the cumulative distribution function of X_1, and let q(x, β) be the Radon-Nikodym derivative of P_β(x) with respect to P_{β0}(x), i.e.,

    q(x, β) = dP_β(x)/dP_{β0}(x).  (5.52)

Let Y_1 ∼ P_{β0} be independent of X∞ = {X_i}_{i=1}^∞, and define

    V(X_i, β) = Q(X_i, β) { C(X_i) − E{C(Y_1) k(X_i, Y_1, β) | X_i} / E{k(X_i, Y_1, β) | X_i} }.  (5.53)

We claim that E{V(X_i, β0)} = 0 because

    0 = E{m_n(X_i, X, β0)} = lim_{n→∞} E{m_n(X_i, X, β0)} = E{ lim_{n→∞} m_n(X_i, X, β0) } = E{ E{ lim_{n→∞} m_n(X_i, X, β0) | X_1 } } = E{V(X_1, β0)},  (5.54)

therefore

    E_{β0}{ q(X_1, β) V(X_1, β) } = ∫ V(x, β) · (dP_β(x)/dP_{β0}(x)) dP_{β0}(x) = 0.

Assume that

    E_{β0}{ q^2(X_1, β) } < ∞,  for all β ∈ U_{β0};  (5.55)

then by the CLT, for any t0 ∈ R^p with ‖t0‖ = 1 and Σ*(β, β0) = E_{β0}{ q^2(X_1, β) V(X_1, β)^{⊗2} },

    (1/√n) Σ_{i=1}^n q(X_i, β) t0^tr V(X_i, β) →_D N(0, t0^tr Σ*(β, β0) t0).  (5.56)

At this point, we would like to point out two facts.
First, by (5.54), V(X_i, β) is eligible as a criterion function of a CEE, and based on the discussion in the previous sections, there exists U_{β0} such that 0 ∈ Conv(β) for all β ∈ U_{β0}. Another heuristic observation is that m_n(X_i, X, β) and V(X_i, β) are "close" in some sense, because they differ only in the quotient term, and by the LLN and Slutsky's lemma, the quotient term in m_n(X_i, X, β) converges in probability to that in V(X_i, β). However, in order to pass the desirable feature of V(X_i, β) to m_n(X_i, X, β), for any fixed β ∈ U_{β0} we need ‖m_n(x, X, β) − V(x, β)‖ to be bounded uniformly in x at the n^{−1/2} rate, i.e.,

    sup_{x∈X} ‖m_n(x, X, β) − V(x, β)‖ = O_P(n^{−1/2}).  (5.57)

By Example 19.11 in [41] by van der Vaart, if we know that

(i) the function k(x, ·, ·) has bounded variation;

(ii) for any ε > 0 and δ_ε > 0, letting X_ε ≡ { x : E{k(x, X, β)} > δ_ε }, for any constant c_ε > 0 there exists N_ε such that for any n > N_ε,

    P{ ‖ (1/√n) Σ_{i: X_i ∈ X_ε^c} m_n(X_i, X, β) ‖ ≥ c_ε } < ε,  (5.58)

then (5.57) is guaranteed.

Since, up to now, the broadest class of semiparametric models for which the structural parameter can be defined via an EEE is the ϕ-transformation model discussed in Chapter 3, for the rest of this chapter we restrict our attention to m_n(X_i, X, β) under the ϕ-transformation model assumptions.

Recall that under the ϕ-transformation model assumption, k(x, ·, ·) is an indicator function, therefore (i) above is satisfied. Let us now consider a martingale assumption that guarantees (ii). Assume that X_i = (T_i, C_i, Z_i), where the lifetime T_i and the right-censoring variable C_i are independent conditional on the covariate Z_i. Recall that under the ϕ-transformation model assumption,

    T_i = ϕ(ε_i, β0^tr Z_i),  (5.59)

where ϕ(x, ·) is strictly increasing, and ε_i is the residual with distribution function F(x) and hazard rate λ(x). Let ζ_i be the residual when the lifetime is censored, i.e., C_i = ϕ(ζ_i, β0^tr Z_i).
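For the location-shift special case ϕ(u, a) = u + a, the residual-scale estimating function S_n(β) of (5.62) below reduces to a familiar log-rank-type AFT statistic computed on the residuals e_i = V_i − β^tr Z_i. A simulation sketch follows; the Gumbel errors, the censoring mechanism, and the default C(z) = z are all illustrative assumptions, not choices made in the text.

```python
import numpy as np

def aft_logrank_Sn(V, delta, Z, beta, C=None):
    """Location-shift case phi(u, a) = u + a: with residuals e_i = V_i - beta'Z_i,
    the at-risk kernel is k(X_i, X_j, beta) = I{e_j >= e_i}, and
        S_n(beta) = sum_i delta_i * ( C(Z_i) - Cbar(e_i, beta) ),
    where Cbar averages C over the residual-scale risk set."""
    if C is None:
        C = lambda z: z                       # illustrative default weight
    e = V - Z @ beta                          # residuals of the observed times
    Cz = np.array([C(z) for z in Z], float)
    S = np.zeros(Cz.shape[1])
    for i in range(len(V)):
        if delta[i]:                          # only uncensored times contribute terms
            at_risk = e >= e[i]               # residual-scale at-risk indicator
            S += Cz[i] - Cz[at_risk].mean(axis=0)
    return S

rng = np.random.default_rng(0)
n, beta0 = 400, np.array([0.5])
Z = rng.normal(size=(n, 1))
T = Z @ beta0 + rng.gumbel(size=n)            # lifetimes under the assumed model
Cen = Z @ beta0 + rng.gumbel(size=n) + 0.5    # censoring, independent of T given Z
V, delta = np.minimum(T, Cen), (T <= Cen)

S_true = aft_logrank_Sn(V, delta, Z, beta0)
S_off = aft_logrank_Sn(V, delta, Z, np.array([2.0]))
assert abs(S_true[0]) < abs(S_off[0])         # roughly centered at the true beta0
```

The comparison at the end reflects the martingale property E_{β0}{S_n(β0)} = 0: evaluated at the true parameter the statistic fluctuates near zero, while a misspecified β induces a systematic drift of order n.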
Using the classic method in [21], construct the compensated counting-process martingale

    M_i{ϕ(u, β0^tr Z_i)} = N_i{ϕ(u, β0^tr Z_i)} − ∫_{−∞}^u λ(x) Y_i{ϕ(x, β0^tr Z_i)} dx,  (5.60)

which is a martingale with respect to the measure P_{β0} and the filtration

    F_n(u) = σ( X_i, N_i(ϕ(s, β0^tr Z_i)), Y_i(ϕ(s, β0^tr Z_i)) : s ≤ u, i = 1, …, n ),

where V_i = min(T_i, C_i) is the observed time, ∆_i = I{T_i ≤ C_i} is the non-censoring indicator, N_i(x) = I{T_i ≤ x, ∆_i = 1}, and Y_i(x) = I{V_i ≥ x} is the at-risk indicator. With

    Q(X_i, β) = ∆_i,  C(X_i) = C(Z_i),  k(X_i, X_j, β) = Y_j(ϕ(ε_i, β0^tr Z_j)),  (5.61)

we can write

    S_n(β0) = Σ_{i=1}^n m_n(X_i, X, β0)
            = Σ_{i=1}^n Q(X_i, β) { C(X_i) − Σ_{j=1}^n C(X_j) k(X_i, X_j, β) / Σ_{j=1}^n k(X_i, X_j, β) }  (5.62)
            = Σ_{i=1}^n ∆_i { C(Z_i) − Σ_{j=1}^n C(Z_j) Y_j{ϕ(ε_i, β0^tr Z_j)} / Σ_{j=1}^n Y_j{ϕ(ε_i, β0^tr Z_j)} }.

By (5.60), together with the model assumption (5.59), on the residual scale,

    S_n(β0) = Σ_{i=1}^n ∫ dN_i{ϕ(u, β0^tr Z_i)} { C(Z_i) − C̄(u, β0) }
            = Σ_{i=1}^n ∫ dM_i{ϕ(u, β0^tr Z_i)} { C(Z_i) − C̄(u, β0) },  (5.63)

where

    C̄(u, β0) = Σ_{j=1}^n C(Z_j) Y_j{ϕ(u, β0^tr Z_j)} / Σ_{j=1}^n Y_j{ϕ(u, β0^tr Z_j)}.

Note that E_{β0}{S_n(β0)} = 0 due to the martingale property. Recall that the predictable variation process satisfies 〈dM_i, dM_i〉 = λ(u) Y_i{ϕ(u, β0^tr Z_i)} du, as mentioned in [1]; hence

    Var{S_n(β0)} = E { Σ_{i=1}^n ∫ { C(Z_i) − C̄(u, β0) }^2 λ(u) P{ V_i ≥ ϕ(u, β0^tr Z_i) | Z_i } du }.

Under assumption (A.2), ‖C(x)‖ < b for all x ∈ X, therefore

    Var{ n^{−1/2} S_n(β0) } ≤ (4b^2/n) Σ_{i=1}^n ∫ λ(u) S(u) du ≤ 4b^2.

Now that we have established (5.57), for any ε > 0 there exist C_ε and N_ε such that for any n > N_ε,

    sup_x ‖m_n(x, X, β) − V(x, β)‖ ≤ C_ε/√n,  with probability no less than 1 − ε,

therefore P{ C^{(6)}_{ε,n} } > 1 − ε, where

    C^{(6)}_{ε,n} ≡ { X : sup_x ‖m_n(x, X, β) − V(x, β)‖ ≤ C_ε/√n }.  (5.64)

Let q(x, β) be the Radon-Nikodym derivative defined in (5.52); as we assumed in (5.55), E{q(X_1, β)^2} < ∞ for any β ∈ U_{β0}. Then by the law of large numbers, for any fixed β ∈ U_{β0} we know that P{ C^{(7)}_{ε,n} } > 1 − ε, where

    C^{(7)}_{ε,n} ≡ { X : (1/n) Σ_{i=1}^n q(X_i, β)^2 < 2 E_{β0}{ q(X_1, β)^2 } }.  (5.65)

Now we can update the definition of C+_ε in (5.49) by including C^{(6)}_{ε,n} and C^{(7)}_{ε,n}, namely

    C+_ε ≡ C*_ε ∩ ( ∩_{i=5}^7 C^{(i)}_{ε,n} ),  (5.66)

and P{C+_ε} > 1 − 4ε for n > N_ε.

Theorem 5.2.2 Under assumption (5.57), for any β ∈ U_{β0} and X∞ ∈ C+_ε, 0 belongs to int(Conv(β)).

5.3 Some Proofs

5.3.1 Proofs Under the CEE Setting

Let us state two widely used lemmas without proof. The Lyapunov Central Limit Theorem can be found in Chapter 27 of [6], and the Hyperplane Separation Theorem in Chapter 2 of [7].

Lemma 5.3.1 (Lyapunov Central Limit Theorem) Suppose Z_1, …, Z_n are independent random variables, each with finite expected value μ_i and variance σ_i^2. Define

    s_n^2 = Σ_{i=1}^n σ_i^2.

If for some δ > 0 the Lyapunov condition

    (1/s_n^{2+δ}) Σ_{i=1}^n E[ |Z_i − μ_i|^{2+δ} ] → 0  (5.67)

as n → ∞ is satisfied, then the normalized sum of the (Z_i − μ_i) converges in distribution to a standard normal random variable as n goes to infinity:

    (1/s_n) Σ_{i=1}^n (Z_i − μ_i) →_D N(0, 1).  (5.68)

Lemma 5.3.2 (Hyperplane Separation Theorem) Let A and B be two disjoint nonempty convex sets. If A is open, then there exist a nonzero vector v and a real number c such that 〈x, v〉 > c and 〈y, v〉 ≤ c for all x in A and y in B.

In general, for two matrices A and B in R^{p×p},

    A > B ⇔ A − B is positive definite.

Proof of Lemma 5.1.1

First, let us show that (a) is true. Under assumptions (5.4)–(5.6), by the law of large numbers, we know that (ii) is true. Note that by the multivariate central limit theorem,

    S_n = (1/√n) Σ_{i=1}^n m(X_i, β0) →_D S∞ ∼ N(0, Σ).  (5.69)

Since S∞ follows a multivariate normal distribution, for any ε > 0 and any unit vector t0 ∈ R^p, there exists K_ε such that

    P{ inf_{t0} t0^tr S∞ < −K_ε } < ε/2.  (5.70)

By (5.69), there exists N_ε such that for any n > N_ε,

    | P{ inf_{t0} t0^tr S_n < −K_ε } − P{ inf_{t0} t0^tr S∞ < −K_ε } | < ε/2.  (5.71)

Combining (5.70) and (5.71), we conclude that P{ inf_{t0} t0^tr S_n < −K_ε } < ε, hence the conclusion in (i) is also true.

Next, we prove (b) using Lemma 5.3.2. Let Conv_n(β0) be the convex hull of the m(X_i, β0); we want to show that the two sets

    A = {0},  B = int(Conv_n(β0)) = { Σ_{i=1}^n v_i m(X_i, β0) : v ∈ ∆_n }

are not separated, where

    ∆_n = { v = (v_1, …, v_n) : Σ_{i=1}^n v_i = 1, v_i ∈ (0, 1) }.  (5.72)

By Lemma 5.3.2, the statement 0 ∈ int(Conv_n(β0)) is equivalent to:

    for all t0 ∈ R^p with ‖t0‖ = 1 and all a ≤ 0,
    there exists z ∈ int(Conv_n(β0)) such that t0^tr z > a.  (5.73)

To prove (5.73), we will show in the following paragraphs that for any ε > 0, nonpositive constant a = −c/√n and vector t0 ∈ R^p with ‖t0‖ = 1, there exists N_ε such that for n > N_ε, given X ∈ A*_ε,

    inf_{t0} P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ −c/√n | X } > 0,  for X ∈ A*_ε,  (5.74)

where we consider V = v as a continuously distributed random vector in the simplex ∆_n, and z in (5.73) for each t0 is a value Σ_{i=1}^n V_i(ω) m(X_i, β0) for ω in the event where the probability in (5.74) is positive. Note that (5.74) is true if we can prove the case a = 0, namely, for n > N_ε,

    inf_{t0} P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ 0 | X } > 0,  for X ∈ A*_ε.  (5.75)

We prove (5.75) by constructing V in the following way. Let W_i be i.i.d. random variables that follow an exponential distribution with mean equal to 1, and let V = (V_1, …, V_n) with

    V_i = W_i / Σ_{i=1}^n W_i.  (5.76)

Let W̄ = Σ_{i=1}^n W_i/n and let t0 be a unit vector in R^p; then the probability in (5.75) can be written as

    P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ 0 | X }
    = P{ (1/√n) t0^tr Σ_{i=1}^n W_i m(X_i, β0) ≥ 0 | X }
    = P{ (1/√n) t0^tr Σ_{i=1}^n m(X_i, β0) + (1/√n) t0^tr Σ_{i=1}^n (W_i − 1) m(X_i, β0) ≥ 0 | X }
    = P{ (1/√n) Σ_{i=1}^n t0^tr (W_i − 1) m(X_i, β0) ≥ −(1/√n) Σ_{i=1}^n t0^tr m(X_i, β0) | X }.  (5.77)

Combining (5.77) and conclusion (i) in (a), we obtain, for n > N_ε, any nonnegative constant K_ε, and X ∈ A*_ε,

    P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ 0 | X } ≥ P{ (1/√n) Σ_{i=1}^n t0^tr (W_i − 1) m(X_i, β0) ≥ K_ε | X }.  (5.78)

Next, we show that the term n^{−1/2} Σ_{i=1}^n t0^tr (W_i − 1) m(X_i, β0) in (5.78) satisfies (5.67) in Lemma 5.3.1, with the choice δ = 1, Z_i = t0^tr (W_i − 1) m(X_i, β0), and

    μ_i = E{ (W_i − 1) · t0^tr m(X_i, β0) | X_i } = 0;
    σ_i^2 = Var{ (W_i − 1) · t0^tr m(X_i, β0) | X_i } = t0^tr m(X_i, β0)^{⊗2} t0.

Let s_n^2 = Σ_{i=1}^n t0^tr m(X_i, β0)^{⊗2} t0; then the left-hand side of (5.67) becomes

    (1/s_n^3) Σ_{i=1}^n E{ |(W_i − 1) · t0^tr m(X_i, β0)|^3 | X_i } ≤ 2 · ( (1/n) Σ_{i=1}^n ‖m(X_i, β0)‖^3 ) · (1/√n) · (n/s_n^2)^{3/2}.  (5.79)

By Lemma 5.3.1, given X ∈ A*_ε, it follows that n^{−1/2} Σ_{i=1}^n t0^tr (W_i − 1) m(X_i, β0) is asymptotically normal, i.e., for any unit vector s ∈ R^p,

    n^{−1/2} Σ_{i=1}^n (W_i − 1) · s^tr m(X_i, β0) →_D N(0, s^tr Σ s),  with probability 1.  (5.80)

Note that by the strong law of large numbers, W̄ converges to E W_1 = 1 almost surely; therefore for n > N_ε, constant K_ε > 0, and X ∈ A*_ε,

    P{ inf_s (1/√n) Σ_{i=1}^n (W_i − 1) s^tr m(X_i, β0) ≥ K_ε | X } > 0.  (5.81)

Combining (5.77), (5.78) and (5.81), we get that for any n > N_ε, constant a = −c/√n, unit vector t0 ∈ R^p, and X ∈ A*_ε,

    P{ t0^tr Σ_{i=1}^n V_i m(X_i, β0) ≥ 0 | X } ≥ P{ inf_{s∈R^p: ‖s‖=1} (1/√n) Σ_{i=1}^n (W_i − 1) s^tr m(X_i, β0) ≥ K_ε | X } > 0,  (5.82)

and hence we complete the proof.  □

Lemma 5.3.3 (Implicit Function Theorem) Let f: R^{n+m} → R^m be a continuously differentiable function, and let R^{n+m} have coordinates (x, y). Fix a point (a, b) = (a_1, …, a_n, b_1, …, b_m) with f(a, b) = c, where c ∈ R^m. If the Jacobian matrix J_{f,y}(a, b) = [ (∂f_i/∂y_j)(a, b) ] is invertible, then there exist an open set U containing a, an open set V containing b, and a unique continuously differentiable function g: U → V such that { (x, g(x)) | x ∈ U } = { (x, y) ∈ U × V | f(x, y) = c }.

Proof of Lemma 5.1.2

For ε smaller than λ0, the smallest eigenvalue of Σ, on the event A+_ε, for any fixed β ∈ U_{β0},

    ‖ (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} − Σ(β, β0) ‖_2 < ε/2.

Since β ∈ U_{β0}, we know that Σ(β, β0) is positive definite, hence

    (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} is positive definite for β ∈ U_{β0}.  (5.83)

By contradiction, we can show that the conclusion is true for any fixed β ∈ U_{β0}. Suppose that for a fixed β ∈ U_{β0}, there exist distinct t1 and t2 such that g(t1, X, β) = g(t2, X, β) = 0. Then by Rolle's theorem from [15], there exists s ∈ (0, 1) such that for t3 = s t1 + (1 − s) t2, ∂g(t3, X, β)/∂s = 0, i.e.,

    (t1 − t2)^tr { (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} / (1 + t3^tr m(X_i, β))^2 } (t1 − t2) = 0,  (5.84)

which implies that

    (1/n) Σ_{i=1}^n m(X_i, β)^{⊗2} is singular,

contradicting (5.83).  □

Proof of Theorem 5.1.1

Define C in the following way:

    C ≡ { β* ∈ U_{β0} : ∃ r_{β*} > 0 such that for any β ∈ B_{r_{β*}}(β*), there exists t = t(β) such that g(t, X, β) = 0 },  (5.85)

where

    g(t, X, β) = (1/n) Σ_{i=1}^n m(X_i, β) / (1 + t^tr m(X_i, β)).  (5.86)

First, we show that any fixed β* such that the equation g(t, X, β*) = 0 has a solution t = t* belongs to C. By (5.86), g(t*, X, β*) = 0 implies that 0 belongs to int(Conv_n(β*)); consequently, there exists a unique solution to the maximization problem (5.12) with the choice β = β*. Since this unique solution is bounded between 0 and 1, i.e.,

    (1/n) · 1/(1 + t*^tr m(X_i, β*)) ∈ (0, 1)  for i = 1, …, n,

we know that t* ∈ D_{β*}(X). Since g(t*, X, β*) = 0 and ∂g(t, X, β)/∂t |_{(t*, β*)} is negative definite, by the implicit function theorem there exists r_{β*} such that for β ∈ B_{r_{β*}}(β*), the equation g(t, X, β) = 0 has a solution t = t(β).
Therefore t∗ ∈ C.β In particular, since we have shown in (5.16) that when β = β0, there exists t0 such that g(t0,X , β0) = 0, we know that the conclusion in the previous paragraph is true for there is β = β0, i.e., β0 ∈ C. Next, we show that C = Uβ0 , i.e., for any β∗ ∈ Uβ0 , β∗ also belongs to C. Let ρmax = sup {s : β0 + r(β∗ − β0) ∈ C, for all 0 < r < s} , (5.87) By contradiction, we can show that ρmax ≥ 1. Otherwise, suppose ρmax < 1. By equation (5.87), we know that β = β0 + r(β ∗ − β0) ∈ C, for all r < ρmax (5.88) Let {r ∞k}k=1 be a series of increasing positive numbers that are bounded by ρmax, and βk = β ∗ 0 + rk(β − β0), for 0 < rk < ρmax. (5.89) 110 Therefore by (5.88), βk ∈ C, which together with (5.85) - (5.87) implies for each βk, ∑n1 m(Xi, βk) 1 = 0 and 1 + ttr(βk)m(Xi, βk) ≥ (5.90) n 1 + t(βk)trm(Xi=1 i, βk) n Let βmax = lim β ∗ k = β0 + ρmax(β − β0) (5.91) k→∞ then by taking the limit of (5.90) as k goes to ∞, ∑n1 m(Xi, βmax) = 0. (5.92) n 1 + t(β trmax) m(Xi, βmax)i=1 Therefore 0 belongs to int (Convn(βmax)). Consequently, there exists a unique solu- tion to the maximizing problem (5.12) with β = βmax. Since the unique solution is in (0, 1), we know that t(βmax) ∈ Dβmax(X ). Similar to the previous proof, βmax ∈ C. By the definition of C, there exists rmax such that for all β ∈ Brmax(βmax), we know that t = t(β) solves g(t,X, β) = 0. Therefore β = β0 + r(β ∗ − β0) ∈ C, where r = ρmax + rmax/4 > ρmax contradicts the definition of ρmax, hence ρmax ≥ 1 and we conclude that β∗ ∈ C.  Proof of Lemma 5.1.3 First, let us show that (a) is true. By the LLN, we know that statement (ii) is true. Note that ∫ { } dPβ(x)E q(X1, β)m(X1, β) = m(x, β) dPβ dP (x) 0 (x) = 0, β0 by the multivariate central limit theorem, ∑n1 Dβ Sn = √ q(Xi, β)m(Xi, β) −−→0 S ∗∞ ∼ N(0,Σ (β, β0)). (5.93) n i=1 111 Since S∞ follows a multivariate normal distribution, for any ε > 0 and any unit vector t0 ∈ Rp, there exists Kε{such that } tr − εP inf t0 S∞ < Kε < . 
(5.94) t0 2 By (5.93), there∣ ex{ists Nε such that}for an{y n > Nε,∣∣∣ }P inf ttrS < −K − P inf ttrS∞ < −K ∣∣∣∣ ε0 n ε 0 ε < (5.95)t0 t0 2 Combining (5.94) and (5.95), we conclude that P {inf ttrt0 0 Sn < −Kε} < ε, hence the conclusion for (i) is also true. Next, we prove (b) using Lemma 5.3.2. Let Convn(β) be the convex hull of m(Xi, β), then we want to show that the tw{o sets∑ }n A = {0} , B = Convn(β) = vim(Xi, β); v ∈ ∆n i=1 are not separated, whe{re ∑ }n ∆n = v = (v1, . . . , vn) : vi = 1, vi ∈ (0, 1) . i=1 By Lemma 5.3.2, the statement 0 ∈ int (Convn(β)) is equivalent to ∀t0 ∈ Rp with ‖t0‖ = 1 and ∀a ≤ 0, (5.96) ∃z ∈ int (Convn(β)) such that ttr0 z > a. To prove (5.96), we will show in the following paragraphs that for any ε > 0, √ nonpositive constant a = −c/ n and vector t0 ∈ Rp with ‖t0‖ = 1, there exists Nε such that for n{> Nε, given X ∈ B∗ε∑n c ∣∣ } inf P ttr0 Vim(Xi, β) ≥ −√ ∣∣ X > 0, for X ∈ B∗ε , (5.97)t0 n i=1 112 where we consider V = v as a continuous∑ly distributed random vector in ∆n, and zn in equation (5.96) for each t0 is a value Vi(ω)m(Xi, β) for ω in the event where i=1 the probability is positive in (5.97). Note that (5.97) is true if we can prove the case when choosing a ={0, namely, for n > Nε,∑ ∣∣ }n inf P ttr0 Vim(Xi, β) > 0 ∣∣ X > 1− ε, for X ∈ B∗ε (5.98)t0 i=1 ∑n We prove (5.98) using V defined in (5.24). 
Let W ∗ = W ∗i /n and t0 be a unit i=1 vector in Rp{ , then the probabilit}y in (5.98) can be written as∑n ∣∣ P ttr V m(X , β) ≥ 0 ∣{ 0 i i ∣ Xi=1 n ∣ } 1 ∑ ∣ =P tr ∗ ∣{√ t0 Wi m(Xi, β) ≥ 0 ∣ X (5.99)n 1 ∑i=1n 1 ∑n ∣∣ } =P √ ttrq(X , β)m(X , β) + √ ttr ∗{ 0 i i 0 {Wi − q(Xi, β)}m(Xi, β) ≥ 0n n ∣ ∣ X ∑i=1 i=1 ∑ }n n ∣ √1 tr { ∗ − } ≥ −√1 ∣ =P t0 W tr ∣ i q(Xi, β) m(Xi, β) t0 q(Xi, β)m(Xi, β) ∣ X ,n n i=1 i=1 Combining (5.99) and conclusion (i) in (a), we obtain for n > Nε, any nonnegative constant c and K{ε, and X ∈ B∗ε∑ ∣∣ }n P {t tr 0 Vim(Xi, β) ≥ 0 ∣∣ X i= 1 ∑1n ∣∣ } (5.100)≥ P √ ttr0 {W ∗i − q(Xi, β)}m(X , β) ≥ K ∣∣i ε Xn i=1 ∑n Next, we show that the term n−1/2 ttr ∗0 {Wi − q(Xi, β)}m(Xi, β) in (5.100) sat- i=1 isfies equation (5.67) in Lemma 5.3.1, with the choice of δ = 1, Zi = t tr ∗ 0 {Wi − q(Xi, β)}m(Xi, β), 113 by (5.23) { } µ = E {W ∗ − q(X , β)} · ttri { i i 0 m(Xi, β)|Xi }= 0; σ2i = v{ar {W ∗ i − q(X , β)} · ttr tri 0 m}(Xi, β)|Xi = t0 q(X , β) 2m(X , β)⊗2i i t0. ∑n Let s2 = ttr q(X , β)2m(X , β)⊗2n 0 i i t0, then the left hand side of (5.67) becomes i=1 n 1 ∑ { E ‖ {W ∗ } (i − q(Xi, β)} · t tr 0 m(Xi, β))‖ 3|X (3 ) (isn i=1 ∑n ) (5.101) ≤ 2 1 1− · ‖q(Xi, β)m(Xi, β)‖ 3 · √ n 3/2s3n n ni=1 ∑n By Lemma 5.3.1, givenX ∈ B∗ε , it follows that n−1/2 ttr {W ∗0 i − q(Xi, β)}m(Xi, β) i=1 is asymptotically normal, i.e., for any unit vector s ∈ Rp, ∑n −1/2 D ( )n {W ∗ βi − q(Xi, β)} · strm(X , β) −−→0 N 0, str ∗i Σ (β, β0)s , (5.102) i=1 with probability 1. Note that by the strong law of large numbers, as n goes to infinity, W ∗ converges to EW ∗1 = 1 almost surely, therefore for any n > Nε, positive constant{Kε, a[nd X ∈ B ∗ ε 1 ∑ ]n ∣∣ } P inf √ {W ∗i − q(Xi, β)} strm(Xi, β) ≥ K ∣ε ∣ X > 0. 
(5.103)s n i=1 Combining (5.99), (5.100) and (5.103), we get that for any n > Nε, constant a = √ −c/ n, {unit vector t ∈ Rp0 , and∣X ∈}B ∗ ε , ∑n ∣ P tr{t0 Vim(Xi, β) ≥ 0 ∣∣ X i=1 ∑ ∣∣ } (5.104)n ≥ P inf √1 {W ∗ − q(X , β)} stri i m(X , β) ≥ K ∣i ε ∣ X > 0,s∈Rp:‖s‖=1 n i=1 and hence we complete the proof.  114 5.3.2 Proofs Under the EEE Setting In order to prove Proposition 5.2.1, we first prove the following lemma. Lemma 5.3.4 Let {U ni,j}i,j=1 and {Vi}ni=1 be p-dimensional random vectors, and {Zi,j}n ni,j=1 and {Wi}i=1 be random variables with finite means and variances, respec- tively. Suppose that Wi > 0 a.s., and that as n→∞, n n 1 ∑ 1 ∑P P sup (Ui,j − Vi) −→ 0 and sup (Zi,j −Wi) −→ 0, (5.105) i n i n j=1 j=1 If for any i, there exists∥∥a∑constant∥∥b ∈ R such that∥ n∣∣∣∑j=1 Ui,j∥∣∣ ≤ ‖Vi‖b and ≤ b (5.106)n |Wi| j=1 Zi,j∣ then for any i, as n→∞, ∥∥∥∥∑∑ ∥n∥ ∥j=1 Ui,j Vi ∥ Pn − ∥ −→ 0. (5.107)j=1 Zi,j Wi∥ Proof: First, by the triangle inequality and (5.106), we know that for any i, the left hand side of ∥(5.10∑7) must be bounded; al∑so, note that it equates to∥∥∥∥ ∥ W ni j=1(Ui,j − Vi)∑/n− V ni j=1(Zi,j −W )/n∥∥i W n ∥ , i j=1 Zi,j/n ∥ which is bound∥ed∥∥∥∑ by ∥ ∥ ∥ ∥ ∥n ∑ ∥ j∑=1(U ∥ ∥ n ∥ i,j − Vi)/n∥∥∥ ∥∥∥ Vi ∥∥∥ · ∥∥∥ j=∑1(Zi,j −Wi)/n∥n + ∥ (5.108)j=1 Zi,j/n W ni j=1 Zi,j/n ∥ Moreover, notice that by the law of large numbers, for any i = 1, 2, . . . , ∑n1 P Zi,j −→Wi, a.s., (5.109) n j=1 so the conclusion follows by (5.105), (5.106) and (5.109). 
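Lemma 5.3.4 is the ratio-of-sums convergence used repeatedly in the EEE proofs. The following toy numeric sketch (not from the dissertation, and in Python rather than the R used in Chapter 7; scalar case $p = 1$ with illustrative constants $V_i = 2$ and $W_i = 4$) illustrates the conclusion (5.107): the ratio of the two sums approaches $V_i/W_i$ as $n$ grows.

```python
import random

def ratio_error(n, seed=0):
    """Compare (sum_j U_ij) / (sum_j Z_ij) with V_i / W_i for one fixed i.

    Toy scalar case: U_ij = V_i + noise and Z_ij = W_i + noise with W_i > 0,
    so both averages satisfy the uniform convergence (5.105)."""
    rng = random.Random(seed)
    V_i, W_i = 2.0, 4.0                      # fixed "limit" quantities (illustrative)
    U = [V_i + rng.gauss(0, 1.0) for _ in range(n)]
    Z = [W_i + rng.gauss(0, 0.5) for _ in range(n)]
    return abs(sum(U) / sum(Z) - V_i / W_i)  # left-hand side of (5.107)

# The discrepancy shrinks as n grows, as (5.107) asserts.
err_small, err_large = ratio_error(50), ratio_error(50000)
```

The decomposition in the proof explains why: the error splits into two terms, each driven by an average of centered noise, exactly as in (5.108).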
Proof of Proposition 5.2.1 Note that we can write the left-hand side of (5.34) as

\[
\frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2 \Big\{ C(X_i) - \frac{\sum_{j=1}^n C(X_j)\, k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} \Big\}^{\otimes 2} \tag{5.110}
\]
\[
= \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, C(X_i)^{\otimes 2} \tag{5.111}
\]
\[
\quad - \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, \frac{\sum_{j=1}^n C(X_i) C(X_j)^{tr} k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} \tag{5.112}
\]
\[
\quad - \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, \frac{\sum_{j=1}^n C(X_j) C(X_i)^{tr} k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} \tag{5.113}
\]
\[
\quad + \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, \frac{\big\{ \sum_{j=1}^n C(X_j)\, k(X_i,X_j,\beta_0) \big\}^{\otimes 2}}{\big\{ \sum_{j=1}^n k(X_i,X_j,\beta_0) \big\}^2}, \tag{5.114}
\]

and the right-hand side of (5.34) as

\[
\Sigma = E\Big\{ Q(Y_1,\beta_0)^2 \Big[ C(Y_1) - \frac{E\{C(Y_2)k(Y_1,Y_2,\beta_0) \mid Y_1\}}{E\{k(Y_1,Y_2,\beta_0) \mid Y_1\}} \Big]^{\otimes 2} \Big\}
\]
\[
= E\big\{ Q(Y_1,\beta_0)^2\, C(Y_1)^{\otimes 2} \big\} \tag{5.111$'$}
\]
\[
\quad - E\Big\{ Q(Y_1,\beta_0)^2\, \frac{E\{C(Y_1)C(Y_2)^{tr} k(Y_1,Y_2,\beta_0) \mid Y_1\}}{E\{k(Y_1,Y_2,\beta_0) \mid Y_1\}} \Big\} \tag{5.112$'$}
\]
\[
\quad - E\Big\{ Q(Y_1,\beta_0)^2\, \frac{E\{C(Y_2)C(Y_1)^{tr} k(Y_1,Y_2,\beta_0) \mid Y_1\}}{E\{k(Y_1,Y_2,\beta_0) \mid Y_1\}} \Big\} \tag{5.113$'$}
\]
\[
\quad + E\Big\{ Q(Y_1,\beta_0)^2\, \frac{\big[ E\{C(Y_2)k(Y_1,Y_2,\beta_0) \mid Y_1\} \big]^{\otimes 2}}{\big[ E\{k(Y_1,Y_2,\beta_0) \mid Y_1\} \big]^2} \Big\}. \tag{5.114$'$}
\]

By the law of large numbers, (5.111) converges to (5.111$'$) in probability. Now, let us show that (5.112) converges to (5.112$'$) in probability. Note that by adding and subtracting the term

\[
Q(X_i,\beta_0)^2\, \frac{E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}}, \tag{5.115}
\]

we can write (5.112) as the sum of

\[
A = \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2 \Big\{ \frac{\sum_{j=1}^n C(X_i)C(X_j)^{tr} k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} - \frac{E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}} \Big\} \tag{5.112.a}
\]

and

\[
B = \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, \frac{E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}}. \tag{5.112.b}
\]

By the LLN, (5.112.b) converges to (5.112$'$) in probability. Next, we show that (5.112.a) converges to zero in probability. Let

\[
d(X_i,\mathcal{X},\beta_0) = \frac{\sum_{j=1}^n C(X_i)C(X_j)^{tr} k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} - \frac{E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}}. \tag{5.116}
\]

First, we apply Lemma 5.3.4 to (5.116) by setting

\[
U_{i,j} = C(X_i)C(X_j)^{tr} k(X_i,X_j,\beta_0), \qquad V_i = E\{ C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i \},
\]
\[
Z_{i,j} = k(X_i,X_j,\beta_0), \qquad W_i = E\{ k(X_i,Y_2,\beta_0) \mid X_i \}. \tag{5.117}
\]

The assumption (5.106) in Lemma 5.3.4 is guaranteed by Lemma 4.2.1 under assumptions (A.2), (A.9) and (A.10). By (A.2), there exists a constant $b$ such that $\|C(x)\| \le b$ for all $x$, hence

\[
\frac{\|\sum_{j=1}^n U_{i,j}\|}{|\sum_{j=1}^n Z_{i,j}|} = \frac{\|\sum_{j=1}^n C(X_i)C(X_j)^{tr} k(X_i,X_j,\beta_0)\|}{|\sum_{j=1}^n k(X_i,X_j,\beta_0)|} \le b^2,
\]
\[
\frac{\|V_i\|}{|W_i|} = \frac{\| E\{C(X_i)C(Y_2)^{tr} k(X_i,Y_2,\beta_0) \mid X_i\} \|}{| E\{k(X_i,Y_2,\beta_0) \mid X_i\} |} \le b^2\, \frac{E\{k(X_i,Y_2,\beta_0) \mid X_i\}}{E\{k(X_i,Y_2,\beta_0) \mid X_i\}} = b^2.
\]

Therefore the conclusion (5.107) of Lemma 5.3.4 holds. Moreover, notice that (i) $C(x)$ and $k(x,y,\beta)$ are both bounded by a fixed constant under assumptions (A.2) and (A.9); (ii) the support of $X_1$ and $U_{\beta_0}$ are both bounded. The convergences in (4.15) and (4.14) actually hold in $L^r$ norm for $r = 1, 2, \dots$; therefore

\[
E\{ \| d(X_i,\mathcal{X},\beta_0) \| \mid X_i \} \xrightarrow{P} 0, \quad \text{for any } i. \tag{5.118}
\]

Note that as a random variable, the left-hand side of (5.118) also satisfies (i) and (ii), hence it also converges in $L^r$ norm, and

\[
E\{ \| d(X_i,\mathcal{X},\beta_0) \| \} \to 0 \quad \text{and} \quad E\{ \| d(X_i,\mathcal{X},\beta_0) \|^2 \} \to 0, \quad \text{for any } i. \tag{5.119}
\]

Recall that under assumption (A.2), $|Q(x,\beta_0)| < M < \infty$ for any $x$, hence

\[
\|A\| = \Big\| \frac{1}{n}\sum_{i=1}^n Q(X_i,\beta_0)^2\, d(X_i,\mathcal{X},\beta_0) \Big\| \le M^2 \cdot \frac{1}{n}\sum_{i=1}^n \| d(X_i,\mathcal{X},\beta_0) \|. \tag{5.120}
\]

Therefore for any $\varepsilon > 0$,

\[
P\{ \|A\| > \varepsilon \} \le P\Big\{ M^2 \cdot \frac{1}{n}\sum_{i=1}^n \| d(X_i,\mathcal{X},\beta_0) \| > \varepsilon \Big\}
\le \frac{M^2}{\varepsilon} \Big( E\big\{ \| d(X_1,\mathcal{X},\beta_0) \|^2 \big\} \Big)^{1/2} \to 0, \tag{5.121}
\]

where the last inequality follows from Chebyshev's inequality, and the convergence to zero from (5.119). Hence the conclusion follows by the definition of convergence in probability. $\Box$

Proof of Lemma 5.2.1 First, let us show that (a) is true. Under assumptions (A.2) and (A.9), applying Proposition 5.2.1, we know that $n^{-1}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta_0)^{\otimes 2}$ converges to $\Sigma$ in probability. Therefore the conclusions for (5.37) and (5.39) are true. By Proposition 4.2.2, the conclusion for (5.38) is true. As for the conclusion in (5.36), it is guaranteed by the asymptotic normality of $n^{-1/2}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta_0)$ shown in Lemma 4.2.2.

To prove (b), we follow the steps in the proof of part (b) in Lemma 5.1.1.
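As a numeric sanity check on Proposition 5.2.1, the following sketch (illustrative only, not from the dissertation; Python, scalar case, with the hypothetical choices of a constant kernel $k \equiv 1$, $Q(x) = 1 + x$, $C(x) = x$, and $X \sim \mathrm{Uniform}(0,1)$) compares the plug-in average $n^{-1}\sum_i m_n(X_i,\mathcal{X},\beta_0)^{\otimes 2}$ with its limit. With these choices the conditional-expectation ratio reduces to $EC(X) = 1/2$, and the limit is $E\{(1+X)^2 (X - 1/2)^2\} = 0.2$.

```python
import random

def plugin_variance(n, seed=1):
    """Toy scalar check of Proposition 5.2.1: with the constant kernel
    k(x, y, beta) = 1, the EEE summand reduces to
        m_n(X_i, X, beta) = Q(X_i) * (C(X_i) - mean_j C(X_j)),
    and n^{-1} sum_i m_n(X_i, X, beta)^2 should approach
        Sigma = E[ Q(X)^2 (C(X) - E C(X))^2 ]."""
    rng = random.Random(seed)
    X = [rng.uniform(0, 1) for _ in range(n)]
    Q = lambda x: 1.0 + x                 # illustrative bounded Q
    C = lambda x: x                       # illustrative bounded C
    Cbar = sum(C(x) for x in X) / n       # sample analogue of the ratio term
    m = [Q(x) * (C(x) - Cbar) for x in X]
    return sum(v * v for v in m) / n

# For X ~ Uniform(0,1): Sigma = E[(1+X)^2 (X-1/2)^2] = 0.2 exactly.
```

The constant kernel is chosen only to make the limit computable in closed form; in the dissertation's survival models, $Q$, $C$ and $k$ encode the censored-data structure.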
Recall that by Lemma 5.3.2, the conclusion that $0$ belongs to $\mathrm{int}(\mathrm{Conv}_n(\beta_0))$, the convex hull of the $m_n(X_i,\mathcal{X},\beta_0)$, can be drawn by showing that for any $\varepsilon > 0$ and unit vector $t_0 \in \mathbb{R}^p$ there exists $N_\varepsilon$ such that for $n > N_\varepsilon$, given $\mathcal{X}_\infty \in C^*_\varepsilon$,

\[
\inf_{t_0}\, P\Big\{ t_0^{tr} \sum_{i=1}^n V_i\, m_n(X_i,\mathcal{X},\beta_0) \ge 0 \,\Big|\, \mathcal{X}_\infty \Big\} > 1 - \varepsilon, \quad \text{for } \mathcal{X}_\infty \in C^*_\varepsilon,
\]

where $V = v$ is as constructed in (5.76). Note that with $m(X_i,\beta_0)$ replaced by the term $m_n(X_i,\mathcal{X},\beta_0)$, (5.77) and (5.78) are both true, and (5.79) is true by Proposition 5.2.1 and Lemma 4.2.2. Therefore the conditions for the Lyapunov central limit theorem mentioned in Lemma 5.3.1 are verified with respect to the $\{W_i\}$ variables under the extended estimating equation setting, and given $\mathcal{X}_\infty \in C^*_\varepsilon$, it follows that

\[
n^{-1/2}\, t_0^{tr} \sum_{i=1}^n (W_i - 1)\, m_n(X_i,\mathcal{X},\beta_0)
\]

is asymptotically normal, i.e., for any unit vector $s \in \mathbb{R}^p$,

\[
n^{-1/2} \sum_{i=1}^n (W_i - 1)\, s^{tr} m_n(X_i,\mathcal{X},\beta_0) \xrightarrow{D} N(0,\ s^{tr}\Sigma s), \quad \text{with probability } 1.
\]

Therefore (5.81) and (5.82) are both true with $m(X_i,\beta_0)$ replaced by $m_n(X_i,\mathcal{X},\beta_0)$, and we finish the proof. $\Box$

Proof of Lemma 5.2.2 For $\varepsilon$ smaller than $\lambda_0$, the smallest eigenvalue of $\Sigma$, on the event $C^+_\varepsilon$, for any fixed $\beta \in U_{\beta_0}$,

\[
\Big\| \frac{1}{n}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta)^{\otimes 2} - \Sigma(\beta,\beta_0) \Big\| < \frac{\varepsilon}{2}.
\]

Since $\beta \in U_{\beta_0}$, we know that $\Sigma(\beta,\beta_0)$ is positive definite, hence

\[
\frac{1}{n}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta)^{\otimes 2} \ \text{is positive definite for } \beta \in U_{\beta_0}. \tag{5.122}
\]

We show by contradiction that the conclusion is true for any fixed $\beta \in U_{\beta_0}$. Suppose that for a fixed $\beta \in U_{\beta_0}$ there exist distinct $t_1$ and $t_2$ such that $g(t_1,\mathcal{X},\beta) = g(t_2,\mathcal{X},\beta) = 0$. Then by Rolle's theorem from [15], there exists $s \in (0,1)$ such that for $t_3 = s t_1 + (1-s)t_2$, $\partial g(t_3,\mathcal{X},\beta)/\partial s = 0$, i.e.,

\[
(t_1 - t_2)^{tr} \Big\{ \frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta)^{\otimes 2}}{(1 + t_3^{tr} m_n(X_i,\mathcal{X},\beta))^2} \Big\} (t_1 - t_2) = 0, \tag{5.123}
\]

which implies that $\frac{1}{n}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta)^{\otimes 2}$ is singular, contradicting (5.122). $\Box$
Proof of Theorem 5.2.1 Define $C$ in the following way:

\[
C \equiv \{ \beta^* \in U_{\beta_0} : \exists\, r_{\beta^*} > 0 \text{ such that for any } \beta \in B_{r_{\beta^*}}(\beta^*) \text{ there exists } t = t(\beta) \text{ such that } g(t,\mathcal{X},\beta) = 0 \}, \tag{5.124}
\]

where

\[
g(t,\mathcal{X},\beta) = \frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta)}{1 + t^{tr} m_n(X_i,\mathcal{X},\beta)}. \tag{5.125}
\]

First, we show that for any fixed $\beta^*$ such that the equation $g(t,\mathcal{X},\beta^*) = 0$ has a solution $t = t^*$, we have $\beta^* \in C$. By (5.125), $g(t^*,\mathcal{X},\beta^*) = 0$ implies that $0$ belongs to $\mathrm{int}(\mathrm{Conv}_n(\beta^*))$; consequently, there exists a unique solution to the maximizing problem (5.41) with the choice of $\beta = \beta^*$. Since this unique solution is bounded by $0$ and $1$, i.e.,

\[
\frac{1}{n} \cdot \frac{1}{1 + t^{* tr} m_n(X_i,\mathcal{X},\beta^*)} \in (0,1) \quad \text{for } i = 1,\dots,n,
\]

we know that $t^* \in D_{\beta^*}(\mathcal{X})$. Since $g(t^*,\mathcal{X},\beta^*) = 0$ and $\partial g(t,\mathcal{X},\beta)/\partial t \,|_{(t^*,\beta^*)}$ is negative definite, by the implicit function theorem there exists $r_{\beta^*}$ such that for $\beta \in B_{r_{\beta^*}}(\beta^*)$, the equation $g(t,\mathcal{X},\beta) = 0$ has a solution $t = t(\beta)$. Therefore $\beta^* \in C$.

In particular, since we have shown in (5.45) that when $\beta = \beta_0$ there exists $t_0$ such that $g(t_0,\mathcal{X},\beta_0) = 0$, the conclusion of the previous paragraph holds for $\beta = \beta_0$, i.e., $\beta_0 \in C$.

Next, we show that $C = U_{\beta_0}$, i.e., any $\beta^* \in U_{\beta_0}$ also belongs to $C$. Let

\[
\rho_{\max} = \sup\{ s : \beta_0 + r(\beta^* - \beta_0) \in C \text{ for all } 0 < r < s \}. \tag{5.126}
\]

We show by contradiction that $\rho_{\max} \ge 1$. Suppose instead that $\rho_{\max} < 1$. By equation (5.126),

\[
\beta = \beta_0 + r(\beta^* - \beta_0) \in C \quad \text{for all } r < \rho_{\max}. \tag{5.127}
\]

Let $\{r_k\}_{k=1}^\infty$ be an increasing sequence of positive numbers bounded by $\rho_{\max}$, and let

\[
\beta_k = \beta_0 + r_k(\beta^* - \beta_0), \quad 0 < r_k < \rho_{\max}. \tag{5.128}
\]

Therefore by (5.127), $\beta_k \in C$, which together with (5.124)--(5.126) implies that for each $\beta_k$,

\[
\frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta_k)}{1 + t(\beta_k)^{tr} m_n(X_i,\mathcal{X},\beta_k)} = 0 \quad \text{and} \quad 1 + t(\beta_k)^{tr} m_n(X_i,\mathcal{X},\beta_k) \ge \frac{1}{n}. \tag{5.129}
\]

Let

\[
\beta_{\max} = \lim_{k\to\infty} \beta_k = \beta_0 + \rho_{\max}(\beta^* - \beta_0); \tag{5.130}
\]

then, taking the limit of (5.129) as $k$ goes to $\infty$,

\[
\frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta_{\max})}{1 + t(\beta_{\max})^{tr} m_n(X_i,\mathcal{X},\beta_{\max})} = 0. \tag{5.131}
\]

Therefore $0$ belongs to $\mathrm{int}(\mathrm{Conv}_n(\beta_{\max}))$. Consequently, there exists a unique solution to the maximizing problem (5.41) with $\beta = \beta_{\max}$. Since the unique solution is in $(0,1)$, we know that $t(\beta_{\max}) \in D_{\beta_{\max}}(\mathcal{X})$. As in the previous proof, $\beta_{\max} \in C$. By the definition of $C$, there exists $r_{\max}$ such that for all $\beta \in B_{r_{\max}}(\beta_{\max})$, $t = t(\beta)$ solves $g(t,\mathcal{X},\beta) = 0$. Therefore $\beta = \beta_0 + r(\beta^* - \beta_0) \in C$ with $r = \rho_{\max} + r_{\max}/4 > \rho_{\max}$, which contradicts the definition of $\rho_{\max}$; hence $\rho_{\max} \ge 1$ and we conclude that $\beta^* \in C$. $\Box$

Proof of Theorem 5.2.2 Let the $U_i$ be i.i.d. random variables with a uniform distribution over the span of $t_0^{tr}\mathcal{X}$, independent of $\mathcal{X}_\infty$; hence by definition there exists a constant $M$ such that $|U_i| \le M$. Define

\[
V^*_i = \frac{U^*_i}{\sum_{j=1}^n U^*_j}, \quad \text{where } U^*_i = U_i\, q(X_i,\beta), \quad i = 1,\dots,n. \tag{5.132}
\]

Write

\[
\frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} m_n(X_i,\mathcal{X},\beta)
= \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} V(X_i,\beta)
+ \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} \big( m_n(X_i,\mathcal{X},\beta) - V(X_i,\beta) \big). \tag{5.133}
\]

Note that on the event $C^+_\varepsilon$, by the discussion between pages 100 and 102 and the boundedness of the $U^*_i$, the second term in (5.133) is $O_P(1)$, namely,

\[
\Big| \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} \big( m_n(X_i,\mathcal{X},\beta) - V(X_i,\beta) \big) \Big|
\le \frac{1}{\sqrt n}\sum_{i=1}^n M \cdot \frac{K_\varepsilon}{\sqrt n} = M K_\varepsilon. \tag{5.134}
\]

Therefore on the event $C^+_\varepsilon$, for a fixed $\beta \in U_{\beta_0}$ and any unit vector $t_0 \in \mathbb{R}^p$,

\[
P\Big\{ \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} m_n(X_i,\mathcal{X},\beta) > 0 \Big\}
\ge P\Big\{ \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} V(X_i,\beta) > M K_\varepsilon \Big\}
= P\Big\{ \frac{1}{\sqrt n}\sum_{i=1}^n U_i\, q(X_i,\beta)\, t_0^{tr} V(X_i,\beta) > M K_\varepsilon \Big\}. \tag{5.135}
\]

Using the same procedure as in the proof of Lemma 5.1.3, for $n > N_\varepsilon$ and $\mathcal{X} \in C^+_\varepsilon$, by the Lyapunov central limit theorem mentioned in Lemma 5.3.1,

\[
P\Big\{ \inf_{s \in \mathbb{R}^p : \|s\|=1} \frac{1}{\sqrt n}\sum_{i=1}^n U_i\, q(X_i,\beta)\, t_0^{tr} V(X_i,\beta) > M K_\varepsilon \Big\} > 0. \tag{5.136}
\]

Combining (5.135) and (5.136) yields

\[
P\Big\{ \frac{1}{\sqrt n}\sum_{i=1}^n U^*_i\, t_0^{tr} m_n(X_i,\mathcal{X},\beta) > 0 \Big\} > 0. \tag{5.137}
\]

Note that equation (5.137) implies that, under the conditions of this theorem, $0 \in \mathrm{int}(\mathrm{Conv}_n(\beta))$ with the weights $V^*_i$ defined in (5.132) with positive probability. Hence we complete the proof. $\Box$
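The convex-hull arguments in this section all rest on the separating-direction criterion of Lemma 5.3.2 (cf. (5.96)): $0$ lies in the interior of the convex hull of the criterion-function values only when no direction separates it from them. A toy two-dimensional numeric check (illustrative only, not from the dissertation; Python, with a finite grid of directions standing in for all unit vectors $t_0$):

```python
import math, random

def zero_in_hull_2d(points, n_dirs=360):
    """Necessary condition from the separating-direction criterion (5.96):
    0 lies in the interior of the convex hull of `points` only if, for every
    unit direction t0, some point m_i satisfies t0 . m_i > 0.  (In 2-D,
    scanning a fine grid of directions makes this check essentially exact.)"""
    for k in range(n_dirs):
        a = 2 * math.pi * k / n_dirs
        t0 = (math.cos(a), math.sin(a))
        if not any(t0[0] * x + t0[1] * y > 0 for (x, y) in points):
            return False      # t0 separates {0} from the point cloud
    return True

# Centered point clouds typically contain 0 in their hull; shifted ones do not.
rng = random.Random(2)
centered = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
shifted = [(x + 10.0, y) for (x, y) in centered]
```

This is exactly the geometry behind requiring, in (5.98) and its EEE analogue, that every direction $t_0$ admit a random weighting $V$ with positive projection.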
Chapter 6: Empirical Likelihood Applied to Extended Estimating Equations

Consider $d$-dimensional independently identically distributed (i.i.d.) observations

\[
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} P_\theta, \tag{6.1}
\]

with support $\mathcal{X}$, and parameters

\[
\theta = (\beta, \nu) \in \mathbb{R}^p \times \mathcal{H}, \tag{6.2}
\]

where $\mathcal{H}$ is an infinite-dimensional space such as a function space. The Empirical Likelihood (EL) method is an estimation method that maximizes the empirical distribution subject to constraints. For example, in a classic setting, if there exists an estimating function

\[
m(x,\beta) : \mathbb{R}^d \times \mathbb{R}^p \mapsto \mathbb{R}^r \tag{6.3}
\]

such that $E_{\beta_0}\{m(X_1,\beta_0)\} = 0$, then the EL method seeks the solution to the maximization problem, for a fixed $\beta$ in a neighbourhood of $\beta_0$,

\[
\arg\max_p \prod_{i=1}^n p_i, \quad \text{where } p = (p_1,\dots,p_n), \qquad
\text{subject to } \sum_{i=1}^n p_i = 1,\ p_i \in (0,1),\ \sum_{i=1}^n p_i\, m(X_i,\beta) = 0. \tag{6.4}
\]

The solution of (6.4), given by $\hat p_i(\beta, X_i)$, can be used to construct the Profile Empirical Likelihood (pEL). A point estimator for $\beta_0$ can be constructed by maximizing the pEL, or equivalently by minimizing its negative logarithm, i.e.,

\[
\hat\beta_n = \arg\min_\beta l(\beta), \quad \text{where } l(\beta) = -\sum_{i=1}^n \ln\big( n\,\hat p_i(\beta, X_i) \big). \tag{6.5}
\]

The EL method has been extensively researched in the statistical literature. Owen in [32] established the Wilks-type confidence region when the dimension of $\beta_0$ equals the dimension of $m(x,\beta)$, i.e., $r = p$. The Wilks-type Confidence Region (CR) does not require calculating the variance-covariance matrix of $m(X_1,\beta_0)$ or $E\{\nabla_\beta m(X_1,\beta_0)\}$, and can usually provide a narrower CR than Wald-type statistics. Another convenient feature of the EL method is that it can bypass some regularity conditions that are essential for estimating equations. For example, the embedded constraint $m(x,\beta)$ can have higher dimension than the parameter, i.e., $r > p$; this result can be found in Qin and Lawless's work in [33]. Continuity is a necessary condition for the existence of a solution to the estimating equation $S_n(\beta) = 0$, where

\[
S_n(\beta) = \sum_{i=1}^n m(X_i,\beta),
\]

and thus cannot be ignored.
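The inner maximization (6.4) is typically computed through its Lagrange dual: the maximizer has the form $\hat p_i = 1/\{n(1 + t^{tr} m(X_i,\beta))\}$, with $t$ chosen so that the mean-zero constraint holds (see Owen [32]). A minimal scalar sketch (illustrative only, not from the dissertation; Python with toy data, Newton's method on the dual):

```python
def el_weights(m, tol=1e-10, max_iter=100):
    """Solve the inner EL problem (6.4) for scalar constraint values m_i = m(X_i, beta).

    Lagrange duality gives p_i = 1 / (n * (1 + t * m_i)) with t solving
    sum_i m_i / (1 + t * m_i) = 0; we find t by Newton's method.
    Requires 0 in the interior of the convex hull: min(m) < 0 < max(m)."""
    n, t = len(m), 0.0
    for _ in range(max_iter):
        f = sum(mi / (1 + t * mi) for mi in m)                # constraint residual
        fp = -sum(mi * mi / (1 + t * mi) ** 2 for mi in m)    # its derivative in t
        step = f / fp
        t -= step
        if abs(step) < tol:
            break
    return [1.0 / (n * (1 + t * mi)) for mi in m]

# Example: m_i = X_i - beta for a candidate beta inside the data range.
X = [0.3, 1.1, 2.0, 2.7, 3.5]
p = el_weights([x - 2.0 for x in X])
```

At the dual root, the constraints of (6.4) hold automatically: the weights sum to one and the weighted estimating function is zero, which is why only the single equation in $t$ needs to be solved.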
Under the CEE setting, Owen, and Qin and Lawless, both assumed that the criterion function $m(x,\cdot)$ is continuous in [32] and [33]. However, we found that in some prominent right-censored semiparametric models, such as the $\phi$-transformation model and the Cox model, $m_n(x,\mathbf{x},\cdot)$ is usually discontinuous due to the appearance of indicator functions. Therefore in Chapter 5, we show that under the Classic Estimating Equation (CEE) setting with continuous or discontinuous $m(x,\cdot)$, and under the Extended Estimating Equation (EEE) setting with continuous $m_n(x,\mathbf{x},\cdot)$, there exists $U_{\beta_0}$, a non-shrinking neighbourhood of $\beta_0$, such that for any $\beta \in U_{\beta_0}$, (6.4) has a unique solution with probability approaching $1$. The same result can be shown for the EEE when $m_n(X_i,\mathcal{X},\cdot)$ loses continuity, if we restrict to the $\phi$-transformation model discussed in Chapter 3, which to our knowledge is the broadest type of semiparametric model whose structural parameter can be estimated via an EEE.

6.1 Empirical Likelihood of the Extended Estimating Equation

In this section, we consider i.i.d. observations

\[
X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} P_\theta, \quad \theta = (\beta,\nu) \in \mathbb{R}^p \times \mathcal{H}, \tag{6.6}
\]

described in (6.1)--(6.2). In Definition 1.2.1 on page 6, we defined the EEE as

\[
S_n(\beta) = \sum_{i=1}^n m_n(X_i,\mathcal{X},\beta) = 0, \quad \beta \in U_{\beta_0}, \tag{6.7}
\]

where $Q(x,\beta) : \mathbb{R}^d \times \mathbb{R}^p \mapsto \mathbb{R}$, $k(x,y,\beta) : \mathbb{R}^d \times \mathbb{R}^d \times \mathbb{R}^p \mapsto \mathbb{R}^+$, $C(x) : \mathbb{R}^d \mapsto \mathbb{R}^p$, and

\[
m_n(X_i,\mathcal{X},\beta) = Q(X_i,\beta) \Big\{ C(X_i) - \frac{\sum_{j=1}^n C(X_j)\, k(X_i,X_j,\beta)}{\sum_{j=1}^n k(X_i,X_j,\beta)} \Big\}, \tag{6.8}
\]

with $E_{\beta_0,\nu}\{ m_n(X_1,\mathcal{X},\beta_0) \} = 0$ for all $\nu \in \mathcal{H}$, and $\tilde\beta_n$ the unique solution to (6.7). In the following sections, we investigate the asymptotic normality of $\hat\beta_n$, the pEL estimator, and compare its asymptotic variance matrix with that of $\tilde\beta_n$.
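For concreteness, the EEE summand (6.8) is straightforward to evaluate: the conditional-expectation ratio is replaced by a ratio of kernel-weighted sums over the sample. A toy sketch (illustrative only, not from the dissertation; Python, with hypothetical choices of $Q$, $C$ and a Gaussian kernel $k$ that do not come from any of the dissertation's survival models, and in which $\beta$ is carried only for the interface):

```python
import math

def eee_summand(i, X, beta, Q, C, k):
    """Evaluate m_n(X_i, X, beta) of (6.8): the i-th EEE summand, in which
    E{C(X_2) k(X_1, X_2, beta) | X_1} / E{k(X_1, X_2, beta) | X_1}
    is replaced by a ratio of sums over the whole sample."""
    num = sum(C(xj) * k(X[i], xj, beta) for xj in X)
    den = sum(k(X[i], xj, beta) for xj in X)
    return Q(X[i], beta) * (C(X[i]) - num / den)

def eee(X, beta, Q, C, k):
    """S_n(beta) of (6.7): the sum of the summands over i."""
    return sum(eee_summand(i, X, beta, Q, C, k) for i in range(len(X)))

# Illustrative (not from the dissertation): Q constant, C the identity,
# Gaussian kernel in x - y.
Q = lambda x, beta: 1.0
C = lambda x: x
k = lambda x, y, beta: math.exp(-0.5 * (x - y) ** 2)
```

With these symmetric toy choices, each summand measures the deviation of $X_i$ from its local kernel-weighted mean, so $S_n$ vanishes on data symmetric about zero, which makes a convenient smoke test.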
6.2 Asymptotic Normality Associated with the EEE

In this section, we discuss the asymptotic normality of $\tilde\beta_n$, obtained by solving the EEE, and of $\hat\beta_n$, obtained by minimizing the negative logarithm of the pEL, i.e.,

\[
\tilde\beta_n:\ \text{the solution to } S_n(\beta) = 0, \ \text{where } S_n(\beta) = \sum_{i=1}^n m_n(X_i,\mathcal{X},\beta); \qquad
\hat\beta_n = \arg\min_\beta \sum_{i=1}^n \ln\big( 1 + t^{tr}(\beta)\, m_n(X_i,\mathcal{X},\beta) \big). \tag{6.9}
\]

First we compute the asymptotic variance matrix of $\sqrt n\, S_n(\beta_0)$, which leads to the sandwich-form variance of $\tilde\beta_n$. Then we compute the variance of $\hat\beta_n$ following the idea of Qin and Lawless's Lemma 1 and Theorem 1 in [33]. At the end of this section, we show that when the dimension $r$ of the EEE equals the dimension of $\beta_0$, the two variances for $\tilde\beta_n$ and $\hat\beta_n$ are identical.

Theorem 6.2.1 Under assumptions (A.1)--(A.10), $n^{-1/2}\sum_{i=1}^n m_n(X_i,\mathcal{X},\beta_0) \to N(0,\Sigma)$, where $\Sigma = \Sigma_1 + \Sigma_2$, with

\[
\Sigma_1 = \mathrm{Var}\Big\{ Q(X_1,\beta_0) \Big[ C(X_1) - \frac{E\{C(X_2)k(X_1,X_2,\beta_0) \mid X_1\}}{E\{k(X_1,X_2,\beta_0) \mid X_1\}} \Big] \Big\}, \tag{6.10}
\]
\[
\Sigma_2 = \mathrm{Var}\{ q_1(X_1,\beta_0)\, C(X_1) \} + \mathrm{Var}\{ q_2(X_1,\beta_0) \} - 2\Sigma_{AB},
\]

and

\[
q_1(X_1,\beta_0) = E\Big\{ \frac{Q(X_2,\beta_0)\, k(X_2,X_1,\beta_0)}{E\{k(X_2,X_3,\beta_0) \mid X_2\}} \,\Big|\, X_1 \Big\},
\]
\[
q_2(X_1,\beta_0) = E\Big\{ \frac{Q(X_2,\beta_0)\, k(X_2,X_1,\beta_0)\, E\{C(X_3)k(X_2,X_3,\beta_0) \mid X_2\}}{\big[ E\{k(X_2,X_3,\beta_0) \mid X_2\} \big]^2} \,\Big|\, X_1 \Big\},
\]
\[
\Sigma_{AB} = E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)k(X_1,X_2,\beta_0) - E\{C(X_4)k(X_1,X_4,\beta_0) \mid X_1\} \big\}
\times \frac{Q(X_3,\beta_0)\, k_c(X_3,\beta_0)}{\bar k(X_3,\beta_0)^2} \big\{ k(X_3,X_2,\beta_0) - E\{k(X_3,X_4,\beta_0) \mid X_3\} \big\} \Big\}.
\]

Asymptotic Variance-Covariance of $\tilde\beta_n$ Using the conclusion of Theorem 6.2.1, we can calculate the asymptotic variance matrix of $\tilde\beta_n$. Let $\tilde\beta_n$ be the solution to $S_n(\beta) = 0$. We can derive the sandwich-form asymptotic variance-covariance matrix of $\sqrt n\,(\tilde\beta_n - \beta_0)$ using the same Taylor-expansion method as in Section 5.3 of [41] by van der Vaart, to conclude

\[
\sqrt n\, (\tilde\beta_n - \beta_0) \xrightarrow{D} N(0, \tilde\Sigma), \tag{6.11}
\]

where, for $\Sigma$ defined in Theorem 6.2.1,

\[
\tilde\Sigma = E\{\nabla_\beta m_n(X_1,\mathcal{X},\beta_0)\}^{-1}\, \Sigma\, E\{\nabla_\beta m_n(X_1,\mathcal{X},\beta_0)\}^{-1}. \tag{6.12}
\]

Next, we discuss the asymptotic normality of $\hat\beta_n$. The following lemma is parallel to Lemma 1 of Qin and Lawless in [33].
Lemma 6.2.1 Under assumptions (A.1)--(A.10), as $n \to \infty$, with probability $1$, $l(\beta)$ is minimized at $\hat\beta_n$ in the interior of $\{\beta : \|\beta - \beta_0\| \le n^{-1/3}\}$, with $\hat\beta_n$ and $\hat t_n = t(\hat\beta_n)$ given by

\[
Q_{1n}(\hat\beta_n, \hat t_n) = 0, \qquad Q_{2n}(\hat\beta_n, \hat t_n) = 0, \tag{6.13}
\]

where

\[
Q_{1n}(\beta,t) = \frac{1}{n}\sum_{i=1}^n \frac{m_n(X_i,\mathcal{X},\beta)}{1 + t^{tr} m_n(X_i,\mathcal{X},\beta)}, \qquad
Q_{2n}(\beta,t) = \frac{1}{n}\sum_{i=1}^n \frac{1}{1 + t^{tr} m_n(X_i,\mathcal{X},\beta)} \Big( \frac{\partial m_n(X_i,\mathcal{X},\beta)}{\partial\beta} \Big)^{tr} t.
\]

The proof of this lemma is almost identical to that of Lemma 1 by Qin and Lawless in [33], once the parallel preceding lemmas of Chapter 4 have been developed. With Lemma 6.2.1, we can show the asymptotic normality of $\sqrt n\,(\hat\beta_n - \beta_0)$.

Theorem 6.2.2 Under assumptions (A.1)--(A.10), $\sqrt n\,(\hat\beta_n - \beta_0) \xrightarrow{D} N(0, V)$, where

\[
V = \big( \hat S_{22.1}^{-1}\, \hat S_{21}^{tr}\, \Sigma_1^{-1} \big)\, \Sigma\, \big( \hat S_{22.1}^{-1}\, \hat S_{21}^{tr}\, \Sigma_1^{-1} \big)^{tr}, \tag{6.14}
\]

with $\Sigma$ and $\Sigma_1$ defined in Theorem 6.2.1, $\hat S_{21} = E\{\nabla_\beta m_n(X_i,\mathcal{X},\beta_0)\}$, and

\[
\hat S_{22.1} = \hat S_{21}^{tr}\, \Sigma_1^{-1}\, \hat S_{21}. \tag{6.15}
\]

This is an extension of Theorem 1 by Qin and Lawless in [33] to the EEE setting; we follow their idea of proof after establishing the preceding parallel lemmas in Chapter 4. Now, assume $r = p$ as in all the other chapters of this thesis, where $r$ is the dimension of $m_n(X_i,\mathcal{X},\beta)$ and $p$ is the dimension of $\beta_0$. Also assume that $\hat S_{21} = E\{\nabla_\beta m_n(X_i,\mathcal{X},\beta_0)\}$ is nonsingular for $\beta \in U_{\beta_0}$, where $U_{\beta_0}$ is defined in Theorem 5.2.1. Then (6.14) becomes

\[
V = \hat S_{21}^{-1} \Sigma_1 \hat S_{21}^{-1} \cdot \hat S_{21} \Sigma_1^{-1}\, \Sigma\, \Sigma_1^{-1} \hat S_{21} \cdot \hat S_{21}^{-1} \Sigma_1 \hat S_{21}^{-1} = \hat S_{21}^{-1}\, \Sigma\, \hat S_{21}^{-1}. \tag{6.16}
\]

Comparing (6.12) and (6.16), we see that $V = \tilde\Sigma$, i.e., the asymptotic variance-covariance matrix of $\hat\beta_n$ is the same as that of $\tilde\beta_n$.

6.3 Some Proofs

Proof of Theorem 6.2.1 Note that $\sqrt n\, \bar m_n(\mathcal{X},\beta_0)$ can be split into the difference of

\[
\frac{1}{\sqrt n}\sum_{i=1}^n Q(X_i,\beta_0) \Big\{ C(X_i) - \frac{k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \Big\} \tag{6.17}
\]

and

\[
\frac{1}{\sqrt n}\sum_{i=1}^n Q(X_i,\beta_0) \Big\{ \frac{\sum_{j=1}^n C(X_j)\, k(X_i,X_j,\beta_0)}{\sum_{j=1}^n k(X_i,X_j,\beta_0)} - \frac{k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \Big\}. \tag{6.18}
\]

Since (6.17) is a sum of i.i.d. terms, under assumptions (A.1)--(A.3) and (A.5)--(A.6), by the CLT it converges to $N(0,\Sigma_1)$, where

\[
\Sigma_1 = \mathrm{Var}\Big\{ Q(X,\beta_0) \Big[ C(X) - \frac{k_c(X,\beta_0)}{\bar k(X,\beta_0)} \Big] \Big\}. \tag{6.19}
\]

Next, we show that (6.18) is also asymptotically normal. Using the same strategy as in the proof of Lemma 4.2.2, first split (6.18) into the difference of the following two terms:

\[
\frac{1}{\sqrt n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)}{n^{-1}\sum_{j=1}^n k(X_i,X_j,\beta_0)} \cdot \frac{1}{n}\sum_{j=1}^n \big\{ C(X_j)\, k(X_i,X_j,\beta_0) - k_c(X_i,\beta_0) \big\} \tag{6.18.a}
\]

and

\[
\frac{1}{\sqrt n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)\, k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0) \cdot n^{-1}\sum_{j=1}^n k(X_i,X_j,\beta_0)} \cdot \frac{1}{n}\sum_{j=1}^n \big\{ k(X_i,X_j,\beta_0) - \bar k(X_i,\beta_0) \big\}. \tag{6.18.b}
\]

By Lemma 4.2.1, we know that

\[
\sup_{x \in \mathcal{X}} \Big\| \frac{1}{n}\sum_{j=1}^n k(x,X_j,\beta_0) - \bar k(x,\beta_0) \Big\| \xrightarrow{\text{a.s.}} 0.
\]

Therefore for large $n$, (6.18.a) and (6.18.b) equal

\[
A = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \cdot \frac{1}{n}\sum_{j=1}^n \big\{ C(X_j)\, k(X_i,X_j,\beta_0) - k_c(X_i,\beta_0) \big\} \tag{6.18.a.1}
\]

and

\[
B = \frac{1}{\sqrt n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)\, k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0)^2} \cdot \frac{1}{n}\sum_{j=1}^n \big\{ k(X_i,X_j,\beta_0) - \bar k(X_i,\beta_0) \big\}, \tag{6.18.b.1}
\]

respectively. Let

\[
d\mu_n = \frac{1}{n}\sum_{i=1}^n \delta_{X_i} \quad \text{and} \quad d\mu = dF_X;
\]

then (6.18.a.1) can be rewritten as an integral with respect to the difference between $\mu_n$ and $\mu$, i.e.

\[
\frac{1}{n}\sum_{i=1}^n \frac{Q(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \int C(y)\, k(X_i,y,\beta_0)\, \sqrt n\, \{ d\mu_n(y) - d\mu(y) \}. \tag{6.20}
\]

Let

\[
d\gamma_n(y) = \sqrt n\, \{ d\mu_n(y) - d\mu(y) \}.
\]

Recall that for $i = 1,\dots,n$,

\[
\bar k(X_i,\beta_0) = E\{ k(X_i,X,\beta_0) \mid X_i \} = \int k(X_i,x,\beta_0)\, d\mu(x).
\]

Continuing to use the integrated empirical-process notation to replace the sum, we know that (6.20) is equal to

\[
\int\!\!\int \frac{Q(z,\beta_0)}{\int k(z,x,\beta_0)\, d\mu(x)}\, C(y)\, k(z,y,\beta_0)\, d\mu_n(z)\, d\gamma_n(y). \tag{6.21}
\]

Note that, as $n \to \infty$, $\mu_n \to \mu$, which indicates that the integral in (6.21) approaches

\[
\int\!\!\int \frac{Q(z,\beta_0)\, C(y)\, k(z,y,\beta_0)}{\int k(z,x,\beta_0)\, d\mu(x)}\, d\mu(z)\, d\gamma_n(y). \tag{6.22}
\]

Let

\[
q_1(y,\beta_0) = \int \frac{Q(z,\beta_0)\, k(z,y,\beta_0)}{\int k(z,x,\beta_0)\, d\mu(x)}\, d\mu(z); \tag{6.23}
\]

then (6.22) equals

\[
\int q_1(y,\beta_0)\, C(y)\, d\gamma_n(y), \tag{6.24}
\]

which converges to $N\big(0, \mathrm{Var}\{ q_1(Y,\beta_0)\, C(Y) \}\big)$. Note that for large $n$, (6.18.a) is equal to (6.18.a.1) almost surely. Therefore, combining (6.20)--(6.24), we conclude that (6.18.a) converges to $N\big(0, \mathrm{Var}\{ q_1(Y,\beta_0)\, C(Y) \}\big)$.
Similarly, (6.18.b) can be rewritten as

\[
\int\!\!\int Q(z,\beta_0)\, k(z,y,\beta_0)\, \frac{\int k(z,x,\beta_0)\, C(x)\, d\mu(x)}{\big\{ \int k(z,x,\beta_0)\, d\mu(x) \big\}^2}\, d\mu(z)\, d\gamma_n(y), \tag{6.25}
\]

which converges to $N\big(0, \mathrm{Var}\{ q_2(Y,\beta_0) \}\big)$, where

\[
q_2(y,\beta_0) = \int Q(z,\beta_0)\, k(z,y,\beta_0)\, \frac{\int k(z,x,\beta_0)\, C(x)\, d\mu(x)}{\big\{ \int k(z,x,\beta_0)\, d\mu(x) \big\}^2}\, d\mu(z). \tag{6.26}
\]

It remains to show the joint normality and the asymptotic covariance of (6.18.a) and (6.18.b), or equivalently the covariance of $A$ and $B$ defined in (6.18.a.1) and (6.18.b.1). Note that

\[
E(A \cdot B) = \frac{1}{n^3} \sum_{i=1}^n \sum_{i'=1}^n \sum_{j=1}^n \sum_{j'=1}^n
E\Big\{ \frac{Q(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \big\{ C(X_j)\, k(X_i,X_j,\beta_0) - k_c(X_i,\beta_0) \big\}
\times \frac{Q(X_{i'},\beta_0)\, k_c(X_{i'},\beta_0)}{\bar k(X_{i'},\beta_0)^2} \big\{ k(X_{i'},X_{j'},\beta_0) - \bar k(X_{i'},\beta_0) \big\} \Big\}. \tag{6.27}
\]

Let us consider the following cases, broken down according to the number of distinct elements in $I = \{i, j, i', j'\}$.

Case 1 If the number of distinct elements in $I$ is $1$, then there are $n$ identical terms to be added.

Case 2 If the number of distinct elements in $I$ is $2$, then there are $(C_4^1 + C_4^2)\, n(n-1)$ identical terms to be added.

Case 3 If the number of distinct elements in $I$ is $3$, then there are $C_4^1 \cdot n(n-1)(n-2)$ identical terms to be added.

Case 4 If the number of distinct elements in $I$ is $4$, then, due to the independence of the $X_i$'s, the expected value of the summand in (6.27) is $0$.

Since (6.27) is the summation of groups of identical terms divided by $n^3$, we only need to consider Case 3. Assume that $i \neq j$ and $i' \neq j'$ as we did before, and consider the following subcases of Case 3.

Case 3a $i = i'$: the expected values in (6.27) are all equal to

\[
E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)\, k(X_1,X_2,\beta_0) - k_c(X_1,\beta_0) \big\}
\times \frac{Q(X_1,\beta_0)\, k_c(X_1,\beta_0)}{\bar k(X_1,\beta_0)^2} \big\{ k(X_1,X_3,\beta_0) - \bar k(X_1,\beta_0) \big\} \Big\}. \tag{6.28}
\]

We can show that (6.28) is identically $0$ by conditioning on $X_1$.
Case 3b $i = j'$: the expected values in (6.27) are all equal to

\[
E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)\, k(X_1,X_2,\beta_0) - k_c(X_1,\beta_0) \big\}
\times \frac{Q(X_3,\beta_0)\, k_c(X_3,\beta_0)}{\bar k(X_3,\beta_0)^2} \big\{ k(X_3,X_1,\beta_0) - \bar k(X_3,\beta_0) \big\} \Big\}. \tag{6.29}
\]

We can show that (6.29) is identically $0$ by conditioning on $X_1$ and $X_3$.

Case 3c $j = i'$: the expected values in (6.27) are all equal to

\[
E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)\, k(X_1,X_2,\beta_0) - k_c(X_1,\beta_0) \big\}
\times \frac{Q(X_2,\beta_0)\, k_c(X_2,\beta_0)}{\bar k(X_2,\beta_0)^2} \big\{ k(X_2,X_3,\beta_0) - \bar k(X_2,\beta_0) \big\} \Big\}. \tag{6.30}
\]

We can show that (6.30) is identically $0$ by conditioning on $X_1$ and $X_2$.

Case 3d $i = 1,\ j = 2,\ i' = 3,\ j' = 2$: the expected values in (6.27) are all equal to

\[
\Sigma_{AB} = E\Big\{ \frac{Q(X_1,\beta_0)}{\bar k(X_1,\beta_0)} \big\{ C(X_2)\, k(X_1,X_2,\beta_0) - k_c(X_1,\beta_0) \big\}
\times \frac{Q(X_3,\beta_0)\, k_c(X_3,\beta_0)}{\bar k(X_3,\beta_0)^2} \big\{ k(X_3,X_2,\beta_0) - \bar k(X_3,\beta_0) \big\} \Big\}. \tag{6.31}
\]

Combining (6.18.a), (6.24) and (6.26), we know that (6.18) converges to $N(0,\Sigma_2)$, where

\[
\Sigma_2 = \mathrm{Var}\{ q_1(Y,\beta_0)\, C(Y) \} + \mathrm{Var}\{ q_2(Y,\beta_0) \} - 2\Sigma_{AB}. \tag{6.32}
\]

As for the asymptotic covariance of (6.17) and (6.18), consider

\[
\frac{1}{n} \sum_{i=1}^n \sum_{i'=1}^n E\Big\{ Q(X_i,\beta_0) \Big[ C(X_i) - \frac{k_c(X_i,\beta_0)}{\bar k(X_i,\beta_0)} \Big] \tag{6.33}
\]
\[
\qquad \times\, Q(X_{i'},\beta_0) \Big[ \frac{(n-1)^{-1}\sum_{j \neq i} C(X_j)\, k(X_{i'},X_j,\beta_0)}{(n-1)^{-1}\sum_{j \neq i} k(X_{i'},X_j,\beta_0)} - \frac{k_c(X_{i'},\beta_0)}{\bar k(X_{i'},\beta_0)} \Big] \Big\}. \tag{6.34}
\]

When $i \neq i'$, the terms in (6.33) and (6.34) are independent, hence the expected value of the product of (6.33) and (6.34) equals the product of their expected values. Furthermore, notice that (6.33) is centered at zero; therefore when $i \neq i'$, the expectation of the product (6.33)--(6.34) is zero. Next, we consider the case $i = i'$.
Since there are n identical terms, we can rewrite the expectation of the (6.33)-(6.34) as { { } − kc(X1, β0)E Q(X1, β0) C{(X1) ∑ k̄(X1, β0) }} n−1 nj∑=6 iC(Xj)k(X1, Xj, β× 0) kc(X1, β0)Q(X1, β0) n− n − 1 j=6 i k(X1, Xj, β0) k̄(X1, β0) 134 ∑ Since by∑strong law of large numbers, as n goes to infinity, n −1 n j=1 C(Xj)k(X1, Xj, β0) and n−1 nj=1 k(X1, Xj, β0) goes to kc(X1, β0) and k̄(X1, β0), respectively, then by dominated convergence theorem, the expected value in the expression above also approach zero as n goes to infinity. Therefore (6.17) and (6.18) are asymptotically √ independent, and nm̄n(X, β0)→ N(0,Σ) with Σ = Σ1 + Σ2.  Proof of Lemma 6.2.1 Let β be on the surface of the ball centered at β0 and with radius n−1/3. Hence for a unit vector u, β = β0 + un −1/3. Next, we give a lower bound for l(β) on the surface of the ball. Let υ ∈ Rp be a unit vector an{d t = ‖t‖υ. Owen in}[32] showed that the (6.35) holds uniformly for values of β ∈ β : ‖β − β0‖ ≤ n−1/3 under the CEE setting, i.e., when the criterion function is given by[m(Xi, β)∑ ] [ ]n −1 n1 1 ∑ t(β) = m(X , β)⊗2 m(X , β) + o(n−1/3i i ), (6.35) n n { i=1 } i=1 uniformly about β ∈ β : ‖β − β ‖ ≤ n−1/30 . It is essential that (6.35) is still true in the EEE setting, therefore we prove it following the steps in Qin and Lawless [33] and Owen [32]. Since we have shown the parallel preceding lemmas in Chapter 4, the steps are not much different than that in [33] or [32]. Let Yi(X tr i,X , β) = t mn(Xi,X , β) and substitute 1 = 1− Yi(Xi,X , β) (6.36) 1 + Yi(Xi,X , β) 1 + Yi(Xi,X , β) 135 Note that ∑n1 mn(Xi,X , β) = 0 (6.37) n 1 + ttrm (X ,X, β) i=1 n i can be re-written as 1 ∑n mn(Xi,X , β) = 0 (6.38) n 1 + Yi(Xi,X , β)i=1 Multiply (6.38) by υ, then plug i(n (6.36), then we get∑n )1 Yi(Xi,X , β) 0 = υtrmn(Xi,X , β) 1− n ∑ ∑1 + Yi(Xi,X , β)i=1n n1 υtrtr − mn(Xi,X , β)m trn(Xi,X , β) t= υ mn(Xi,X , β) . 
(6.39) n 1 + Yi(Xi,X , β)i=1 i=1 Define n 1 ∑ m (X ,X, β)⊗2n i S̃(X, β) = n∑ 1 + Yi(Xi,X , β)i=1 (6.40)n1 S(X, β) = m (X ,X, β)⊗2n i , n i=1 Plug (6.40) into (6.39), yielding υtrm̄n(X, β) = ‖t‖υtrS̃(X, β)υ (6.41) By the definition of S̃ in (6.40), 1 ∑n≥ mn(X ,X, β)⊗2i 1S̃(X, β) = S(X, β) · , n 1 + max Y i=1 i i (Xi,X , β) 1 + maxi Yi(Xi,X , β) implying S(X, β) ≤ S̃(X, β)(1 + maxYi(Xi,X , β)), i where the notation that A ≥ B for matrix A and B means that A − B is positive definite. We know that . ‖t‖υtrS(X, β)υ ≤ ‖t‖υtrS̃(X, β)υ(1 + maxYi(Xi,X , β)). (6.42) i 136 Let Z∗n(X, β) = maxi ‖mn(Xi,X , β)‖ in (4.7), so ‖t‖υtrS̃(X, β)υ(1 + maxYi(Xi,X , β)) ≤ ‖t‖υtrS̃(X, β)υ(1 + ‖t‖Z∗n(X, β)) (6.43) i Now, by (6.41), ‖t‖υtrS̃(X, β)υ on the right hand side of (6.43) can be substituted by υtrm̄n(X, β), hence ‖t‖υtrS̃(X, β)υ(1 + maxYi(Xi,X , β)) ≤ υtrm̄n(X, β)(1 + ‖t‖Z∗n(X, β)), (6.44) i where ∑n1 m̄n(X, β) = mn(Xi,X , β). (6.45) n i=1 Combining (6.42) and (6.44), we attain ‖t‖υtrS(X, β)υ ≤ υtrm̄n(X, β)(1 + ‖t‖Z∗n(X, β)), (6.46) which result in ‖t‖(υtrS(X, β)υ − Z∗n(X, β)υtrm̄n(X, β)) ≤ υtrm̄n(X, β). (6.47) Note that we assume t = ‖t‖υ and υ is a unit vector. By Proposition 4.2.1 and Lemma 4.2.2, we know that Z∗(X, β)υtrm̄ (X, β) = o(n1/2)O(nδ−1/2) = o(nδn n ). (6.48) Plug (6.48) into (6.47), yielding ‖t‖(υtrS(X, β)υ + o(nδ)) ≤ O(nδ−1/2) (6.49) Since υtrSυ is bounded by the minimum and maximum eigen value of V ar(mn(Xi,X , β)), (6.49) indicates that ‖t‖ = O(n−1/2). (6.50) 137 From (6.38) we can see that ∑n ( )1 2 0 = mn(Xi,X , β) 1− Yi(Xi,X , β) Yi(Xi,X , β) + n ∑ 1 + Yi(Xi,X , β)i=1 n − 1 mn(Xi,X , β)= m̄n(X, β) S(X, β)t+ Yi(X 2i,X , β) . (6.51) n 1 + Y i=1 i (Xi,X , β) Now, let us discuss the norm of the last term in (6.51). Recall that Yi(Xi,X , β) = ttr∥ mn(Xi, ,β), therefore∥∥∥ ∥ ∥ ∥n n∥ 1 ∑ mn(Xi,X , β) ∥ ∥ ∑2∥ ∥ 1 mn(Xi,X , β) ( ) ∥Y tr 2∥i(Xi,X , β) ∥ = ∥ t mn(Xi, ,β) ∥n 1 + Yi(Xi,X , β) ∥ ∥n∑ 1 + Yi(Xi,X , β) ∥i=1 i=1n ≤ 1 ‖ 1mn(Xi,X , β0)‖3‖t‖2 . 
Plugging $Z_n^*(\mathcal X,\beta) = \max_i \|m_n(X_i,\mathcal X,\beta)\|$ into (6.52) yields
$$ \frac1n \sum_{i=1}^n \|m_n(X_i,\mathcal X,\beta_0)\|^3\, \|t\|^2\, \frac{1}{1+Y_i} \le Z_n^*(\mathcal X,\beta) \cdot \frac1n \sum_{i=1}^n \|m_n(X_i,\mathcal X,\beta)\|^2 \cdot \|t\|^2 \cdot \frac{1}{1+Y_i}. \qquad (6.53) $$
Let $\delta$ be a number between 0 and 1/2. By Proposition 4.2.1, Lemma 4.2.2 and (6.50), we know that
$$ \frac1n \sum_{i=1}^n \|m_n(X_i,\mathcal X,\beta_0)\|^3\, \|t\|^2\, \frac{1}{1+Y_i} \le o(n^{1/2})\,O(n^{\delta})\,O(n^{-1}) = o(n^{-(1/2-\delta)}). \qquad (6.54) $$
Equations (6.52)-(6.54) imply that the norm of the last term in (6.51) is of order $n^{-(1/2-\delta)}$; therefore from (6.51) we know that
$$ t = S^{-1}(\mathcal X,\beta)\,\bar m_n(\mathcal X,\beta) + \gamma, \quad \text{where } \gamma = o(n^{-1/3}). \qquad (6.55) $$
Let $l(\beta)$ be the negative logarithm of the profile empirical likelihood function,
$$ l(\beta) = \sum_{i=1}^n \ln\bigl\{1 + t^{\mathrm{tr}}(\beta) m_n(X_i,\mathcal X,\beta)\bigr\} = \sum_{i=1}^n t^{\mathrm{tr}}(\beta) m_n(X_i,\mathcal X,\beta) - \frac12 \sum_{i=1}^n \bigl[t^{\mathrm{tr}}(\beta) m_n(X_i,\mathcal X,\beta)\bigr]^2 + o(n^{1/3}) \quad \text{a.s.} \qquad (6.56) $$
By (6.55), we can rewrite the right-hand side of (6.56) as
$$ \frac n2 \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)\right]^{\mathrm{tr}} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)^{\otimes 2}\right]^{-1} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)\right] + o(n^{1/3}). \qquad (6.57) $$
Taking a Taylor expansion of (6.57) around the true parameter value $\beta_0$, we obtain
$$ \frac n2 \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0) + \frac1n \sum_{i=1}^n \frac{\partial m_n(X_i,\mathcal X,\beta_0)}{\partial\beta}\, u n^{-1/3}\right]^{\mathrm{tr}} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)^{\otimes 2}\right]^{-1} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0) + \frac1n \sum_{i=1}^n \frac{\partial m_n(X_i,\mathcal X,\beta_0)}{\partial\beta}\, u n^{-1/3}\right] + o(n^{1/3}). \qquad (6.58) $$
Combining (6.56)-(6.58), by Lemma 4.2.2 we know that
$$ l(\beta) = \frac n2 \left[O(n^{-1/2+\delta}) + E\left(\frac{\partial m(X,Y,\beta_0)}{\partial\beta}\right) u n^{-1/3}\right]^{\mathrm{tr}} \Sigma_1^{-1} \left[O(n^{-1/2+\delta}) + E\left(\frac{\partial m(X,Y,\beta_0)}{\partial\beta}\right) u n^{-1/3}\right] + o(n^{1/3}) \ge (c-\varepsilon)\, n^{1/3} \quad \text{a.s.}, $$
where $0 < \delta < 1/6$ and
$$ \Sigma_1 = E\left\{ Q(Y_1,\beta_0)\left[C(Y_1) - \frac{E\{C(Y_2)k(Y_1,Y_2,\beta_0)\mid Y_1\}}{E\{k(Y_1,Y_2,\beta_0)\mid Y_1\}}\right] \right\}^{\otimes 2}, \qquad (6.59) $$
with $Y_1, Y_2$ being i.i.d. replicas of $X_1$, by Proposition 5.2.1. Similarly,
$$ l(\beta_0) = \frac n2 \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0)\right]^{\mathrm{tr}} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0)^{\otimes 2}\right]^{-1} \left[\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0)\right] = O(n^{\delta}) \quad \text{a.s.} $$
Since $l(\beta)$ is continuous for $\beta \in \{\beta : \|\beta-\beta_0\| \le n^{-1/3}\}$, there exists a minimum of $l(\beta)$ in the interior of the ball, and $\hat\beta_n$ satisfies
$$ \left.\frac{\partial l(\beta)}{\partial\beta}\right|_{\beta=\hat\beta_n} = \left.\sum_{i=1}^n \frac{(\partial t^{\mathrm{tr}}(\beta)/\partial\beta)\, m_n(X_i,\mathcal X,\beta) + t^{\mathrm{tr}}(\beta)\,(\partial m_n(X_i,\mathcal X,\beta)/\partial\beta)}{1 + t^{\mathrm{tr}}(\beta)\, m_n(X_i,\mathcal X,\beta)}\right|_{\beta=\hat\beta_n} = \left.\sum_{i=1}^n \frac{1}{1 + t^{\mathrm{tr}}(\beta)\, m_n(X_i,\mathcal X,\beta)} \left(\frac{\partial m_n(X_i,\mathcal X,\beta)}{\partial\beta}\right)^{\mathrm{tr}} t(\beta)\right|_{\beta=\hat\beta_n} = 0. \quad \Box $$

Proof of Theorem 6.2.2. Taking the derivatives of $Q_{1n}$ and $Q_{2n}$ in Lemma 6.2.1 with respect to $\beta$ and $t$, we get
$$ \frac{\partial Q_{1n}(\beta,0)}{\partial\beta} = \frac1n \sum_{i=1}^n \frac{\partial m_n(X_i,\mathcal X,\beta)}{\partial\beta}, \qquad \frac{\partial Q_{1n}(\beta,0)}{\partial t^{\mathrm{tr}}} = -\frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta)^{\otimes 2}, $$
$$ \frac{\partial Q_{2n}(\beta,0)}{\partial\beta} = 0, \qquad \frac{\partial Q_{2n}(\beta,0)}{\partial t^{\mathrm{tr}}} = \frac1n \sum_{i=1}^n \left(\frac{\partial m_n(X_i,\mathcal X,\beta)}{\partial\beta}\right)^{\mathrm{tr}}. $$
Then take the Taylor expansions of $Q_{1n}(\hat\beta_n,\hat t_n)$ and $Q_{2n}(\hat\beta_n,\hat t_n)$ at $(\beta_0,0)$:
$$ 0 = Q_{1n}(\hat\beta_n,\hat t_n) = Q_{1n}(\beta_0,0) + \frac{\partial Q_{1n}(\beta_0,0)}{\partial\beta}(\hat\beta_n-\beta_0) + \frac{\partial Q_{1n}(\beta_0,0)}{\partial t^{\mathrm{tr}}}(\hat t_n - 0) + o_P(\delta_n), $$
$$ 0 = Q_{2n}(\hat\beta_n,\hat t_n) = Q_{2n}(\beta_0,0) + \frac{\partial Q_{2n}(\beta_0,0)}{\partial\beta}(\hat\beta_n-\beta_0) + \frac{\partial Q_{2n}(\beta_0,0)}{\partial t^{\mathrm{tr}}}(\hat t_n - 0) + o_P(\delta_n), $$
where $\delta_n = \|\hat\beta_n-\beta_0\| + \|\hat t_n\|$. Hence
$$ \begin{pmatrix} \hat t_n \\ \hat\beta_n - \beta_0 \end{pmatrix} = \hat S^{-1} \begin{pmatrix} -Q_{1n}(\beta_0,0) + o_P(\delta_n) \\ o_P(\delta_n) \end{pmatrix}, $$
where
$$ \hat S = \begin{pmatrix} \partial Q_{1n}/\partial t^{\mathrm{tr}} & \partial Q_{1n}/\partial\beta \\ \partial Q_{2n}/\partial t^{\mathrm{tr}} & 0 \end{pmatrix}_{(\beta=\beta_0,\,t=0)} \xrightarrow{P} \begin{pmatrix} \hat S_{11} & \hat S_{12} \\ \hat S_{21} & 0 \end{pmatrix} = \begin{pmatrix} -E\{m_n m_n^{\mathrm{tr}}\} & E\{\partial m_n/\partial\beta\} \\ E\{\partial m_n/\partial\beta\}^{\mathrm{tr}} & 0 \end{pmatrix}, \qquad (6.60) $$
where in (6.60), $m_n = m_n(X_i,\mathcal X,\beta_0)$. Next,
$$ Q_{1n}(\beta_0,0) = \frac1n \sum_{i=1}^n m_n(X_i,\mathcal X,\beta_0) = O_P(n^{-1/2}) $$
implies that $\delta_n = O_P(n^{-1/2})$. Therefore
$$ \sqrt n(\hat\beta_n - \beta_0) = \hat S_{22.1}^{-1}\, \hat S_{21}\, \hat S_{11}^{-1}\, \sqrt n\, Q_{1n}(\beta_0,0) + o_P(1), $$
where
$$ S_{22.1} = E\left\{\frac{\partial m_n}{\partial\beta}\right\}^{\mathrm{tr}} \bigl(E\, m_n m_n^{\mathrm{tr}}\bigr)^{-1} E\left\{\frac{\partial m_n}{\partial\beta}\right\}. $$
Furthermore, by Lemma 6.2.1, $\sqrt n\, Q_{1n}(\beta_0,0) \to N(0,\tilde\Sigma)$, hence
$$ \sqrt n(\hat\beta_n - \beta_0) \to N(0, V), \quad \text{where } V = \hat S_{22.1}^{-1}\, \hat S_{21}\, \hat S_{11}^{-1}\, \tilde\Sigma\, \bigl\{\hat S_{22.1}^{-1}\, \hat S_{21}\, \hat S_{11}^{-1}\bigr\}^{\mathrm{tr}}. \qquad (6.61) \quad \Box $$

Chapter 7: Computational Results

This chapter presents simulation results under the right-censored Accelerated Failure Time (AFT) model, using the Extended Estimating Equation (EEE) and the Profile Empirical Likelihood (pEL), implemented in R (3.4.1).
Let $T$ be the lifetime, and consider the model
$$ Y = \beta_0^{\mathrm{tr}} Z + \varepsilon, \qquad (7.1) $$
where $Y = \ln(T)$, $\beta_0 \in \mathbb R^p$ is the structural parameter, $Z$ is the $p$-dimensional covariate, and $\varepsilon$ is the error term centered at zero with unknown hazard rate function. We assume that $T$ may be subject to right censoring by $C$ with unknown distribution function; therefore the data we actually observe are triplets $(V, \Delta, Z)$, where
$$ V = \min(\ln(T), \ln(C)), \qquad \Delta = I\{T \le C\}. \qquad (7.2) $$
In previous chapters, we proposed two ways to construct an estimator for $\beta_0$: $\tilde\beta_n$, obtained by solving the Extended Estimating Equation (EEE), and $\hat\beta_n$, obtained by minimizing the negative logarithm of the pEL, i.e.,
$$ \text{EEE:} \quad \tilde\beta_n \text{ such that } S_n(\tilde\beta_n) = \sum_{i=1}^n m_n(X_i,\mathcal X,\tilde\beta_n) = 0; \qquad (7.3) $$
$$ \text{pEL:} \quad \hat\beta_n = \arg\min_\beta l(\beta), \quad \text{where } l(\beta) = \sum_{i=1}^n \ln\bigl(1 + t^{\mathrm{tr}}(\beta)\, m_n(X_i,\mathcal X,\beta)\bigr), \qquad (7.4) $$
where $t(\beta) = t(\beta,\mathcal X)$ is the solution to the gradient of the Lagrangian function defined in (5.50), and under model assumption (7.1),
$$ m_n(X_i,\mathcal X,\beta) = \Delta_i \left( Z_i - \frac{\sum_{j=1}^n Z_j\, I\{V_j - \beta^{\mathrm{tr}} Z_j \ge V_i - \beta^{\mathrm{tr}} Z_i\}}{\sum_{j=1}^n I\{V_j - \beta^{\mathrm{tr}} Z_j \ge V_i - \beta^{\mathrm{tr}} Z_i\}} \right). \qquad (7.5) $$
The purpose of this chapter is to provide algorithms to numerically compute $\tilde\beta_n$ and $\hat\beta_n$. Since both $S_n(\beta)$ and $l(\beta)$ involve indicator functions, the usual root-finding and optimization methods may not be directly applicable. We then compare (7.3) and (7.4) with respect to the asymptotic behavior of $\tilde\beta_n$ and $\hat\beta_n$ under different parameter settings, the time-efficiency of the two methods, and the local continuity of $S_n(\beta)$ and $l(\beta)$. We also compare the empirical variance-covariance matrices of $\tilde\beta_n$ and $\hat\beta_n$ with the corresponding theoretical ones, and with each other.

7.1 Description of the Algorithm Associated with the EEE

In this section, we outline the algorithms for computing $\hat\beta_n$ and $\tilde\beta_n$. To summarize, we compute $\hat\beta_n$ in two steps: first, construct a function $t(\beta)$ that expresses the Lagrange multiplier in terms of $\beta$; second, calculate $\hat\beta_n$ by maximizing the pEL.
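The AFT criterion function (7.5) can be computed directly from the observed triplets. The dissertation's simulations are written in R; the following is a minimal Python sketch under the same notation, with `m_n` and `S_n` as hypothetical helper names.

```python
import numpy as np

def m_n(i, V, Delta, Z, beta):
    """Criterion-function summand m_n(X_i, X, beta) from (7.5).

    V, Delta, Z hold the observed triplets (V_j, Delta_j, Z_j); Z has
    shape (n, p).  On the residual scale e_j = V_j - beta' Z_j, the
    summand compares Z_i with the average of Z_j over the at-risk set."""
    e = V - Z @ beta                     # residuals on the log-time scale
    at_risk = (e >= e[i])                # indicators I{e_j >= e_i}
    num = (Z * at_risk[:, None]).sum(axis=0)
    den = at_risk.sum()                  # >= 1, since subject i is at risk of itself
    return Delta[i] * (Z[i] - num / den)

def S_n(V, Delta, Z, beta):
    """EEE criterion S_n(beta) = sum_i m_n(X_i, X, beta), cf. (7.3)."""
    return sum(m_n(i, V, Delta, Z, beta) for i in range(len(V)))
```

Note that a censored observation ($\Delta_i = 0$) contributes a zero summand, but still enters the at-risk sets of the uncensored observations.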
We compute $\tilde\beta_n$ by minimizing the Euclidean norm of $S_n(\beta)$ because, due to the discontinuity of $S_n(\beta)$, an exact root may not exist.

7.1.1 Algorithm for Computing the pEL Estimator

Given the AFT model described in (7.1), for fixed $\beta$, to find the probability vector that maximizes the empirical likelihood, we solve the following problem:
$$ \arg\max_p \prod_{i=1}^n p_i, \quad p = (p_1,\dots,p_n), \quad \text{subject to } \sum_{i=1}^n p_i = 1, \; p_i \in (0,1), \; \sum_{i=1}^n p_i\, m_n(X_i,\mathcal X,\beta) = 0. \qquad (7.6) $$
We proved in Chapter 5 that with probability approaching 1, there exists a neighbourhood of $\beta_0$ in which (7.6) has a unique solution given by
$$ \hat p_i = \frac1n \cdot \frac{1}{1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)}, \qquad (7.7) $$
where $t$ is the solution to
$$ \sum_{i=1}^n \frac{m_n(X_i,\mathcal X,\beta)}{1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)} = 0. \qquad (7.8) $$
Therefore the algorithm to compute $\hat\beta_n$ has two steps. First, write $t$ as a function of $\beta$ and $\mathcal X$ according to (7.8); then, combining (7.1) and (7.7), $\hat\beta_n$ is given by
$$ \hat\beta_n = \arg\min_\beta \sum_{i=1}^n \ln\bigl(1 + t(\beta)^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr). \qquad (7.9) $$
In the following paragraphs, we explain how these two steps are performed in R (3.4.1).

Step 1: Construct $t(\beta)$. To construct $t(\beta)$ we convert the root-finding problem (7.8) into an optimization problem. For any fixed $\beta$, let
$$ t^* = \arg\max_t f(t), \quad \text{where } f(t) = \sum_{i=1}^n \ln\bigl(1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr). \qquad (7.10) $$
Since we seek an interior maximum of (7.10), the solution $t^*$ must satisfy
$$ \nabla_t f(t^*) = 0, \quad \text{where } \nabla_t f(t) = \sum_{i=1}^n \frac{m_n(X_i,\mathcal X,\beta)}{1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)}. \qquad (7.11) $$
In other words, for any fixed $\beta$, we can construct $t(\beta)$ by maximizing (7.10) using the nlm function built into R. For any fixed $\beta$ and $\mathcal X$, it is easy to verify that $\nabla_t^{\otimes 2} f(t)$ is always negative definite, so the solution of (7.11), if it exists, is guaranteed to be a maximum. Since $t_{\text{initial}} = (0,\dots,0)$ always provides a legitimate probability vector in (7.7), we use $t_{\text{initial}}$ as the initial value for nlm in this step. We also need to pay attention to the domain of $f(t)$, denoted by $D_f$.
For the univariate case, it is easy to show that
$$ D_f = \Bigl( -\bigl(\max_i m_n(X_i,\mathcal X,\beta)\bigr)^{-1},\; -\bigl(\min_i m_n(X_i,\mathcal X,\beta)\bigr)^{-1} \Bigr). $$
However, for vector-valued $\beta$, the analytical form of $D_f$ is no longer simple. When $1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)$ has negative components, $f(t)$ is no longer well-defined. Therefore we replace $\ln(\cdot)$ by a monotone function $h(\cdot)$ that is defined everywhere but extremely negative at infeasible values of $t$. This penalty function should guarantee that iterative root-finding methods like Newton-Raphson converge. Conditions on $h(z)$ include:

1. $h(z) = \ln(z)$ for any $z > \varepsilon$, where $\varepsilon$ is a positive constant close to 0;
2. For any $z_1 \in \mathbb R^- \cup \{0\}$ and $z_2 \in \mathbb R^+$, $h(z_2) > h(z_1)$. This way, replacing $\ln(\cdot)$ by the penalty function does not change the solution of the maximization problem (7.10);
3. $h(z)$ is continuous and differentiable for $z \in \mathbb R$;
4. When seeking a root with Newton-Raphson, if in the $k$th step $z_k < 0$, then $h'(z_k)$ should point in the direction such that $z_{k+1} > 0$. In other words, if $z < 0$, then $h'(z)$ should always guarantee $z - h(z)/h'(z) > 0$.

A feasible choice of such a function is
$$ h(z) = \begin{cases} \ln(z) & \text{when } z > \varepsilon, \\[2pt] \ln(\varepsilon) + \dfrac{z - \varepsilon}{\varepsilon^2} & \text{when } z \le \varepsilon. \end{cases} \qquad (7.12) $$
Using $h(z)$, for any fixed $\beta$, we define $t = t(\beta)$ as
$$ t(\beta) \equiv \arg\max_t f_h(t), \quad \text{where } f_h(t) = \sum_{i=1}^n h\bigl(1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr). \qquad (7.13) $$

Step 2: Estimate $\beta_0$. After writing the Lagrange multiplier $t$ as a function of $\beta$, we can compute $\hat\beta_n$ by minimizing the negative logarithm of the pEL, namely $\hat\beta_n = \arg\min_\beta l(\beta)$, where
$$ l(\beta) = \sum_{i=1}^n \ln\bigl(1 + t^{\mathrm{tr}}(\beta)\, m_n(X_i,\mathcal X,\beta)\bigr). \qquad (7.14) $$
Quasi-Newton methods do not work for (7.14) under the AFT model assumption because of the indicator functions in $m_n(X_i,\mathcal X,\beta)$. Therefore we use the default Nelder-Mead method in the optim function of R, a derivative-free simplex method that can be applied to nonlinear optimization problems for which derivatives may not exist.
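Step 1 can be sketched as follows. The dissertation uses R's nlm; the sketch below uses Python with scipy's Nelder-Mead, and the particular value of the threshold $\varepsilon$ is an assumption for illustration.

```python
import numpy as np
from scipy.optimize import minimize

EPS = 1e-4  # threshold epsilon in (7.12); this particular value is an assumption

def h(z):
    """Penalised logarithm (7.12): ln(z) above EPS, linear continuation below."""
    z = np.asarray(z, dtype=float)
    safe = np.maximum(z, EPS)            # avoid taking log of non-positive values
    return np.where(z > EPS, np.log(safe), np.log(EPS) + (z - EPS) / EPS**2)

def t_of_beta(M):
    """Lagrange multiplier t(beta) of (7.13): maximise f_h(t) = sum_i h(1 + t'm_i),
    where the rows of M are the summands m_n(X_i, X, beta) at the fixed beta."""
    neg_f = lambda t: -h(1.0 + M @ t).sum()
    res = minimize(neg_f, np.zeros(M.shape[1]), method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-12})
    return res.x
```

Starting from $t = 0$ keeps the initial point feasible, and the steep linear branch of $h$ below $\varepsilon$ penalises any step that would make some $1 + t^{\mathrm{tr}} m_i$ non-positive.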
We use the least-squares estimator on the uncensored data, denoted by $b_0$, as the initial value, where
$$ \text{Uncensored index set: } R = \{i : \Delta_i = 1\}, \text{ with } \#R = u; \quad \text{Vector of uncensored lifetimes: } O = (V_k;\, k \in R) \in \mathbb R^u; \quad \text{Matrix of covariates with uncensored observations: } H = (Z_k : k \in R) \in \mathbb R^{u \times p}; \qquad (7.15) $$
then $b_0 = (H^{\mathrm{tr}} H)^{-1} H^{\mathrm{tr}} O$, and $\hat\beta_n$ is computed by optim(b0, l)$par. This choice of $b_0$ is good for the uncensored AFT model, but in the right-censored case a more reasonable initial choice is the estimator of Koul, Susarla and van Ryzin in [25].

Restarting Improvement. Ideally, with a reasonable initial value, the Nelder-Mead method should solve the minimization problem with a single application of optim. In practice, however, the optimization with respect to $\beta$ turns out to require multiple restarts. More specifically, let $b_0$ be the least-squares estimator calculated on the set $R$, and
$$ b_{r+1} = \arg\min_\beta l(\beta) \text{ with initial value } b_r, \quad r = 0, 1, \dots \qquad (7.16) $$
Let $\|\cdot\|_1$ be the $L_1$ norm on vectors. The sequence of restarts stops when $\|b_r - b_{r+1}\|_1 \le 0.01$. Ideally, the sequence (7.16) should stop at $r = 0$. However, we found that this ideal case happens only when the initial value is good enough, notably when the data are uncensored, or when the Koul-Susarla-van Ryzin estimator in [25] is used as $b_0$ for right-censored data. On the contrary, when the simple least-squares estimator is far from $\beta_0$, the restarting improvement is necessary. For example, when
$$ Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim \ln(F_{3,5}), \; Z_4 \sim \ln(\mathrm{Beta}(5,3)), \; \varepsilon = \mathrm{Weibull}(1)-1, \; C = \mathrm{Exp}(5.8), \qquad (7.17) $$
for sample size $n = 200$ and batches of simulations of size $m = 1000$, the numbers of iterations $r$ are 1, 2 and 3, with frequencies 780, 209 and 11.

7.1.2 Algorithm for Computing the Z-Estimator

As discussed in previous chapters, we can construct an estimator for $\beta_0$ by solving the extended estimating equation $S_n(\beta) = 0$.
Due to the discontinuity of $S_n(\beta)$, instead of directly solving the equation, we define
$$ \tilde\beta_n \equiv \arg\min_\beta \|S_n(\beta)\|, \qquad (7.18) $$
where $\|\cdot\|$ denotes the Euclidean norm. Again, we minimize using optim with initial value $b_0$, i.e., the least-squares estimator on the set $R$. As with the computation of $\hat\beta_n$, the Nelder-Mead method is not guaranteed to solve the minimization problem (7.18) in one step, so we use the same restarting strategy described in Section 7.1.1. In the following section, we show that the negative profile log-likelihood function $l(\beta)$ is much smoother than $S_n(\beta)$; therefore we should expect more restarts in the calculation of $\tilde\beta_n$ than of $\hat\beta_n$. This conjecture will be verified numerically in Section 7.4.

7.2 Local Continuity of EEE and pEL

In this section, we compare the local continuity of the EEE $S_n(\beta)$ with that of the negative logarithm of the pEL function $l(\beta)$. From the plots of $l(\beta)$ and $S_n(\beta)$, we will see clearly that the pEL is much smoother than the EEE. At the end of this section, we define a quantity tot.dif that measures the continuity pattern in a neighborhood of $\hat\beta_n$, $\beta_{LS}$, and $\beta_0$. From the histograms of tot.dif, we will see that the continuity behaviors in the neighborhoods of $\hat\beta_n$ and $\beta_0$ are very similar.

7.2.1 Plots of Projections of EEE and pEL

Since both $S_n(\beta)$ and $l(\beta)$ are defined on $\mathbb R^p$, it is not easy to plot them directly. Therefore we plot one "slice" at a time, namely, we project $S_n(\beta)$ and $l(\beta)$ with respect to $\beta_i$, where in general, for $v \in \mathbb R^p$, $v_i$ is the $i$th component of $v$. More specifically, for the EEE
$$ S_n(\beta) = \sum_{i=1}^n m_n(X_i,\mathcal X,\beta), \quad \beta \in U_{\beta_0}, \qquad (7.19) $$
consider the function
$$ d_S(s; \beta, i, j) = S_n(\beta + s\, e_j) \cdot e_i, \qquad (7.20) $$
where $\beta \in U_{\beta_0}$, $s \in (-\delta, \delta)$ with $\delta$ a small positive number, $i, j = 1, \dots, p$, and $e_i$ is the $i$th column of the $p \times p$ identity matrix. The function $d_S(s; \beta, i, j)$ allows $\beta$ to change only in the direction of $e_j$, and records the value of the $i$th component of $S_n(\beta + s\, e_j)$ as output.
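The slice function $d_S$ in (7.20) is straightforward to implement. The dissertation's plots were produced in R; a minimal Python sketch, with `dS` and `slice_curve` as hypothetical helper names and 0-based indices:

```python
import numpy as np

def dS(s, beta, i, j, S_n):
    """Projection d_S(s; beta, i, j) = e_i' S_n(beta + s e_j) from (7.20).
    S_n is any callable returning the p-vector criterion function."""
    e_j = np.zeros(len(beta))
    e_j[j] = 1.0
    return S_n(beta + s * e_j)[i]

def slice_curve(beta, i, j, S_n, delta=0.02, num=101):
    """Evaluate one slice on a grid over (-delta, delta), as plotted in Figure 7.1."""
    grid = np.linspace(-delta, delta, num)
    return grid, np.array([dS(s, beta, i, j, S_n) for s in grid])
```

For a linear criterion $S_n(\beta) = A\beta$ the slice is exactly the affine function $s \mapsto (A\beta)_i + A_{ij}\, s$, which is the benchmark against which the jumps of the AFT criterion stand out.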
As for the negative log empirical likelihood function, consider
$$ d_l(s; \beta, u, i) = l(\beta + s \cdot u^i), \quad i = 1, \dots, p, \qquad (7.21) $$
where $\beta \in U_{\beta_0}$, $s \in (-\delta, \delta)$ for a small positive number $\delta$, and $u^i$ is a randomly generated unit direction. In the simulation, we use rnorm(p) to generate each direction, then divide it by its Euclidean norm.

We run the simulation under two parameter settings with $\beta_0 = (4, 3, 2, 1)$: normally distributed covariates in Set 1, and non-normally distributed covariates in Set 2,
$$ \text{Set 1:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim N(3,25), \; Z_4 \sim N(5,9), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.015), \qquad (7.22) $$
and
$$ \text{Set 2:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim F_{3,5}, \; Z_4 \sim \mathrm{Beta}(5,3), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.08), \qquad (7.23) $$
with $n = 1000$ for $m = 1$ replica. For Set 1 and Set 2, the censoring rates are 18.6% and 33.2%. $\hat\beta_n$ and $\tilde\beta_n$ in both sets are identical to the second decimal place, and are $(4.01, 3.00, 1.99, 0.99)$ and $(3.86, 3.09, 2.05, 0.80)$, respectively. We plot $d_S$ for both Set 1 and Set 2. As we can see from Figures 7.1 and 7.2, even for a large sample size $n = 1000$, within a small neighbourhood of the true $\beta_0$ the extended estimating equation has many jumps.

Figure 7.1: Plot of $d_S(s; \beta_0, i, j)$, normal covariates. Figure 7.1 shows a plot of $d_S(s; \beta_0, i, j) = S_n(\beta_0 + s\, e_j)^{\mathrm{tr}} e_i$ for $s \in (-0.02, 0.02)$, where $i, j = 1, \dots, p$ index horizontal and vertical panels. Horizontal lines indicate level 0. It shows that the function is linear and smooth only on the diagonal. The magnitude of the $y$-coordinates indicates that $S_n(\beta)$ is close to a diagonal matrix times $\beta$.
Figure 7.2: Plot of $d_S(s; \beta_0, i, j)$, non-normal covariates. Figure 7.2 shows a plot of $d_S(s; \beta_0, i, j) = S_n(\beta_0 + s\, e_j)^{\mathrm{tr}} e_i$ for $s \in (-0.4, 0.4)$, where $i, j = 1, \dots, p$ index horizontal and vertical panels. Horizontal lines indicate level 0. It shows that the function is linear and smooth only on the diagonal. The magnitude of the $y$-coordinates indicates that $S_n(\beta)$ is close to a diagonal matrix times $\beta$.

Then we check the smoothness of the negative profile log-likelihood $l(\beta)$ for $\beta$ in a neighbourhood of $\hat\beta_n$ by plotting $d_l(s; \hat\beta_n, u, i)$. Figures 7.3 and 7.4 are calculated under Set 1 (normal covariates) and Set 2 (non-normal covariates), respectively. The solid vertical line in each panel denotes $\hat\beta_n$, and the dotted vertical line denotes the true $\beta_0$. As we can see from Figures 7.3 and 7.4, the distance between the estimated value and the true parameter value is small, the profile likelihood function is very smooth within the neighborhood of $\hat\beta_n$, and $\hat\beta_n$ is the minimizer of $l(\beta)$ in each randomly generated direction for both normal and non-normal covariates.
Figure 7.3: Plot of $d_l(s; \hat\beta_n, u, i)$, normal covariates. Figure 7.3 shows a plot of $d_l(s; \hat\beta_n, u, i) = l(\hat\beta_n + s\, u^i)$ for $s \in (-0.02, 0.02)$, where $i = 1, \dots, p$ indexes the panels. The solid vertical line in each panel denotes $\hat\beta_n$, and the dotted vertical line denotes the true $\beta_0$. The figure indicates that in any random direction, $l(\beta)$ is smooth and has a parabolic form, and $\hat\beta_n$ is the minimizer of $l(\beta)$.

Figure 7.4: Plot of $d_l(s; \hat\beta_n, u, i)$, non-normal covariates. Figure 7.4 shows a plot of $d_l(s; \hat\beta_n, u, i) = l(\hat\beta_n + s\, u^i)$ for $s \in (-0.4, 0.4)$, where $i = 1, \dots, p$ indexes the panels. The solid vertical line in each panel denotes $\hat\beta_n$, and the dotted vertical line denotes the true $\beta_0$. The figure indicates that in any random direction, $l(\beta)$ is smooth and has a parabolic form, and $\hat\beta_n$ is the minimizer of $l(\beta)$.
7.2.2 Quantitative Measurement of the Local Continuity of $l(\beta)$

The local continuity of the negative profile log-likelihood function $l(\beta)$ around $\beta_0$ can be quantified using the total difference of the approximated derivative of the function $d_l(s; \beta, i)$, in the following steps:

1. For fixed $\beta \in U_{\beta_0}$ and $s \in (-\delta, \delta)$ with $\delta$ a small positive number, let
$$ d_l(s; \beta, i) = l(\beta + s \cdot e_i), \quad i = 1, \dots, p, \qquad (7.24) $$
where $e_i$ is the $i$th row of the $p \times p$ identity matrix.

2. For $i = 1, \dots, p$, approximate the derivative of $d_l(s)$ using
$$ \mathrm{der}(s; \beta, i) = \frac{1}{2\varepsilon_1}\bigl(d_l(s + \varepsilon_1; \beta, i) - d_l(s - \varepsilon_1; \beta, i)\bigr). \qquad (7.25) $$

3. Generate a grid of points over the interval $(-a, a)$ with grid spacing $\varepsilon_0$. Calculate $\mathrm{der}(s; \beta, i)$ at each grid point. Record the output in a $p \times l$ matrix out, where $l$ is the number of grid points; $\mathrm{out}_{i,j}$ approximates the derivative of $l(\beta)$ in the direction of $\beta_i$ at the $j$th grid point, for $i = 1, \dots, p$ and $j = 1, \dots, l$.

4. For each row of the matrix out from Step 3, calculate the absolute differences between consecutive components,
$$ \mathrm{abs.d}_{i,j} = |\mathrm{out}_{i,j} - \mathrm{out}_{i,j+1}|, \quad i = 1, \dots, p, \; j = 1, \dots, l-1, \qquad (7.26) $$
then record the row sums of these differences, i.e.,
$$ \mathrm{tot.d}_i = \sum_{j=1}^{l-1} \mathrm{abs.d}_{i,j}, \quad i = 1, \dots, p. \qquad (7.27) $$

Remark 7.2.1. We do not have a universal criterion for choosing $a$, $\varepsilon_0$ and $\varepsilon_1$ that is guaranteed to work for every parametric setting. Since we are only interested in the local continuity around a fixed $\beta$, $\varepsilon_0$ should not be too large. The choice of $\varepsilon_1$ depends on $\varepsilon_0$.
In general, the ratio $\varepsilon_1/\varepsilon_0$ should not be too large, in which case the total difference cannot reflect the subtle differences between different choices of $\beta$; on the other hand, if the ratio is too small, then due to the indicator functions, the total difference would be large for any choice of $\beta$. The quantity $\mathrm{tot.d}_i$ should be much smaller for smooth functions than for non-smooth ones. We also expect the total difference around $\hat\beta_{n,m}$ to be similar to the total difference around $\beta_0$, where $m$ denotes the number of simulations. To test these conjectures, we use the same simulated data sets as in (7.22) in Section 7.1. The parameters for the grids and difference quotients are $a = 0.5$, $\varepsilon_0 = 0.01$, $\varepsilon_1 = 0.001$. We ran the simulation $m = 1000$ times with sample size 200. The censoring rate ranges from 22% to 42%, and the simulation means are $(3.954, 2.970, 1.919, 0.974)$.

Figure 7.5: Plot of tot.dif, normal covariates. This figure shows the histograms of the total differences defined in (7.27), where the first row is for $\beta_0$, the second for $\hat\beta_n$, and the third for $\beta_{LS}$. The shapes and ranges of the histograms indicate that $\hat\beta_n$ has a continuity pattern similar to that of $\beta_0$, and that this pattern is not preserved when $\beta$ is far from $\beta_0$, as evidenced by the histograms for $\beta_{LS}$.
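Steps 1-4 of the total-difference computation (7.24)-(7.27) can be sketched as follows; `total_difference` is a hypothetical helper name, and $l$ may be any callable from $\mathbb R^p$ to $\mathbb R$ (the dissertation applies it to the negative profile log-likelihood computed in R).

```python
import numpy as np

def total_difference(l, beta, a=0.5, eps0=0.01, eps1=0.001):
    """tot.d_i of (7.27): summed absolute jumps of the central difference
    quotient (7.25) of s -> l(beta + s e_i), over a grid of spacing eps0
    on (-a, a).  Smooth l gives small values; jumpy l gives large ones."""
    p = len(beta)
    grid = np.arange(-a, a + eps0 / 2, eps0)   # Step 3: grid over (-a, a)
    tot = np.zeros(p)
    for i in range(p):
        e_i = np.zeros(p)
        e_i[i] = 1.0
        der = np.array([(l(beta + (s + eps1) * e_i) - l(beta + (s - eps1) * e_i))
                        / (2 * eps1) for s in grid])   # Step 2: quotient (7.25)
        tot[i] = np.abs(np.diff(der)).sum()            # Step 4: (7.26)-(7.27)
    return tot
```

For a smooth quadratic such as $l(b) = \|b\|^2$, the difference quotient along each axis is exactly linear in $s$, so $\mathrm{tot.d}_i$ reduces to the total variation $2\varepsilon_0 (l-1)$ of that linear derivative over the grid.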
7.3 Convergence of the Estimator

Since $t^* = t(\beta)$ is the maximizer of $f_h(t) = \sum_{i=1}^n h\bigl(1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr)$ for each fixed $\beta$, the gradient $\nabla_t f_h(t)$ at $t^*$ must be zero. Define
$$ g(t^*, \beta) = \sum_{i=1}^n \frac{m_n(X_i,\mathcal X,\beta)}{1 + t^{*\,\mathrm{tr}} m_n(X_i,\mathcal X,\beta)}, \quad \text{where } t^* = t(\beta). \qquad (7.28) $$
Therefore at $\hat\beta_n$, the gradient $g(t(\hat\beta_n), \hat\beta_n)$ should be close to zero regardless of the parametric setting or sample size. To test this claim, we ran experiments under $\beta_0 = (1, 2, 3, 4)$ and
$$ \text{Set 3:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim \ln(\Gamma(3)), \; Z_3 \sim \ln(F_{3,5}), \; Z_4 \sim \ln(\mathrm{Beta}(5,3)), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.5), \qquad (7.29) $$
with sample sizes $n = 30, 50$ and 100 for $m = 100$ simulations. We found that the gradient is always bounded by a very small number, of magnitude 1e-5.

7.4 Asymptotic Normality of the Z-Estimator and pEL Estimator

In this section, we compare $\tilde\beta_{n,m}$, the solution to the EEE, and $\hat\beta_{n,m}$, the minimizer of the negative logarithm of the pEL. Under three different parametric settings, we compare the centers of $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$ as well as the empirical variances. We seek proper sample sizes $n_1$ and $n_2$ such that $\tilde\beta_{n_1,m}$ and $\hat\beta_{n_2,m}$ reach approximate normality. We also compare the computational efficiency in terms of elapsed time and the number of iterations needed for the restarting improvement described in Section 7.1.1. Throughout this section, $m = 1000$ unless otherwise specified.

7.4.1 Covariates without Normality

Let us start with a parametric setting whose covariates are not normally distributed.
Let $\beta_0 = (0.1, 0.2, 0.3, 0.4)$, and consider Set 3 described in (7.29):
$$ \text{Set 3:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim \ln(\Gamma(3)), \; Z_3 \sim \ln(F_{3,5}), \; Z_4 \sim \ln(\mathrm{Beta}(5,3)), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.5). \qquad (7.29) $$

Normality. We found that the smallest sample size needed to attain approximate normality is affected by the symmetry of the covariates. Checking the histograms, $Z_1$ and $Z_3$ appear much more symmetric, and the sample size needed for $\tilde\beta_n^1$, $\tilde\beta_n^3$, $\hat\beta_n^1$ and $\hat\beta_n^3$ to reach normality is significantly smaller than that for $\tilde\beta_n^2$ and $\tilde\beta_n^4$, and $\hat\beta_n^2$ and $\hat\beta_n^4$. With $n = 100$, censoring rate ranging from 16% to 43% with mean 29.3%, and based on the Shapiro test, $\tilde\beta_{n,m}^i$ and $\hat\beta_{n,m}^i$ reach normality only for $i = 1$ and $i = 3$. When the sample size is $n = 200$, both $\tilde\beta_{n,m}^i$ and $\hat\beta_{n,m}^i$ pass the Shapiro test for $i = 1, \dots, 4$. However, if we use Mardia's test for multivariate normality, the sample size needs to be increased to 450. Histograms of $\tilde\beta_{450,1000}^i$ and scatter plots of $\tilde\beta_{450,m}^i$ against $\hat\beta_{450,m}^i$ can be found in Figure 7.6.

Figure 7.6: Histograms and QQ plots for non-normal covariates.
Accuracy and Asymptotic Variance-Covariance. The first thing we compared was whether the two estimators $\hat\beta_{n,m}$ and $\tilde\beta_{n,m}$ differ much from replica to replica. In our simulation, we measure this difference using the $L_1$ vector norm of $\tilde\beta_{n,m} - \hat\beta_{n,m}$, which theoretically should be very small for proper $n$. This is because when constructing $\hat\beta_{n,m}$, we first obtained $t = t(\beta)$ from
$$ t^* = \arg\max_t f(t), \quad \text{where } f(t) = \sum_{i=1}^n \ln\bigl(1 + t^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)\bigr) \qquad (7.30) $$
for fixed $\beta$. Since $t$ is a maximizer of $f(t)$,
$$ \nabla_t f(t) = \sum_{i=1}^n \frac{m_n(X_i,\mathcal X,\beta)}{1 + t(\beta)^{\mathrm{tr}} m_n(X_i,\mathcal X,\beta)} = 0. \qquad (7.31) $$
On the other hand, $\tilde\beta_n$ is the solution to
$$ S_n(\beta) = \sum_{i=1}^n m_n(X_i,\mathcal X,\beta) = 0. \qquad (7.32) $$
Comparing (7.31) and (7.32), we see that $S_n(\beta)$ is exactly $\nabla_t f(t)$ evaluated at $t = 0$. Moreover, the solution to (7.30) theoretically has a very small magnitude relative to the order of $\|\beta - \beta_0\|$. Since for large $n$, $\|\hat\beta_n - \beta_0\|$ is small, by the continuity of $\nabla_t f(t)$, $\tilde\beta_n$ and $\hat\beta_n$ should be very close as well. We can observe this fact from the scatter plots of $\tilde\beta_{n,m}^i$ against $\hat\beta_{n,m}^i$ for $i = 1, \dots, p$: for $n = 100$, the dots are distributed closely around the line $y = x$, and as $n$ grows to 150 and 200, the linear pattern is even clearer. As for accuracy, even for $n = 100$, the mean and median of both $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$ are very close to the true $\beta_0$. We have also compared the empirical variance-covariance matrices of the two estimators, denoted by cov1 and cov2. The largest eigenvalue of cov1 − cov2 for $n = 100$ is of magnitude $n^{-3}$.

Restarting Improvement and Timing. We found that $\tilde\beta_{n,m}$ is always faster to compute than $\hat\beta_{n,m}$, because the latter requires two steps of optimization while $\tilde\beta_{n,m}$ requires only one. As for the number of iterations needed for the restarting improvement, we found that $\tilde\beta_{n,m}$ always requires more rounds.
This is not surprising because the negative logarithm of the pEL function $l(\beta)$ is much smoother than the extended estimating function $S_n(\beta)$; consequently, it takes more restarts to find the minimum of $\|S_n(\beta)\|$.

7.4.2 Normally Distributed Covariates

We investigate similar aspects of $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$ under $\beta_0 = (0.51, 0.15, 1.18, 0.92)$ and
$$ \text{Set 4:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim N(0.3, 0.25), \; Z_4 \sim N(0.1, 0.09), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.5). \qquad (7.33) $$
When all the covariates are normally distributed, the sample size $n$ at which $\tilde\beta_{n,m}^i$ and $\hat\beta_{n,m}^i$ attain normality is much smaller, for $i = 1, \dots, p$. For $n = 100$, with censoring rate ranging from 15% to 44% and centered at a mean of 28%, all components of $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$ passed the Shapiro normality test. When $n = 400$, both estimators pass the Mardia test for multivariate normality. Histograms of $\tilde\beta_{400,1000}^i$ and scatter plots of $\tilde\beta_{400,m}^i$ against $\hat\beta_{400,m}^i$ can be found in Figure 7.7.

Figure 7.7: Histograms and QQ plots for normal covariates.

Similar to the
non-normal setting, the $L_1$ norm of $\tilde\beta_{n,m} - \hat\beta_{n,m}$ is always small, as evidenced by the scatter plots of $\tilde\beta_{n,m}^i$ against $\hat\beta_{n,m}^i$ for $i = 1, \dots, 4$, on which the points are distributed closely along the line $y = x$. The magnitudes of the absolute values of the eigenvalues of $\mathrm{cov}(\tilde\beta_{n,m}) - \mathrm{cov}(\hat\beta_{n,m})$ are also small: when $n = 450$, the eigenvalue with the largest absolute value is 1.66e-05.

Under Set 4, the EL method is still more costly in terms of time. However, as in Set 3, the number of iterations needed for the restarting improvement is much smaller for the EL method. When $n = 400$, 938 replicas are finished with only 1 restart and the remaining 62 are finished within 2 restarting iterations for $\hat\beta_{n,m}$; in comparison, when calculating $\tilde\beta_{n,m}$, only 834 replicas are finished within 1 restart, and the maximum number of iterations is 4.

7.4.3 Extreme Cases

In this section, we consider an "extreme" parameter setting, under which the lifetime $T$ is either very large or very small. This is rarely seen in practice; however, we would like to compare the behavior of $\tilde\beta_n$ and $\hat\beta_n$ out of theoretical interest. Consider $\beta_0 = (0.4, 0.3, 0.2, 0.1)$ and
$$ \text{Set 5:} \quad Z_1 \sim \mathrm{Bernoulli}(0.5), \; Z_2 \sim N(0,1), \; Z_3 \sim F_{3,5}, \; Z_4 \sim \mathrm{Beta}(5,3), \; \varepsilon \sim \mathrm{Weibull}(1)-1, \; C \sim \mathrm{Exp}(0.8). \qquad (7.34) $$
Under parametric setting (7.34), the censoring rate ranges from 25.8% to 38.6% with mean around 31.3%. From the histograms, we can see that $T$ and $V$ are severely skewed to the right with extremely large outliers, which happens because the covariate $Z_3$ is severely skewed to the right. Therefore we expect the estimator for $\beta_0^3$ to attain asymptotic normality only at a much larger $n$. For both $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$, a large sample size is needed in order to attain asymptotic normality. Unfortunately, we were not able to find a proper $n$ at which $\tilde\beta_{n,m}^3$ or $\hat\beta_{n,m}^3$ attains normality: for $n = 1000$, the p-values of the Shapiro test on both $\tilde\beta_{n,m}^i$ and $\hat\beta_{n,m}^i$ for $i = 1, 2, 4$ are above 0.6, yet 0.0004 for $\tilde\beta_{n,m}^3$ and $\hat\beta_{n,m}^3$.
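The per-component normality checks used throughout Section 7.4 were run in R; a minimal Python sketch with scipy's Shapiro-Wilk test, where `componentwise_shapiro` is a hypothetical helper name:

```python
import numpy as np
from scipy.stats import shapiro

def componentwise_shapiro(est, alpha=0.05):
    """Apply the Shapiro-Wilk test to each component of a batch of
    estimates; est has shape (m, p), one row per simulation replicate.
    Returns the p-values and flags indicating which components pass."""
    pvals = np.array([shapiro(est[:, i]).pvalue for i in range(est.shape[1])])
    return pvals, pvals > alpha
```

A heavily skewed component (such as the one driven by $Z_3$ above) is expected to be rejected even at moderate $m$, while symmetric components pass at much smaller sample sizes.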
Comparing $\tilde\beta_{n,m}$ and $\hat\beta_{n,m}$, the two are very close, as evidenced by a small $L_1$ norm for $m = 1, \dots, 1000$, and the difference of the variance-covariance matrices has eigenvalues of magnitude 1e-5 when $n = 1000$ and 1e-4 when $n = 500$.

Chapter 8: Contribution and Future Work

In this chapter, we summarize the major contributions of this thesis, then outline future work related to the Extended Estimating Equations (EEE), and how to extend them further so that the EEE can be applied to a broader class of semiparametric models.

8.1 Original Contribution

The most important concept developed in this thesis is the EEE. Motivated by the lack of coverage of the Classic Estimating Equation (CEE) in the regime of right-censored semiparametric models, including the widely used Cox model and the Accelerated Failure Time (AFT) model, we saw the necessity of extending the scope of the CEE so that it can also serve as a tool in those semiparametric settings. Inspired by the construction of the partial likelihood equation for the Cox model in [11] and the linear rank equation for the AFT model in [38], we establish the EEE by allowing the criterion function to depend not only on the data and the structural parameter, but also on the nuisance parameter. Using the concepts of Glivenko-Cantelli and Donsker classes, we are able to prove lemmas parallel to those for the EL in [32]
The GMT model has been well researched in the statistical literature, for example by Tsiatis in [38]. Similarly to Tsiatis's work, we construct a martingale structure on the residual scale. This structure guarantees the zero-mean assumption of the EEE. We then prove the local asymptotic linearity of the associated EEE around the true structural parameter, which leads to the asymptotic normality of the structural estimator. We can also use the EEE as a constraint in Empirical Likelihood (EL) maximization. We prove that for criterion functions that are continuous with respect to the structural parameter, or that have a martingale structure as described for the ϕ-transformation model and the Cox model, there exists a non-shrinking neighbourhood of the true structural parameter such that, for any fixed value in that neighbourhood, there exists a unique probability vector that maximizes the EL with probability approaching 1. A structural estimator can then be calculated by maximizing the Profile Empirical Likelihood (pEL). After establishing lemmas parallel to those in [32] and [33] using empirical process theory, we show that the pEL estimator is asymptotically normal. When the dimension of the EEE equals that of the structural parameter, the asymptotic variance-covariance matrix of the pEL estimator is identical to the sandwich-form variance-covariance matrix of the Z-estimator obtained by solving the corresponding EEE. From simulations in R, we are able to visually compare the local continuity of the EEE and the pEL, and we found the latter much smoother than the former. We also propose a concept of "total difference" to quantify the local continuity of the pEL. Although we are not clear about the statistical behavior of this quantity, the simulation results show that the pEL estimator preserves a pattern of total difference similar to that at the true parameter value.
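The "total difference" summary mentioned above admits several formalizations; one plausible version, sketched below, is the total variation of the criterion along a one-dimensional coordinate projection over a grid near the estimate. All names and parameter choices here are illustrative, not the dissertation's exact definition.

```python
import numpy as np

def total_difference(fn, beta_hat, coord=0, half_width=0.05, n_grid=101):
    """Sum of absolute successive changes of `fn` along one coordinate
    projection near `beta_hat` -- one plausible formalization of the
    'total difference' smoothness summary (illustrative only)."""
    grid = np.linspace(beta_hat[coord] - half_width,
                       beta_hat[coord] + half_width, n_grid)
    vals = []
    for g in grid:
        b = beta_hat.copy()
        b[coord] = g          # vary one coordinate, hold the others fixed
        vals.append(fn(b))
    return np.sum(np.abs(np.diff(vals)))

# A smooth projection accumulates less total difference than a step-like one,
# matching the smoother pEL versus the non-smooth EEE criterion.
beta_hat = np.array([0.4, 0.3])
smooth = lambda b: (b[0] - 0.4) ** 2
jumpy = lambda b: np.floor(20 * b[0]) / 20.0   # piecewise-constant jumps
```

On this toy pair, the smooth projection has total difference about 0.005 while the step function accumulates the full heights of its jumps, so the summary separates the two behaviors as intended.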
Although, strictly speaking, neither the EEE nor the pEL is continuous with respect to the structural parameter, the pEL does appear to be smoother. One benefit of this "continuity" is that the restarting-improvement procedure using Nelder-Mead optimization requires far fewer iterations. We also find that for a sufficiently large sample size, the Z-estimator and the pEL estimator are very close: the L1 distance between the two is small, and the magnitude of the eigenvalues of the difference between their variance-covariance matrices is small as well. The smallest sample size needed for the pEL to reach asymptotic normality is affected by the skewness of the covariates: the more symmetric the covariates are, the smaller the threshold sample size. 8.2 Future Work Under the CEE setting, one of the important benefits of the EL is constructing Wilks-type confidence regions and developing the EL ratio test. Compared with Wald-type statistics, the EL confidence region does not involve calculating a variance-covariance matrix, and is therefore more computationally efficient. Owen developed the EL confidence region in [32] for the case when the dimension r of the criterion function equals the dimension p of the structural parameter, and Qin and Lawless generalized his conclusion to the case r > p. Following the thread of Qin and Lawless's work, we were able to establish the asymptotic normality of the pEL estimator associated with the EEE; however, we did not establish a Wilks-type theorem for the EEE. Since we have already proved the asymptotic normality of the EEE evaluated at the true structural parameter, applying to the pEL the same Taylor expansion technique as in the proof of Theorem 2 of [33], we should be able to show that the EL ratio statistic follows a chi-squared distribution with degrees of freedom equal to the dimension of the structural parameter, and a confidence region will follow from this result.
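The Wilks-type construction envisioned here is the EEE analogue of the classical CEE case, where it is easy to sketch. The following Python sketch computes −2 log of Owen's empirical likelihood ratio for a scalar mean, via the Lagrange-multiplier equation, and the chi-squared cutoff that calibrates the confidence region; it illustrates the CEE mechanism only, not the EEE extension proposed above.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import chi2

def neg2_log_el_ratio(x, mu):
    """-2 log empirical likelihood ratio for a scalar mean (Owen's classical
    CEE case): solve sum_i (x_i - mu) / (1 + lam*(x_i - mu)) = 0 for the
    Lagrange multiplier lam, then evaluate 2 * sum_i log(1 + lam*(x_i - mu))."""
    d = x - mu
    if d.min() >= 0 or d.max() <= 0:
        return np.inf                      # mu outside the convex hull of the data
    # lam must keep every implied weight positive: 1 + lam*d_i > 0
    lo = (-1.0 + 1e-10) / d.max()
    hi = (-1.0 + 1e-10) / d.min()
    lam = brentq(lambda l: np.sum(d / (1.0 + l * d)), lo, hi)
    return 2.0 * np.sum(np.log1p(lam * d))

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=200)
# Wilks-type 95% region: { mu : -2 log R(mu) <= chi-squared(1) 0.95 quantile }
cutoff = chi2.ppf(0.95, df=1)
stat_at_truth = neg2_log_el_ratio(x, 0.5)
```

The score in the multiplier equation is strictly decreasing in λ, so the bracketed root is unique; the ratio is 0 at the sample mean and +∞ outside the convex hull of the data, which is the well-known hull constraint of the EL region.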
A major advantage of using the CEE as a constraint in the EL method, rather than directly solving for a Z-estimator, is that the former allows the dimension r of the CEE to exceed the dimension p of the structural parameter. In this thesis, we restrict our discussion to the case r = p, except in the proofs of the lemmas parallel to Lemma 1 and Theorem 1 of [33]. It is appealing to allow r > p under the EEE setting as well. The additional proofs would involve establishing a neighbourhood of the true structural parameter on which there exists a unique probability vector maximizing the EL, among other steps. The efficiency of the Z-estimator and the pEL estimator associated with the EEE is an area untouched in this thesis. Recall that the broadest class of semiparametric models associated with the EEE is the ϕ-transformation model, which is generalized from the General Transformation Model (GMT). In [38], Tsiatis proposed a class of estimators using linear rank tests for the GMT, and constructed the efficient estimators within this class together with conditions under which they are fully efficient. In Theorem 3 of [33], Qin and Lawless pointed out that the pEL estimator with r > p is efficient in the sense of [40] and [5]. As for the EEE, we are interested in the following two questions. First, when r = p, are the Z-estimator and the pEL estimator efficient in any sense? Second, when r > p, the asymptotic variance-covariance matrices are no longer identical; it is therefore appealing to inspect whether the two matrices become the same after further simplification, and if not, how the efficiencies of the two estimators compare. Up to now, the broadest class of semiparametric models that we can construct to serve as an example of the EEE is the ϕ-transformation model. A natural question is therefore whether it is possible to extend the EEE even further to cover more models.
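One candidate extension, taken up next, is a frailty version of the Cox model, in which an unobserved multiplicative random effect enters the hazard. As a minimal sketch of what data from such a model look like, assuming a unit exponential baseline hazard, a gamma frailty with mean 1, and hypothetical parameter values throughout, survival times can be simulated by inversion:

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 5000, np.array([0.4, 0.3])

# Gamma frailty W with mean 1 and variance 1/shape (hypothetical choice),
# independent of the covariates Z.
shape = 2.0
W = rng.gamma(shape, 1.0 / shape, size=n)
Z = rng.normal(size=(n, 2))

# With a unit exponential baseline, Lambda0(t) = t, so inverting the
# conditional survival function S(t|Z,W) = exp(-t * W * exp(beta'Z))
# gives T = E / (W * exp(beta'Z)) with E ~ Exp(1).
T = rng.exponential(1.0, size=n) / (W * np.exp(Z @ beta))
```

Conditioning the frailty to have mean 1 is the usual identifiability normalization; larger frailty variance induces more heterogeneity between subjects than the Cox model alone can express.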
One such model could be the frailty model, first introduced by Vaupel for univariate survival models in [43], and later applied to the multivariate situation of familial tendency in chronic disease incidence by Clayton in [9]. The frailty model introduces an unobserved random effect into the exponential proportionality part of the Cox model, i.e.,

λ_{T|Z̃(t),W}{t | z̃(t), w} = λ₀(t) exp{ln W + βᵗʳZ(t)},   (8.1)

where ln W is an unobserved continuous random variable, unique to the linear predictor of each observation, that is independent of Z; Z = {Z(t) : t ≥ 0} is a p × 1 covariate that may be time dependent; λ_{T|Z̃(t),W}{t | z̃(t), w} is the hazard function of T conditional on W and Z̃(t) = {Z(s), s ≤ t}; and λ₀(t) is an unspecified baseline hazard function. Let f(w, γ) be the density function of W with unknown parameter γ ∈ R. Let θ = (β, γ) ∈ Θ, where Θ = U_{β₀} × R; then (8.1) yields a class of semiparametric models with parameters ψ = (θ, λ) ∈ Θ × H, where

H ≡ {λ(·) : λ(u) > 0 for all u ∈ R⁺; ∫₀^∞ λ(u) du = ∞}.

We did some preliminary calculation on (8.1) assuming W follows a gamma distribution, and we found that an estimator of β can be defined via maximizing the pEL, which eventually reduces to solving an equation that involves quotients among three higher-order summations. The question, then, is whether we could make a general definition out of the frailty model that is an extended EEE. On the one hand, this extended EEE should cover cases including the frailty model, the ϕ-transformation model and the Cox model; on the other hand, there should be examples beyond those three semiparametric models that can be investigated using the extended EEE. Appendix A: Empirical Process Theory In this appendix, we first show that under suitable conditions, if F is a Donsker class with finite bracketing integral, then it is also a Glivenko-Cantelli class.
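Before the formal proofs, the uniform convergence that this appendix establishes can be illustrated numerically. The following Python sketch, with a scalar covariate and illustrative distributions for Z and V, evaluates the empirical mean from (A.1) over a (β, t) grid; the only properties checked are the monotone decay in t and the agreement with E[Z] at (β, t) = (0, 0).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
Z = rng.uniform(0.0, 1.0, size=n)   # bounded covariate, as in Theorem A.0.3
V = rng.exponential(1.0, size=n)    # illustrative choice for V

def empirical_map(beta, t):
    """Scalar-covariate version of the empirical mean in (A.1):
    (1/n) * sum_j Z_j * I(V_j > t) * exp(beta * Z_j)."""
    return np.mean(Z * (V > t) * np.exp(beta * Z))

betas = np.linspace(-0.5, 0.5, 11)
ts = np.linspace(0.0, 2.0, 21)
G = np.array([[empirical_map(b, t) for t in ts] for b in betas])
# Each row is nonincreasing in t (the at-risk indicator can only shrink),
# and G at (beta, t) = (0, 0) is close to E[Z] = 0.5.
```

The Glivenko-Cantelli property proved below is exactly the statement that such grids of empirical means converge to their population counterparts uniformly over the whole (β, t) range, not merely pointwise.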
Then we prove that under P_θ with θ = (β, µ),

(1/n) ∑_{j=1}^n Z_j I(V_j > t) e^{βᵗʳZ_j} →^P E{Z₁ I(V₁ > t) e^{βᵗʳZ₁}},   (A.1)

where {(Z_j, V_j)}_{j=1}^∞ are independently identically distributed (i.i.d.) samples defined in Chapter 1, and the convergence is uniform in the parameter (β, t) ∈ U_{β₀} × R⁺, where U_{β₀} is a ball in Rᵖ that contains the true parameter value. Definitions of the Donsker class, the Glivenko-Cantelli class and the bracketing integral J_{[ ]}(1, F, L₂(P)) can be found on pages 269-270, Chapter 19 of [41]. Theorem A.0.1 Let F be a class of measurable functions such that the bracketing integral J_{[ ]}(1, F, L₂(P)) < ∞. Then F is a Glivenko-Cantelli class. Proof: Let N_{[ ]}(ε, F, L_p(P)) be the ε-bracketing number in L_p defined on page 270 of [41]. Since by this definition every ε-bracket in L₂ is also an ε-bracket in L₁, we know that N_{[ ]}(ε, F, L₁(P)) ≤ N_{[ ]}(ε, F, L₂(P)). By the definition of the bracketing integral on page 270, Section 19.2 of [41],

J_{[ ]}(δ, F, L₂(P)) = ∫₀^δ √(ln N_{[ ]}(ε, F, L₂(P))) dε.   (A.2)

Therefore, given J_{[ ]}(1, F, L₂(P)) < ∞, the integrand √(ln N_{[ ]}(ε, F, L₂(P))) is finite for almost every ε. Consequently, N_{[ ]}(ε, F, L₂(P)) < ∞ a.e. By Theorem 19.4 of [41], F is also Glivenko-Cantelli. The following theorem is established as part of Example 19.20 on page 277 of [41]. It shows how to construct a new Donsker class from two existing Donsker classes via a Lipschitz transformation. Theorem A.0.2 (Lipschitz Transformation) Let φ : R² → R be a fixed Lipschitz function. If f and g range over Donsker classes F and G with integrable envelope functions, then the set of functions φ(f, g) is Donsker. Now we are ready to show that the convergence in (A.1) is uniform in the parameter θ = (β, t). Theorem A.0.3 Let x = (z, s) ∈ Z × R⁺ and θ = (β, t) ∈ U_{β₀} × R⁺, where Z is a compact and bounded set in Rᵖ and U_{β₀} ⊂ Rᵖ is a ball centered at β₀. Then

(1/n) ∑_{j=1}^n Z_j I(V_j > t) e^{βᵗʳZ_j} →^P E{Z₁ I(V₁ > t) e^{βᵗʳZ₁}},   (A.3)

and the convergence is uniform in θ.
Proof: For fixed θ = (β, t) ∈ U_{β₀} × R⁺, the convergence in (A.3) is guaranteed by the Law of Large Numbers (LLN). To prove that this convergence is uniform in θ, for any fixed θ define the function f(·, θ) : Z × R⁺ → R with argument x = (z, s) ∈ Z × R⁺ by

f(x, θ) = z e^{βᵗʳz} I(s ≥ t).   (A.4)

If we can show that F = {f(x, θ) = z e^{βᵗʳz} I(s ≥ t), where θ = (β, t) ∈ U_{β₀} × R⁺} is Glivenko-Cantelli, then the uniformity is proved. The idea is to consider F as a Lipschitz transformation of a parametric class and a bounded-variation class, and then apply Theorem A.0.2. For any fixed β, let g(x, β) = z e^{βᵗʳz} and G = {g(·, β) : β ∈ U_{β₀}}. Let us show that G is a parametric class satisfying the conditions described in Example 19.7 of [41], where β corresponds to the θ of Example 19.7. For any β₁ and β₂ in U_{β₀},

|z e^{β₁ᵗʳz} − z e^{β₂ᵗʳz}| = |z(e^{β₁ᵗʳz} − e^{β₂ᵗʳz})| ≤ |z| |e^{β₁ᵗʳz} − e^{β₂ᵗʳz}|.   (A.5)

Since the exponential function eˣ is Lipschitz continuous when x is bounded, there exists a constant C such that

|e^{β₁ᵗʳz} − e^{β₂ᵗʳz}| ≤ C|β₁ᵗʳz − β₂ᵗʳz| = C|(β₁ − β₂)ᵗʳz| ≤ C|β₁ − β₂||z|.

Plugging this back into (A.5) yields

|z e^{β₁ᵗʳz} − z e^{β₂ᵗʳz}| ≤ C|z|²|β₁ − β₂|.

Hence the Lipschitz condition of Example 19.7 holds with m(x) = C|z|². Consequently, G is a Donsker class by Example 19.7 of [41]. For any fixed t ∈ R⁺, consider functions h(·, t) : Z × R⁺ → R of the form h(x, t) = I(s ≥ t), and let H = {h(·, t) : t ∈ R⁺}. For any h ∈ H, the variation is 1. Then, by the bounded-variation class described in Example 19.11 of [41], H is also a Donsker class. Next, we consider the product of functions from G and H. Let φ : R² → R be φ(u, v) = uv. This is a Lipschitz function on the relevant range, since z is assumed to be bounded and therefore G is uniformly bounded.
Hence, by Theorem A.0.2, the new class of functions φ(g, h) with g ∈ G and h ∈ H is also Donsker; i.e., the class of functions f : Z × R⁺ → R of the form

F = {f(x, θ) = z e^{βᵗʳz} I(s ≥ t), where θ = (β, t) ∈ U_{β₀} × R⁺}

is a Donsker class. Bibliography [1] P. Andersen, O. Borgan, R. Gill and N. Keiding, Statistical Models Based on Counting Processes, (Springer, New York), Chap. 2. [2] P. Andersen and R. Gill, "Cox's Regression Model for Counting Processes: a Large Sample Study", Ann. Statist. 10, (1982). [3] V. Bagdonavičius and M. Nikulin, "Semiparametric Models in Accelerated Life Testing," Queen's Papers in Pure and Applied Mathematics 98, (1995). [4] B. Bibby and M. Sørensen, "Martingale Estimation Functions for Discretely Observed Diffusion Processes", Bernoulli 1, (1995). [5] P. Bickel, C. Klaassen, Y. Ritov and J. Wellner, Efficient and Adaptive Estimation for Semiparametric Models, (Johns Hopkins Univ. Press). [6] P. Billingsley, Probability and Measure, (Wiley, New York). [7] S. Boyd and L. Vandenberghe, Convex Optimization, (Cambridge Univ. Press). [8] J. Buckley and I. James, "Linear Regression with Censored Data", Biometrika 66, (1979). [9] D. Clayton, "A Model for Association in Bivariate Life Tables and Its Application in Epidemiological Studies of Familial Tendency in Chronic Disease Incidence", Biometrika 65, (1978). [10] E. Çinlar and S. Özekici, "Reliability of Complex Devices in Random Environments", Probability in the Engineering and Informational Sciences 1, (1987). [11] D. R. Cox, "Regression Models and Life-Tables", Journal of the Royal Statistical Society, Series B (Methodological) 34, (1972). [12] D. R. Cox, "Partial Likelihood", Biometrika 62, (1975). [13] D. R. Cox and D. Oakes, Analysis of Survival Data, (Chapman and Hall/CRC). [14] B. Efron, "The Efficiency of Cox's Likelihood Function for Censored Data", JASA 72, (1977). [15] R. Ellis and D.
Gulick, Calculus with Concepts in Calculus, (Cengage Learning, Ohio). [16] V. Godambe, "An Optimum Property of Regular Maximum Likelihood Estimation", Ann. Math. Statist. 31, (1960). [17] G. Heller, "Smoothed Rank Regression With Censored Data," JASA 102, (2007). [18] T. P. Hettmansperger, Statistical Inference Based on Ranks, (Wiley, New York). [19] N. Hjort, I. McKeague and I. van Keilegom, "Extending the Scope of Empirical Likelihood," Ann. Statist. 37, (2009). [20] S. Y. Hwang and I. V. Basawa, "A Review Toward a Unifying Tool", conference paper in S. Lahiri et al., Contemporary Developments in Statistical Theory, pp. 9-28. [21] J. Jacod, "Un théorème de représentation pour les martingales discontinues," Z. Wahrsch. verw. Gebiete 34, (1976). [22] I. James and P. Smith, "Consistency Results for Linear Regression with Censored Data", Ann. Statist. 12, (1984). [23] J. Jurečková, "Asymptotic Linearity of a Rank Statistic in Regression Parameter," Ann. Math. Statist. 40, (1969). [24] J. Jurečková, "Nonparametric Estimate of Regression Coefficients," Ann. Math. Statist. 42, (1971). [25] H. Koul, V. Susarla and J. Van Ryzin, "Regression Analysis with Randomly Right-Censored Data," Ann. Statist. 9, (1981). [26] T. Lai and Z. Ying, "Large Sample Theory of a Modified Buckley-James Estimator for Regression Analysis with Censored Data", Ann. Statist. 19, (1991). [27] E. Lopez, I. van Keilegom and N. Veraverbeke, "Empirical Likelihood for Non-smooth Criterion Functions," Scand. J. Stat. 36, (2009). [28] T. A. Louis, "Nonparametric Analysis of an Accelerated Failure Time Model," Biometrika 68, (1981). [29] T. Merkouris, "Transform Martingale Estimating Functions", Ann. Statist. 35, (2007). [30] R. Miller, "Least Squares Regression with Censored Data", Biometrika 63, (1976). [31] A. Owen, "Empirical Likelihood Ratio Confidence Intervals for a Single Functional," Biometrika 75, (1988). [32] A. Owen, Empirical Likelihood, (CRC Press, Boca Raton, Florida). [33] J.
Qin and J. Lawless, "Empirical Likelihood and General Estimating Equations," Ann. Statist. 22, (1994). [34] Y. Ritov, "Estimation in a Linear Regression Model with Censored Data," Ann. Statist. 18, (1990). [35] Y. Ritov and J. Wellner, "Censoring, Martingales and the Cox Model", Contemp. Math. 80, (1988). [36] R. Serfling, Approximation Theorems of Mathematical Statistics, (Wiley-Interscience, New York). [37] J. Shao, Mathematical Statistics, (Springer, New York). [38] A. Tsiatis, "Estimating Regression Parameters Using Linear Rank Tests for Censored Data," Ann. Statist. 18, (1990). [39] A. Tsiatis, Semiparametric Theory and Missing Data, (Springer, New York). [40] A. van der Vaart, Statistical Estimation in Large Parameter Spaces, (Centrum voor Wiskunde en Informatica, Amsterdam). [41] A. van der Vaart, Asymptotic Statistics, (Cambridge University Press, New York). [42] A. van der Vaart and J. Wellner, Weak Convergence and Empirical Processes, (Springer, New York). [43] J. Vaupel, K. Manton and E. Stallard, "The Impact of Heterogeneity in Individual Frailty on the Dynamics of Mortality," Demography 16, (1979). [44] L. J. Wei and M. H. Gail, "Nonparametric Estimation for a Scale-Change with Censored Observations," J. Amer. Statist. Assoc. 78, (1983). [45] L. Wei, Z. Ying and D. Lin, "Linear Regression Analysis of Censored Survival Data Based on Rank Tests", Biometrika 77, (1990). [46] M. Zhou, "Empirical Likelihood Analysis of the Rank Estimator for the Censored Accelerated Failure Time Model," Biometrika 92, (2005). [47] M. Zhou, "Empirical Likelihood Ratio with Arbitrarily Censored/Truncated Data by EM Algorithm," J. Computational and Graphical Statistics 14, (2005). [48] M. Zhou, Empirical Likelihood Method in Survival Analysis, (Chapman and Hall/CRC).