ABSTRACT Title of dissertation: Statistical Inference Using Data From Multiple Files Combined Through Record Linkage Ying Han, Doctor of Philosophy, 2018 Dissertation directed by: Professor Partha Lahiri Joint Program of Survey Methodology & Department of Mathematics Record linkage methods help us combine multiple data sets from different sources when a single data set with all necessary information is unavailable or when data collection on additional variables is time consuming and extremely costly. Link- age errors are inevitable in the linked data set because of the unavailability of an error-free and unique identifier and because of possible errors in measuring or record- ing. It has been realized that even a small amount of linkage errors can lead to substantial bias and increase variability in estimating the parameters of a statistical model. The importance of incorporating uncertainty of the record linkage process into the statistical analysis step cannot be overemphasized. The current research is mainly focused on the regression analysis of the linked data. The record linkage and statistical analysis processes are treated as two sepa- rate steps. Due to the limited information about the record linkage process, simpli- fying assumptions on the linkage mechanism have to be made. In reality, however, these assumptions may be violated. Also, most of the existing linkage error models are built on the linked data set, which only contains records for the designated links. Information about linkage errors carried by the designated non-links is missing. In the dissertation, we provide general methodologies for both regression anal- ysis and small area estimation using data from multiple files. A general integrated model is proposed to combine the record linkage and statistical analysis processes. The proposed linkage error models are built directly on the data values from the original sources, and based on the actual record linkage method that is used. We have adapted the jackknife methods to estimate bias, variance, and mean squared error of our proposed estimators. To illustrate the general methodology, we give one example of estimating the regression coefficients in the linear and logistic regression models, and another example of estimating small area mean under the nested-error linear regression model. In order to reduce the computational burden, simplified version of the proposed estimators, jackknife methods, and numerical algorithms are given. A Monte Carlo simulation study is devised to evaluate the performance of the proposed estimators and to investigate the difference between the standard and simplified jackknife methods. Statistical Inference Using Multiple Data Files Combined Through Record Linkage Techniques by Ying Han Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2018 Advisory Committee: Professor Partha Lahiri, Chair/Advisor Professor Cinzia Cirillo Professor Paul Smith Professor Takumi Saegusa Professor Yan Li ©c Copyright by Ying Han 2018 Dedication I dedicate this dissertation to my caring husband, Alexander Estes. ii Acknowledgments First of all, I would like to thank my dissertation advisor, Partha Lahiri. This dissertation would not have been possible completed without his consistent guidance and support. It is him who brought me into the area of Survey Methodology, the area I would devote myself into. It is him who gave me the opportunity to work on multiple interesting projects for the past years. It is him who helped me in my job hunting when graduation was approaching. It has been a great pleasure to work with him during my graduate life. I would also like to give my special thanks to the rest of my committee mem- bers, Cinzia Cirillo, Paul Smith, Takumi Saegusa, and Yan Li for taking the time to read the thesis and give me their insightful feedbacks. Thank you to all the other faculties in the Department of Mathematics, Eric Slud, Leonid Koralov, Benjamin Kedem, Abram Kagan, and Joan Ren, for helping me building a solid background in mathematics and statistics. Thank you to all my friends, especially, Yimei Fan, Xia Li, Xuan Yao, and Chen Wang, for encouraging me when I got frustrated, and providing suggestions whenever I need some. Thank you to my parents, Zhongtang Han and Shuxiangfan, and my sister, Shujuan Han, for supporting me over the years. Thank you to the rest of the family for their support and collective wisdom. Thank you to my husband, Alex Estes, for being with me the whole way. Thank you to everyone else who has helped me. iii Table of Contents Dedication ii Acknowledgements iii List of Tables vii List of Figures viii List of Abbreviations ix List of Notations x 1 Introduction 1 1.1 Record Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1.1 Fellegi and Sunter Model . . . . . . . . . . . . . . . . . . . . . 2 1.2 Statistical Analysis of Linked Data . . . . . . . . . . . . . . . . . . . 7 1.2.1 Linkage Mechanisms . . . . . . . . . . . . . . . . . . . . . . . 7 1.2.2 Linkage Error Model: Chambers (2009) . . . . . . . . . . . . . 11 1.2.3 Linkage Error Model: Scheuren and Winkler (1993) . . . . . 14 1.3 Discussion and Overview of the Dissertation . . . . . . . . . . . . . . 16 2 Regression Analysis of Data from Two Files 19 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Problem Description and Data Availability . . . . . . . . . . . . . . . 20 2.3 General Integrated Model for Regression Analysis . . . . . . . . . . . 21 2.3.1 Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.3.2 Linkage Error Model . . . . . . . . . . . . . . . . . . . . . . . 23 2.3.3 Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.3.4 Designation of Links and non-Links . . . . . . . . . . . . . . . 26 2.4 Estimation of Regression Coefficients . . . . . . . . . . . . . . . . . . 27 2.5 Variance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.7 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 iv 2.7.1 Proof of (2.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.7.2 Proof of (2.7) . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3 Applications to Linear and Logistic Regression 39 3.1 Linear Regression using Data from Two Files . . . . . . . . . . . . . 39 3.2 Logistic Regression using Data from Two Files . . . . . . . . . . . . . 45 3.3 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.3.1 Proof of (3.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.3.2 Proof of (3.4), (3.6), and (3.7) . . . . . . . . . . . . . . . . . . 50 3.3.3 Proof of (3.10) . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.3.4 Proof of (3.11) . . . . . . . . . . . . . . . . . . . . . . . . . . 53 3.3.5 Proof of (3.13) . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4 Small Area Estimation with Data from Two Files 57 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.2 Small Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.3 General Integrated Model for Small Area Estimation . . . . . . . . . 59 4.3.1 Problem Description and Data Availability . . . . . . . . . . . 59 4.3.2 General Integrated Model for Small Are Estimation . . . . . . 61 4.4 Empirical Best Prediction Estimator . . . . . . . . . . . . . . . . . . 66 EBP 4.5 Estimation of the Mean Squared Error of θ̂i . . . . . . . . . . . . 68 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5 Application to the General Linear Mixed Model 71 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.2 General Linear Mixed Model with Block Diagonal Covariance . . . . 72 5.3 Nested Error Linear Model . . . . . . . . . . . . . . . . . . . . . . . . 74 5.3.1 Estimation of Small Area Mean Using Data From a Single File 75 5.3.2 Estimation of Small Area Mean Using Data From Two Files . 75 5.3.3 Estimation of φ: Maximum Likelihood Method . . . . . . . . 80 5.3.4 Numerical Algorithms . . . . . . . . . . . . . . . . . . . . . . 82 5.3.5 Estimation of φ: Pseudo Maximum Likelihood Method . . . . 86 5.4 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4.1 Proof of (5.2) . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.4.2 Proof of (5.11) . . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.4.3 Proof of (5.12) . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.4.4 Proof of (5.15) . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.4.5 Proof of (5.16) and (5.17) . . . . . . . . . . . . . . . . . . . . 96 5.4.6 Proof of (5.19) . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.4.7 Proof of (5.20) and (5.21) . . . . . . . . . . . . . . . . . . . . 104 5.4.8 Proof of (5.23) . . . . . . . . . . . . . . . . . . . . . . . . . . 107 v 6 Monte Carlo Simulation Study 109 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 6.2 The Equal Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.2.1 Linear regression with linked data . . . . . . . . . . . . . . . 115 6.2.2 Logistic regression with linked data . . . . . . . . . . . . . . . 118 6.3 The Unequal Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 121 6.3.1 Linear regression with linked data . . . . . . . . . . . . . . . 123 6.4 Comparison of the Standard and Simplified Jackknife Methods . . . . 127 6.4.1 Equal Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.4.2 Unequal Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 134 6.4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 7 Future Research 136 Bibliography 140 vi List of Tables 2.1 Data layout for observations on y and x in Fy and Fx. . . . . . . . . . 21 4.1 Data layout for joint observations on y and x for each small area . . . 59 4.2 Data layout for observations on y and x for each small area in Fy and Fx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.1 Estimating equations for β estimators in linear and logistic models. . 110 6.2 Simulation conditions for two cases under the equal scenario. . . . . . 112 6.3 AAD, ASD and PRI of β estimators for linear regression (Equal Sce- nario). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 6.4 AAD, ASD and PRI of β estimators for logistic regression (Equal Scenario). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 6.5 Monte Carlo estimates and standard deviations for logistic regression (Equal, Case 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 6.6 RE of β estimators for logistic regression (Equal Scenario, Case 1). . 123 6.7 Simulation conditions for two cases under the unequal scenario. . . . 124 6.8 AAD, ASD and PRI of β estimators for linear regression (Unequal Scenario). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 6.9 Monte Carlo estimates and standard deviations for linear regression (Unequal Scenario, Case 1) . . . . . . . . . . . . . . . . . . . . . . . . 129 6.10 RE of β estimators for linear regression (Unequal Scenario, Case 1). . 129 6.11 Wilcoxon signed rank test for absolute relative difference in estimated variance for logistic regression (Equal Scenario, Case 1). . . . . . . . 132 6.12 Wilcoxon signed rank test for absolute relative difference in estimated variance for linear regression (Unequal Scenario, Case 1). . . . . . . 135 vii List of Figures 6.1 Scatter plots of β estimates in a linear model (Equal Scenario). . . . . 116 6.2 Heat maps of absolute and squared deviations of β estimates in a linear model (Equal Scenario). . . . . . . . . . . . . . . . . . . . . . . 117 6.3 Scatter plots of β estimates in a logistic model (Equal Scenario). . . . 120 6.4 Heat maps of absolute and squared deviations of β estimates in a logistic model (Equal Scenario). . . . . . . . . . . . . . . . . . . . . 121 6.5 Box plots of deviations and relative deviations of β estimators for logistic regression (Equal Scenario, Case 1). . . . . . . . . . . . . . . 122 6.6 Plots of Monte Carlo estimates and standard deviations for logistic regression (Equal Scenario, Case 1). . . . . . . . . . . . . . . . . . . . 122 6.7 Scatter plots of β estimates in a linear model (Unequal Scenario). . . 125 6.8 Heat maps of absolute and squared deviations of β estimates in a linear model (Unequal Scenario). . . . . . . . . . . . . . . . . . . . . 126 6.9 Box plots of deviations and relative deviations of β estimates for linear regression (Unequal Scenario, Case 1). . . . . . . . . . . . . . . . . . 128 6.10 Plots of Monte Carlo estimates and standard deviations for linear regression (Unequal Scenario, Case 1). . . . . . . . . . . . . . . . . . 128 6.11 Box plot of relative differences in estimated variance of β estimators for logistic regression. (Equal Scenario, Case 1). . . . . . . . . . . . . 133 6.12 Histogram of absolute relative differences in estimated variance of β estimators for logistic regression (Equal Scenario, Case 1). . . . . . . 133 6.13 Box plot of relative differences in estimated variance of β estimators for linear regression. (Unequal Scenario, Case 1). . . . . . . . . . . . 134 6.14 Histogram of absolute relative differences in estimated variance of β estimators for linear regression (Unequal Scenario, Case 1). . . . . . . 135 viii List of Abbreviations AAD Average Absolute Deviation ASD Averate Squared Deviation BLUE Best Linear Unbiased Estimator BP Best Prediction CMSE Conditional Mean Squared Error EBP Empirical Best Prediction EM Expectation Maximization ECM Expectation Conditional Maximization ML Maximum Likelihood MLE Maximum Likelihood Estimate MOM Method of Moments OLS Ordinary Least Squares WLS Weighted Least Squares PML Pseudo Maximum Likelihood PMLE Pseudo Maximum Likelihood Estimate MSE Mean Squared Error PRI Percent Relative Improvement RE Relative Efficiency REML Restricted Maximum Likelihood SE Standard Error ix List of Selected Notations U the population of interest. y a scalar random variable of interest. x a vector random variable of order p. w the vector of K matching fields. c the comparison vector variable. Fy the file containing observations on y. Fx the file containing observations on x. Sy the set of population units covered Fy. Sx the set of population units covered Fx. M the set of matches. M c the set of mismatches. N the number of records in Fx. n the number of records in Fy. G the number of blocks. m the number of small areas. Ni the number of records in block or small area i from file Fx. ni the number of records in block or small area i from file Fy. K the number of matching fields. mk the probability of a record pair agreeing on matching field k among matches M . µk the probability of a record pair agreeing on matching field k among mismatches M c. ỹj the value of y for record j in Fy. x̃j the value of x corresponding to ỹj. xj′ the value of x for record j ′ in Fx. yj′ the value of y corresponding to xj′ . w̃j the value of w for record j in Fy. wj′ the value of w for record j ′ in Fx. cjj′ the value of c for the record pair (j, j ′). l ′ the true matching status of the record pair (j, j′jj ). ỹij the value of y for record j in block or small area i from Fy. x̃ij the value of x corresponding to ỹij. xij′ the value of x for record j ′ in block or small area i from Fx. yij′ the value of y corresponding to xij′ . w̃ij the value of w for record j in block or small area i from Fy. wij′ the value of w for record j ′ in block or small area i from Fx. cijj′ the value of c for the record pair (j, j ′) in block or small area i. li ′jj′ the true matching status of the record pair (j, j ) in block or small area i. ỹ i the ni × 1 vector of observed y values in block of small area i from Fy. X̃ i the ni × p matrix of unobserved true x values corresponding to ỹ i. X i the Ni × p matrix of observed x values in block or small area i from Fx. x y i the Ni × 1 vector of unobserved true y values corresponding to X i. C i the Nini ×K matrix of the comparison vector c in block or small area i. Li the ni ×Ni matrix of the unknown true matching status in block or small area i. ỹ the n× 1 vector of observed y values in Fy. X̃ the n× p matrix of unobserved true x values corresponding to ỹ. X the N × p matrix of observed x values in Fx. y the N × 1 vector of unobserved true y values corresponding to X . C the Nn×K matrix of the comparison vector c. L the n×N matrix of the unknown true matching status. xi Chapter 1: Introduction 1.1 Record Linkage In record linkage, or exact file matching, one compares two or more files on a single population in absence of a unique and error-free identifier for purposes of unduplication or production of an enhanced, merged database (e.g., Newcombe et al. 1959, Fellegi and Sunter 1969, Herzog et al. 2007). Record linkage differs from statistical matching in terms of the types of units to be linked or matched. The primary goal of record linkage is to link an entity (e.g., person, household, farm, etc.) from one file to the same entity in other file(s). In contrast, the primary goal of statistical matching is to link similar units (e.g., matching the same demographic group from different files). In this dissertation, our focus is on the statistical esti- mation related to record linkage and not statistical matching. Readers interested in statistical matching are referred to Rässler(2002), D’Orazio (2006), and others. A merged or linked database, created by record linkage, is of great interest to analysts interested in certain specialized multivariate analysis, which would be otherwise either impossible or difficult without advanced statistical expertise as variables are stored in different files. Record linkage is used in many applications, including population size estimation at the Census Bureau (Winkler 1994, 1995, and 1 Jaro 1989), epidemiological and medical studies (e.g., Gill 1997), sociological studies, survey frame improvement, and, more recently, counter-terrorism (Gomatam and Larsen 2004). For more information on its applications, see Alvey and Jamerson (1997) and references therein. The National Death Index is matched to existing insurance, medical, and other databases for studies (e.g., Livingston and Ko 2005). Record linkage techniques can be broadly classified into deterministic and probabilistic record linkages. They both use common matching fields available from files to be linked that are indicative of a true match status of an entity. Examples of matching fields include last name, date of birth, address, etc. In deterministic record linkage, a record pair is deemed a link if the two records agree on all or some available matching fields according to a pre-specified rule, and hence there is no stochastic element in the deterministic record linkage process. On the other hand, if such a link is only deemed a link with certain probability it is called probabilistic record linkage. This dissertation concerns probabilistic record linkage. 1.1.1 Fellegi and Sunter Model Fellegi and Sunter (1969) first developed a theoretical framework for record linkage. Suppose we have two files FA and FB, which contain records for a sample SA of size n and a sample SB of size N , respectively, from the same population U . Let j represent the index of a record in F ′A, and let j represent the index of a record in FB. The goal of record linkage is to partition all record pairs in the set FA × FB = {(j, j′) : j ∈ F , j′A ∈ FB} into two disjoint sets: the set 2 of matches M = {(j, j′) : l ′jj′ = 1, j ∈ FA, j ∈ FB} and the set of mismatches M c = {(j, j′) : ljj′ = 0, j ∈ FA, j′ ∈ FB}. Here, ljj′ represent the true matching status of the record pair (j, j′); that is, ljj′ = 1 if record j from FA and record j ′ from FB actually correspond to the same population unit. The goal of record linkage is achieved by making comparisons comparisons of information on matching fields between records in FA and records in FB. The match- ing fields usually do not include a unique and error-free identifier, such as Social Security Number. Examples of matching fields include name, gender, race, date of birth, address, etc. Some matching fields (e.g., last name) have more discriminatory power than the others (e.g., gender) in distinguishing matches from mismatches. Let w = (w )Kk k=1 denote a vector of K matching fields, and let w A j and w B j′ represent the values of w for record j ∈ FA and j′ ∈ FB, respectively. The record linkage model is built on a comparison vector c = (ck) K k=1, which is a vector-valued variable that displays the pattern of agreement and disagreement on matching fields. The simplest method of constructing comparison vectors is to use exact matching. The comparison vector c Kjj′ = (cjj′k)k=1 for the record pair (j, j ′) is defined as:  A B cjj′k =  1 if wjk = wj′k . 0 if wAjk =6 wBj′k For example, when there are K = 3 matching fields, the possible values of a com- parison vector are (1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1) and (0, 0, 0). Matches tend to have more ones in their comparison vectors than mis- matches. 3 Fellegi and Sunter (1969) proposed an optimal decision rule to designate record pairs into links and non-links. The decision rule is optimal in the sense that it minimizes the number of records requiring clerical review at a fixed error level. The decision rule is based on the following likelihood ratio score: P (cjj′|(j, j′) ∈M) P (cjj′ |ljj′ = 1) Rjj′ = = (1.1) P (cjj′|(j, j′) ∈M c) P (cjj′ |ljj′ = 0) Intuitively, the larger the likelihood ratio Rjj′ is, the more likely the record pair (j, j′) is to be a true match. Therefore, based on the values of Rjj′ , one of three decisions are to be made based on the following rule: • If Rjj′ > RU , then designate the record pair (j, j′) as a link ; • If RL < Rjj′ < RU , then send the record pair (j, j′) to clerical review; • If Rjj′ < RL, then designate the record pair (j, j′) as a non-link ; where RU and RL are the optimal upper and lower thresholds, respectively, which are determined at a pre-specified error levels for false links and false non-links. Note that any monotone increasing function (such as logarithm) of the likelihood ratio Rjj′ can serve equally well as a test statistic for the purpose of record linkage. The probabilities in (1.1) are unknown and need to be estimated. To simplify the estimation of these probabilities, Fellegi and Sunter (1969) made a conditional independence assumption: agreements on matching fields are independent within 4 matches and mismatches. That is, ∏K ∏K P (c |l = 1) = P (c | c ′l = 1) = m jj k(1−m )1−c′ ′ ′ ′ jj′kjj jj jj k jj k k , k∏=1 k∏=1K K c P (c jj ′k 1−c jj′|ljj′ = 0) = P (c ′ jj′kjj k|ljj′ = 0) = uk (1− uk) , (1.2) k=1 k=1 where mk = P (cjj′k = 1|ljj′ = 1) and uk = P (cjj′k = 1|ljj′ = 0) are the probabilities of a record pair agreeing on the matching field k among matches M and mismatches M c, respectively. Under this assumption, estimation of the unknown probabilities in (1.1) is reduced to estimation of the matching parameters {mk, uk, k = 1, · · · , K}. The optimality of Fellegi and Sunter’s method heavily depends on the accuracy of the estimates of these matching parameters. Fellegi and Sunter (1969) also suggested a method called blocking to reduce the computational burden caused by the large amount of comparison vectors. The- oretically, all the record pairs in set FA × FB should be considered for comparison. However, comparison of two moderate files can lead to an extremely large amount of comparison vectors. For example, 105 comparison vectors will be generated for two files of sizes 102 and 103. Fellegi and Sunter (1969) suggested to partition records into blocks based on whether they agree on one or a set of characteristics, such as zip code or the first three digits of phone numbers. Only records within the same block are compared. In this way, the number of comparison vectors can be greatly reduced. Remark 1: Elements of the comparison vector c can be binary or continu- ous. Besides exact matching, a comparison vector with binary elements can also be constructed in a more general way by assigning a distance function dk(·) and a 5 threshold τk to each matching field k, k = 1, · · · , K. For the record pair (j, j′), its comparison vector can be defined as: cjj′k = 1 if dk(wjk, wj′k) ≤ τk and cjj′k = 0 if dk(wjk, wj′k) > τk, k = 1, · · · , K. For example, the string comparator (Winkler 1990) can be used as a distance function for a string-valued matching field. Remark 2: The conditional independence assumption has been criticized since it often fails in practice. However, estimation under the conditional indepen- dence assumption can still provide accurate decision rules even though the assump- tion is violated; see e.g., Thibaudeau (1993), Winkler (1989a, 1994). The conditional independence assumption can also be relaxed by introducing interactions among matching fields; see e.g., Armstrong and Mayda 1993, Thibaudeau 1993, Larsen and Rubin 2001. Remark 3: The optimality of the decision rule heavily depends on the accu- racy of the estimates of the matching parameters and the choices of the upper and lower thresholds; see e.g., Belin 1993, Belin and Rubin 1995. Also, it is possible that a record in FA is linked to two or more records in FB, since all the record pairs with their likelihood ratio scores above the upper threshold will be designated as links. Some programming approaches have been developed to force one-to-one linkage; see e.g. Jaro (1989) and Fortini et al. (2002). Though these approaches can help to avoid the occurrence of the one-to-many or many-to-one linkage problems in record linkage, it may also remove the true matches. 6 1.2 Statistical Analysis of Linked Data Probabilistic record linkage procedures are subject to linkage errors. There are two types of linkage errors. The linkage error is called false positive if a true mismatch is deemed a link by the record linkage procedure. On the other hand, the linkage error is called false negative if a true match is deemed a non-link by the record linkage procedure. Neter et al. (1965) showed that a relatively small amount of linkage errors could lead to substantial bias in estimating a regression relationship. If one simply ignores the linkage errors, analysis of linked data could yield misleading results in a scientific study. Therefore, the importance of accounting for linkage errors in statistical analysis cannot be overemphasized. 1.2.1 Linkage Mechanisms Suppose that a linked data set is generated by combining the records from two files Fy and Fx through some record linkage techniques. Here, Fy represents the file containing the observed values of a scalar variable y, and Fx represents the file containing the observed values of a vector-valued variable x of order p. Let X denote the matrix of the observed x values in file Fx, let y denote the vector of the unobserved true y values corresponding to X , and let y? denote the vector of the y values that are selected from the file Fy and linked to X . Thus, the linked data set contains (y?,X ). Most of the existing linkage error models are directly built on the linked data by exploiting the relationship between y? and y. Under certain assumptions, the randomness of the record linkage process can be generally modeled 7 via the following identity: y? = Ty (1.3) Here, T = (tjj′) n,N j=1,j′=1 is an unknown random permutation matrix, with tjj′ repre- senting the true matching status between y?j and yj′ , where y ? j is the y value that is linked to xj and yj′ is the true y value corresponding to xj′ ; that is, tjj′ = 1 if y?j and yj represent the y value for the same population unit, tjj′ = 0 otherwise. So t ?jj = 1 indicates that (xj, yj ) is correctly linked. The distribution of linkage errors depends on the characteristic of the proba- bilistic record linkage method that is actually used. Here, based on the conditional distribution of the matching status matrix T given the observed data y?, X , C , we classify the linkage mechanisms into three categories: • LCAR: The linkage is called linkage completely at random (LCAR) if the conditional distribution of T given the data y?, X and C , say f(T |y?,X ,C ;γ) does not depend on y?, X and C ; that is, f(T |y?,X ,C ;γ) = f(T ;γ) for all y?, X , C , γ , where γ denotes a vector of unknown parameters. Note that it does not mean that the linkage process is random, but the linkage process does not depends on values of y?, X and C . • LAR: The linkage is called linkage at random (LAR) if the conditional distri- bution of T given the data y?, X and C depends only X or/and C , but not on y?. That is, f(T |y?,X ,C ;γ) = f(T |X,C ;γ) for all y?, X , C , γ . • LNAR: The linkage is called linkage not at random (LNAR) if the conditional 8 distribution of T given the data y?, X and C depends on y?. However, the detailed information about record linkage may not be available to the people who perform statistical analysis. To adjust the linkage bias in the statistical analysis of the linked data, assumptions on the linkage mechanism have to be made based on the available information one can obtain about linkage errors. In secondary data analysis, researchers can only get access to the linked data generated from the record linkage process. The information about linkage errors can be obtained from a training sample of the linked data. The true matching status for each linked record pair in the training sample can be determined through clerical review. If the linked data and the training sample are available to researchers, there is a scope for correcting the linkage bias in the statistical analysis. Neter et al. (1965) discussed this secondary analysis using an audit sample. In the context of understanding the effects of low-level radiation data on cancer death rate, Lahiri (1995) and Krewski et al. (2001) suggested analysis of the Cox proportional hazard model using information contained in a sample to correct for linkage error biases. More recently, following Neter et al. (1965), Chambers (2009) put forward a variety of methods for different secondary data analyses that use a sample to correct for linkage error biases. Following the work of Chambers (2009), researchers advanced the secondary data analysis of the linked data in several different directions; see e.g., Chambers et al. (2009), Chipperfield et al. (2011), Kim and Chambers (2012a,b, 2013), Samart and Chambers (2014), Dasylva (2014), Chipperfield and Chambers (2015), and Chambers and Kim (2016). Kandari and Lahiri (2016), following up 9 on Lahiri (1995), suggested a theory of predicting a function misclassified binary variables using information from a sample. However, due to the limited information about the linkage process in secondary data analysis, researchers typically assume that the linkage is LAR or LCAR (limited to dependence on X only). In primary data analysis, researchers can get access to not only the linked data but also some summary information generated during the record linkage pro- cess, such as values of matching fields, values of comparison vectors, the matching weights (such as the likelihood ratio score), the estimated linkage probabilities, and so on. This detailed information can assist researchers to learn more about the linkage mechanism, and can be potentially used to correct the linkage bias in the statistical procedures. Scheuren and Winkler (1993, 1997) showed how to use record linkage process information in correcting the linkage bias of the ordinary least squares (OLS) estimator of the regression coefficient in a standard multiple linear regression model. Their approach involves first estimating an analytical expression of the bias of the OLS using the record linkage process information and then ap- plying the estimated bias correction to the OLS. Lahiri and Larsen (2005) obtained an exact unbiased estimator of regression coefficients by deriving the expected value of the linked response variable when linkage errors are uncorrelated with the true response given the comparison vectors. Hof and Zwinderman (2012) followed up on the Lahiri-Larsen approach and showed how to extend it to link multiple files or when one-to-one matching is not desired. In primary data analysis, with the additional information about the linkage process, researches are able to build more sophisticated linkage error models by assuming that the linkage depends on C , X , 10 or both (LAR). 1.2.2 Linkage Error Model: Chambers (2009) In this part, we introduce the linkage error model proposed by Chambers (2009) as an example of using the LCAR linkage mechanism and a training sample in secondary data analysis. Chambers (2009) developed a linkage error model under the following assump- tions: (1) The linked data is obtained by combining two files Fy and Fx. Fy and Fx contain the observed values of y and x, respectively, for all the units of the same population of size N , without duplicate. Hence, Fy and Fx are of the same size N . (2) The records in files Fy and Fx are partitioned into G blocks, with Ni records in block i, without error. So linkage errors only occur within the same block. (3) The resulting linkage is complete (i.e., all records are linked) and one-to-one between Fy and Fx. Let xij denote the observed value of x for record j in block i from Fx, let yij denote the true value of y corresponding to xij, and let y ? ij denote the value of y that is recorded in block i of Fy and linked to x Ni ij. Let y i = (yij)j=1 and y?i = (y ? ij) Ni j=1 denote the vector of true y values and the vector of linked y values in block i corresponding to X = (xT )Nii ij j=1, respectively. Under the above assumptions, Chambers (2009) modeled the randomness of the outcome of the linkage process via 11 the identity y?i = T iy i, i = 1, · · · , G. where T = (ti )Ni,Nii jj′ j=1,j′=1 is an unknown random permutation matrix of dimension N T Ti×Ni with 1N T i = 1N and T i1N = 1N , and the T i are independently distributedi i i i across blocks. Following Neter et al. (1965), Chambers (2009) further assumed that: (4) The probability of a designated link being a true match is the same within each block. (5) The probability of a designated non-link being a true match is the same within each block. Under these assumptions, Chambers proposed the following exchangeable link- age error model: i 1− λiP (tjj = 1|data) = λi, P (tijj′ = 1|data) = ,Ni − 1 for i = 1, · · · , G, j = 1, · · · , N ′ ′i, j = 1, · · · , Ni, and j 6= j , where λi is an unknown block-specific parameter. Under the above framework for statistical analysis of the linked data, Cham- bers (2009) took account of linkage errors into the linear regression analysis of the linked data. Inspired by Scheuren and Winkler (1993), Chambers (2009) developed a bias-corrected ordinary least squares (OLS) estimator of the regression coefficient by adjusting the bias of the naive OLS estimator under the exchangeable linkage error model. Inspired by Lahiri and Larsen (2005), Chambers (2009) developed an unbiased OLS estimator and a best linear unbiased estimator (BLUE) of the 12 regression coefficient by exploiting the regression relationship between y?i and X i under the exchangeable linkage error model. Chambers (2009) also extended these ideas, developed a general estimating-equations-based theory for the regression anal- ysis using linked data by correcting the bias of the estimating functions under the proposed exchangeable linkage error model, and applied the theory to the linear and logistic regressions. Subsequently, Kim and Chambers (2012a, 2012b, 2013) extended the methodology to accommodate the situation where the linked data is produced by linking more than two files and the linkage is incomplete. Smart and Chambers (2014) proposed a method for estimating the regression coefficient in a nested-error linear regression model when linked data is used. Other related articles include Chambers et al. (2009), Chipperfield and Chambers (2015), Chambers and Kim (2016). Remark 1: The exchangeable linkage error model is based on the assumption that the linkage mechanism is LCAR. Chambers (2009) realized that it was probably the simplest way to characterize the behavior of a probability-based record linkage process, and that more sophisticated models could be formulated with additional information. Evidently, the optimal estimators derived from the proposed estimating equations under the exchangeable linkage error model will not be optimal under a more complex linkage error model, such as the one proposed by Scheuren and Winkler (1993), which allows the probability of correct linkage to vary both within and across blocks. Remark 2: The estimators of the regression coefficient mentioned above are unbiased in the sense that they are unbiased when the block-specific parameters λi, 13 i = 1, · · · , G (and variance component parameters if they are involved) are known. In practice, however, these block-specific parameters are unknown and need to be estimated by using a clerically-reviewed training sample of the linked data for each block. The block-specific parameters λi can be simply estimated by the sample pro- portions of the correctly-linked record pairs in block i, i = 1, · · · , G. Therefore, the unbiasedness and efficiency of the proposed estimators of the regression coefficient depends on the accuracy of the estimated block-specific parameters. The estimate of λi can be unreliable if there are not enough samples in block i, which occurs often in the literature of small area estimation. 1.2.3 Linkage Error Model: Scheuren and Winkler (1993) Here, we introduce the linkage error model proposed by Scheuren and Winkler (1993) as an example of using the LAR linkage mechanism and summary information from the record linkage process in primary data analysis. Using assumptions (1) and (2) in Section 1.2.2, the linkage error models pro- posed by Scheuren and Winkler (1993) can be generally rewritten as: y?i = T iy i, i = 1, . . . , G. where T i = (t i Ni,Ni jj′)j=1,j′=1 with T i1N = 1N , and T i are independent across blocks.i i The linkage error model proposed by Scheuren and Winkler (1993) allows the probability of being a true match to vary across record pairs. They generally assumed that: P (ti ijj′ = 1|data) = qjj′ , 14 ∑ where Nij′=1 q i jj′ = 1, i = 1, . . . , G, j = 1, . . . , Ni, and j ′ = 1, . . . , Ni. Based on the available information about the linkage process, more specific assumptions can be made on the linkage mechanism to simplify the estimation of the probabilities qijj′ . As illustrated in Scheuren and Winkler (1993), one can assume that the probability of a record pair (j, j′) in block i being a true match only depends on its corresponding matching weight rijj′ , which can be derived from the comparison vector cijj′ . That is, P (t i jj′|data) = P (ti ijj′|rjj′). In this case, the method proposed by Belin and Rubin (1991) can be used to estimate probabilities qijj′ by fitting a two-class Gaussian mixture model to the transformed matching weights. Note that a clerically-reviewed training sample is also required to estimate the unknown parameters involved in the transformation. Under the linkage error model, Scheuren and Winkler obtained an unbiased estimator of the regression coefficient in a multiple linear regression model by adjusting the linkage bias of the naive OLS estimator. Following Scheuren and Winkler (1993), Lahiri and Larsen (2005) assumed that the probability of a record being a true match depends only on its comparison vectors cijj′ . That is, P (t i i i jj′ |data) = P (tjj′ |cjj′). Assuming that the comparison vectors follow a two-class mixture model, estimation of probabilities qijj′ reduces to the estimation of unknown parameters in the mixture model. The maximum likelihood estimates of the mixture model parameters can be approximated by using the Expectation-Maximization algorithm. By exploiting the relationship between the linked y values and the true x values, Lahiri and Larsen developed an exact unbiased estimator of the regression coefficient in a general linear model. Remark 1: The exchangeable linkage error model proposed by Chambers 15 (2009) can be treated as a special case of the one proposed by Scheuren and Winkler (1993), where qijj = λi and q i jj′ = (1 − λi)/(Ni − 1), i = 1, . . . ,m, j = 1, . . . , Ni, j′ = 1, . . . , Ni, j ′ 6= j, which are estimated using a sample drawn from the linked data file. In contrast, in Chapter 2 we propose a model where the probability of correct linkage could vary both within and across blocks. Moreover, the number of parameters is reduced by exploiting data on matching weights rijj′ or comparison vectors cijj′ through the mixture model and hence the parameters are estimated efficiently using data from many blocks. The difference between the two approaches can be attributed to the fact that Chambers (2009) focused on the secondary analysis of linked data while we focus on a method that can be directly applied to two separated data sets to be linked. 1.3 Discussion and Overview of the Dissertation In Chapter 1, we have presented a brief overview of the record linkage tech- niques for data integration. We have introduced the first statistical framework for record linkage, and the optimal decision rules used for designating record pairs into links and non-links, as a special example. In addition, we have discussed the effects of linkage errors on statistical analysis and emphasized the importance of taking ac- count of linkage errors into statistical analysis. The existing methods for correcting the linkage bias are discussed, and several examples are provided. In Chapter 2 and Chapter 3, we provide a methodological framework for the regression analysis using data from two different files. Especially, we are interested 16 in estimating the regression parameters related to the conditional distribution of the response variable y given the predictors x. Rather than separating regression analysis from record linkage as most existing methods do, we propose a general integrated model to combine these two processes, based on the assumption that the sample units in one file is a subset of those in the other file. We also provide a general class of estimating equations that can produce different estimators corrected for the linkage bias. A jackknife method is then adapted to estimate the bias, variance and mean squared error of our estimators. Our methodology can be widely applied to the general linear regression models, the generalized linear regression models, and the general linear mixed models, as long as observations on y are independent across blocks given x. To illustrate our methodology, we implement our general methodology to two special situations where the linear and logistic regression models are of the focus of research. In Chapter 4 and Chapter 5, we focus our research on small area estimation using data from two files. Specifically, we are interested in predicting an area-specific parameter, which can be expressed as a function of fixed and mixed effects. A new linkage error model is developed to combine the small area model with the record linkage model. Its difference from the previously proposed linkage error model is discussed. Under the modified general integrated model, we provide the general methodology for obtaining the Empirical Best Prediction (EBP) estimator of the parameter of interest and for estimating its mean squared error. To illustrate our methodology for small area estimation, we consider the situation where the general linear mixed model with block-diagonal covariance structure is used as the unit-level 17 small area model. The nested-error linear model is discussed as a special example. In Chapter 6, we devise a Monte Carlo simulation study to compare differ- ent estimators, and we investigate the performance of the standard and simplified jackknife methods. In Chapter 7, we offer some scope for future research. 18 Chapter 2: Regression Analysis of Data from Two Files 2.1 Introduction In Chapter 2, we provide a methodological framework for statistical analysis using data from two different files. Specifically, we are interested in estimating the regression parameters related to the conditional distribution of the response variable y given the predictors x. We propose a general integrated model that takes account of linkage errors in the analysis of a wide range of variables—discrete and continuous. We also provide a general class of systems of estimating equations that can produce various estimators corrected for the linkage bias. A jackknife method is then adapted to estimate the bias, variance and mean squared error of our estimators. Moreover, we also introduce some simplified versions of the proposed estimators and the standard jackknife method in order to reduce the computational burden. Application of our methodology only requires observations of the response variable y to be independent across blocks given predictor x. So it is not limited to the observations related to mutually independent population units, but can be used for observations corresponding to units that are independent across blocks, such as residents in a county, patients in a clinic, or students in a school. 19 2.2 Problem Description and Data Availability Let y represent a scalar random variable of interest, and let x represent a vector-valued variable of order p. Our goal is to model the relationship between y and x in a population U . In particular, we are interested in estimating the regression parameters associated with the conditional distribution of y given x. However, the joint observations on (y,x) are not available. Instead, observations on y and observations on x are separately recorded in two files Fy and Fx, but the matching status between any record from Fy and any record from Fx is unknown. To be specific, Fy contains the observed values of y for a sample Sy of n units from U , Fx contains the observed values of x for a sample Sx of N units from U , and there is no duplicate in either file. In this dissertation, we assume that Sy ⊂ Sx. The data layout for files Fy and Fx is shown in Table 2.1. Here, ỹj denotes the value of y for record j in Fy, xj′ denotes the value of x for record j ′ in Fx, and yj′ denote value of y corresponding to x ′ , where j = 1, . . . , n, j′j = 1, . . . , N . Since yj′s exist but are not observed in Fx. their corresponding column in Fx is shaded in gray. The records in Fy are not aligned to those in Fx, so ỹj and yj′ may not represent the y-values for the same population unit even if j′ = j. Assume that there also exists a vector of K matching fields, denoted by w, whose observations are available in both files. Let w̃j and wj′ represent the values of w for record j in Fy and record j ′ in Fx, respectively. It is also sufficient to assume that only the values of comparison vector c, cjj′ , are available for each record pair (j, j′), j ∈ S ′y, j ∈ Sx. 20 Let ỹ = (ỹj) n j=1 denote the n×1 vector of observed y values in Fy,X = (xTj′)Nj′=1 denote the N×p matrix of observed x values in F Nx,(y =)(yj′)j′=1 denote the unknownn N × 1 vector of y values (asso)ciated with X , W̃ = w̃ T j denote the n×K matrixj=1 N of w values in Fy, W = w T j′ ′ denote the N ×K matrix of w values in Fj =1 x, and C denote the Nn×K matrix of comparison vectors derived from comparing W̃ and W . In summary, our observed data are {ỹ ,X ,W̃ ,W }, or equivalently, {ỹ ,X ,C}. Table 2.1: Data layout for observations on y and x in Fy and Fx: Fy contains observed values of variables w and y for a sample Sy of size n, Fx contains observed values of variables w and x for a sample Sx of size N , and Sy ⊂ Sx. Note that the true y values corresponding to x (shaded in gray) exist but are not observed in Fx. Label wT y Label wT xT y 1 w̃T ỹ T T1 1 1 w1 x1 y1 · · · · · · · · · · · · · · · · · · · · · Fy Fj w̃T xj ỹj j ′ wT Tj′ xj′ yj′ · · · · · · · · · · · · · · · · · · · · · n w̃T ỹ N wTn n N x T N yN 2.3 General Integrated Model for Regression Analysis In this section, we propose a general integrated model to propagate the un- certainty of the linkage process in the later estimation step under the assumption of data availability described in Section 2.2. The general integrated model involves three important components: a regression model, a linkage error model and a mix- ture model. The regression model is used to characterize the relationship between the response variable y and the predictor x, the linkage error model is used to characterize the randomness of the linkage process, and the mixture model on com- 21 parison vectors is used to estimate the probability of a record pair being a match given the observed data and designate all record pairs into links and non-links. In the following part, we introduce each component one by one. 2.3.1 Regression Model Assume that values of (y,x) for units in the population U follow a general regression model and the model holds for all sampled units in Sx. To illustrate the methodology, we assume that E(y|X ) = µ(X ;β), V ar(y|X ) = V (X ;β,τ ). (2.1) Here, β is a p× 1 vector of unknown coefficient parameters, τ is an h× 1 vector of other unknown variance components, µ(X ;β) = (µj′(X ;β)) N j′=1 is an N × 1 vector, and V (X ;β,τ ) = (v (X ;β,τ ))N,Nj′t′ j′=1,t′=1 is an N ×N matrix, where µj(·) and vj′t′(·) are known functions. Three simple examples are given below: Example 1: For the linear regression model y|X ∼ N(Xβ, σ2eIN) where IN is an identity matrix of dimension N×N , we have µ(X ;β) = Xβ , V (X ;β,τ ) = σ2eIN , and τ = σ2e . Example 2: For the logistic regression model where yj(′s are) independ(ent an)d identically distributed with P (yj′ = 1|xj′) = g(xT Tj′β) = exp xj′β /[1 + exp xTj′β ], we have the mean com[ponent µj′(]X ;β) = g(x T j′β) and the covariance component vj′t′(X ;β,τ ) = g(x T T ′ ′ j′β) 1− g(xj′β) for j = t and vj′t′(X ;β,τ ) = 0 for j′ 6= t′. Example 3: For a nested-error linear model y i = X iβ + vi1N + ei wherei iid 2 indv 2i ∼ N(0, σv), ei ∼ N(0N , σeIN ), vi is independent of ei, 1N is an N × 1 vectori i i i 22 of ones, 0N is a matrix of zeros of dimension Ni × Ni, IN is an identity matrix ofi i d∑imension Ni × Ni, and Ni is the number of units in group i, i = 1, . . . , G, withG i=1Ni = N , we have µ(X ;β) and V (X ;β,τ ) are block-diagonal with the ith block values µi(X ;β) = X iβ , V i(X ;β,τ ) = σ 21 T 2 2 2 Tv N 1N + σi i eIN and τ = (σi v , σe) . 2.3.2 Linkage Error Model As discussed in Chapter 1, most of the existing linkage error models are built directly on the linked data, based on the assumptions that (1) the linked data is obtained by linking two files of the same size that contain observations on all population units of U , and (2) the resulting linkage is complete for Fx, i.e., each record in Fx has a designated link selected from Fy. In reality, however, the two files used for analysis usually come from different sources, such as survey samples and administrative records, and thus their coverage of the population is different. Typically, the units covered by Fy are a subset of those covered by Fx, as described in our dissertation. Also, only the designated links obtained from the linkage process are contained in the linked data file. In practice, the decision rules for most record linkage techniques rely on the specified threshold values. Seldom, the choice of the threshold values can lead to a complete linkage for Fx. In this way, information about the linkage error carried by the designated non-links is ignored. Here, we develop a new linkage error model that allows different file sizes as long as Sy ⊂ Sx. The model is built directly on data from the original files by exploiting the relationship between the observed y values in Fy and the unobserved 23 y values corresponding to the observed x in Fx. The linkage error model we proposed here is the key to the general integrated model, serving as a connection between the regression model introduced in Section 2.3.1 and the record linkage model, which will be described in Section 2.3.3. Under the assumption that there is no duplicate in each file and that Sy ⊂ Sx, a specific observed y value in file Fy, say ỹj, and one of these unobserved y values from the set {yj′ : j′ = 1, . . . , N} must be related to the same population unit. Let ljj′ be the unknown binary matching status indicator for record pair (j, j ′), such that ljj′ = 1 if record j in Fy and record j ′ in Fx represent the same population unit, and l ′jj′ = 0 otherwise, j ∈ Sy, j ∈ Sx. Then the relationship between ỹj and {yj′ : j′ = 1, . . . , N} can be modeled via the following identity: ∑N ỹj = ljj′yj′ , j = 1, . . . , n. (2.2) j′=1 Let L = (l )n,Njj′ j=1,j′=1. Then the above model (2.2) can also be written in the following matrix form: ỹ = Ly (2.3) In other words, ỹ, the observed y values for all sampled units in Sy, is a n-permutation of y, the unobserved y values for all sampled units in Sx. In this dissertation, we assume that the linkage mechanism is at random (LAR). That is, the conditional probability of L given y, X , and C could depend on X and C but not on y, i.e., P (L|y,X,C ) = P (L|X,C ). Here, specifically, we assume that the probability of a record pair being a true match only depends on its 24 comparison vector. That is, P (L|y,X,C ) = P (L|C ). 2.3.3 Mixture Model Following Larsen and Rubin (2001), we assume that the comparison vectors fol- low a two-class mixture model. Jaro (1989), Winkler (1993, 1994, 1995), Thibaudeau (1993), and Armstrong and Mayda (1993) used mixture models in record linkage problems. The two-class mixture model on comparison vectors is motivated by the idea that patterns of agreement and disagreement on matching fields would have different distributions among matches M = {(j, j′) : ljj′ = 1, j ∈ Sy, j′ ∈ Sx} and mismatches M c = {(j, j′) : l ′jj′ = 0, j ∈ Sy, j ∈ Sx}. The comparison vectors cjj′ are assumed to be independent and identically distributed with the following probability mass function: P (cjj′) = πP (cjj′ |ljj′ = 1) + (1− π)P (cjj′|ljj′ = 0), where π = P (ljj′ = 1) represents the probability of a record pair being a match, P (cjj′|ljj′ = 1) and P (cjj′|ljj′ = 0) are the probabilities of observing cjj′ among matches M and among mismatches M c, respectively. Under the conditional independence assumption as shown in (1.2), the above mixture model is simplified into: ∏K c ∏K′ P (c ) = π m jj k(1−m )(1−c c ′′ jj′k)jj k k + (1− π) u jj k k (1− u (1−cjj′k) k) . (2.4) k=1 k=1 where cjj′k is the kth element of cjj′ , mk = P (cjj′k = 1|ljj′ = 1) and uk = P (cjj′k = 1|ljj′ = 0) are the probabilities of a record pair agreeing on matching fields k among 25 matches and mismatches, respectively. Let ψ = (π,m1, . . . ,mK , u1, . . . , uK) T denote the vector of unknown parameters in the mixture model. By Bayes’ Rule, the conditional probability of a record pair being a match given the observed comparison vector i∏s given by: π K c m jj ′k(1−m )(1−cjj′k) P (l k=1 k jj′ = 1|cjj′ ;ψ) = ∏ kK c ′ ∏π m jj k(1−m )(1−cjj′k) c ′k=1 k k + (1− π) K u jj k(1− u )(1−cjj′k)k=1 k k := qjj′ . Note that qjj′ = qjj′(cjj′ ;ψ) is a function of cjj′ and ψ. Let Q(ψ) ≡ Q(C ;ψ) = (q )n,Njj′ j=1,j′=1. Then we have E(L|y,X,C ) = E(L|C ) = Q(ψ). The maximum likelihood estimator of ψ can be obtained using the expecta- tion maximization (EM) (Dempster, Laird, and Rubin 1977) and the expectation conditional maximization (ECM) (Meng and Rubin 1993) algorithms. 2.3.4 Designation of Links and non-Links As mentioned before, it is not necessary to produce a linked file in the middle of the record linkage process for the purpose of parameter estimation in our case. If the primary goal is to generate a linked data set for secondary data users, the estimated probabilities can be used to partition the record pairs into designated links and non-links and to estimate error rates. The decision rule is similar to Fellegi and Sunter’s method. A record pair (j, j′) is declared as a link if the probability qjj′ is above a pre-specified upper threshold. Other than qjj′ , one can also consider use one of the following as a matching weight: 26 ∏ c| K jj′ (1−cP (c ′ l k jj′k) (1) likelihood ratio: R (ψ) = jj jj ′=1) ′ = k=1 mk (1−mk) jj P (c cjj′ |ljj′=0) ∏K ′u jj k (1−c(1−u ) jj′k)k=1 k k P (l ′=1|c ′ ) q ′ (2) posterior likelihood ratio: r (ψ) = jj jj jjjj′ = .P (ljj′=0|cjj′ ) 1−qjj′ In our dissertation, based on the assumption that Sy ⊂ Sx, it is reasonable to assume that the record linkage is complete for Fy; that is, each record in Fy has a linked record from Fx. Thus, we designate record pairs as links and non- links based on the following decision rule: for any record j in Fy, a record j ′ in Fx is selected to be its link if its corresponding probability qjj′ is the largest among {qjt : t = 1, . . . , N}, j = 1, . . . , n. By using this decision rule, we can generate a linked dataset, which contains data (ỹ , X̃ ?). Here, X̃ ? = (x̃?T nj )j=1 denote the n× p matrix of x values which are selected from Fx and linked to ỹ, and x̃ ? j denotes the selected x value from Fx that is linked to ỹj in Fy, j = 1, . . . , n. Therefore, X̃ ? in the linked data set is an n-permutation of X in Fx. That is, X̃ ? = AX where A = (ajj′) n,N j=1,j′=1 is an n × N permutation matrix with A1N = 1n. Here, ajj′ = 1 if qjj′ is the largest among {qjt : t = 1, . . . , N}, ajj′ = 0 otherwise. Note that A is derived from probabilities qjj′ , which are functions of comparison vectors cjj′ and mixture model parameters ψ. Hence, when ψ is known and C is observed, A is fixed. Therefore, X̃ ? is fixed given X , C and ψ. 2.4 Estimation of Regression Coefficients Assuming that y is conditionally independent of C given X , the conditional mean and variance of ỹ given X and C can be derived under the general integrated 27 model. That is, E(ỹ|X,C ) = Q(C ;ψ)µ(X ;β), V ar(ỹ|X,C ) = Σ(X,C ;ψ,β, τ ). (2.5) where Σ ≡ Σ(X,C ;ψ,β, τ ) = (σjt)n,nj=1,t=1 with diagonal entries σjj and off-diagonal entries σjt (j 6= t) equal to ∑N ∑N ∑N ( ) σjj = σjj(X,C ;ψ,β, τ ) = q 2 jj′qjt′vj′t′ + qjj′(1− qjj′) vj′j′ + µj′ , j∑′=1∑t′=1 j′=1N N σjt = σjt(X,C ;ψ,β, τ ) = qjj′qtt′vj′t′ , (t 6= j). j′=1 t′=1 The detailed proof of (2.5) is shown in Section 2.7.1. When ψ is known, merely based on (2.5), we can estimate β by solving the following class of system of p unbiased estimating equations: β̂ : f(β,τ ,ψ) = H (β,τ ,ψ) [ỹ −Q(ψ)µ(β)] = 0p, (2.6) where H (β,τ ,ψ) ≡H (C,X ;β,τ ,ψ) is a given p× n matrix which does not depend on ỹ, and 0p is a p × 1 vector of zeros. The possible choices for matrix H (β,τ ,ψ) includes but are not limited to X̃T ,XTQT , andXTQTΣ−1. The choices ofH (β,τ ,ψ) for the linear and logistic regressions will be discussed in the next chapter. When ∂f(β,τ ,ψ) exits, an application of Taylor series expansion yields ∂β ∂f(β,τ ,ψ) ( ) f(β̂ , τ ,ψ) ≈ f(β,τ ,ψ) + β̂ − β . ∂β Based on the fact that f(β̂ , τ ,ψ) = 0p, E [f(β,τ ,ψ)] = 0p, and V ar (f(β,τ ,ψ)|X,C ) = H (β,τ ,ψ)Σ (ψ,β, τ )HT (β,τ ,ψ), 28 we can obtain that [ ] | ≈ ∂f(β,τ ,ψ) −1 [ ] E(β̂ X,C) [ ] E f(β̂ , τ ,ψ)− f(β,τ ,ψ) + β =(β[, ] )(2.7)∂β T | ≈ ∂f(β,τ ,ψ) −1 ∂f(β,τ ,ψ) −1 V ar(β̂ X,C) H (β,τ ,ψ)Σ (ψ,β,τ )HT (β,τ ,ψ) , ∂β ∂β when the matrix ∂f(β,τ ,ψ)∂β is invertible at the true value of β . The detailed proof of (2.7) is given in Section 2.7.2. It implies that the resulting estimator β̂ from solving (2.6) is (approximately) unbiased for β when other parameters are unknown. When the estimating equations in (2.6) are used, the resulting estimator β̂ may depend on the unknown variance component τ if the selected matrix H (β,τ ,ψ) depends on τ . In that case, methods for estimating τ need to be considered. When additional assumptions about the regression model are made, such as normality, other unbiased estimating functions f(β,τ ,ψ) can be derived to estimate β and τ simultaneously when ψ is known. For example, the maximum likelihood estimator of β and τ can be obtained by using the first partial derivatives of the log-likelihood function as the estimating functions. An example is given for the linear regression case in the next chapter. In order to simplify the methodology, one may replaceQ in the estimating equations f(β,τ ,ψ) by QM (or QM2), which is a simplified version of Q with all entries in each row set to zeros except the largest one (or two). For known ψ, let β̂F (ψ) denote an estimator of β obtained as a solution to (2.6) for a given choice of H (β,τ ,ψ). The corresponding estimator of β when Q is replaced by QM (or QM2) in (2.6) is denoted by β̂M (ψ) (or β̂M2(ψ)). When ψ is unknown, one can use β̂F (ψ̂) to estimate β by replacing ψ with one of its consistent estimators ψ̂. The corresponding estimator of β when Q is replaced by QM (or QM2) in (2.6) is denoted by β̂M (ψ̂) (or β̂M2(ψ̂)). In this dissertation, the maximum likelihood estimator of ψ is used as ψ̂, and it can also be treated as a solution of 29 a system of estimating equations. An estimator of the variance of β̂(ψ̂) can be obtained by plugging in the estimates β̂ , τ̂ and ψ̂ in the variance formula from (2.7). But we do realize that this plug-in variance estimator would underestimate V ar(β̂(ψ̂)|X,C) since it does not take account of the variability of τ̂ and ψ̂. In the next section, a resampling method of estimating V ar(β̂(ψ̂)|X,C) is given for the case where the measurements are uncorrelated across blocks, and we leave the variance estimation for the correlated-across-blocks case for future research. Now, we consider the situation where blocking is used during the record linkage process. The records in Fy and Fx can be partitioned into G blocks based on some basic characteristics, such as zip code, first letter of last name, or first three digits of phone numbers. Let ni and Ni be ∑the number of sa∑mple units in block i within Sy and Sx, respectively, i = 1, . . . , G. So G Gi=1 ni = n and i=1Ni = N . Let ỹij denote the value of y for record j in block i from Fy, xij′ denote the value of x for record j ′ in block i within Fx, and yij′ denote its corresponding y value, i = 1, . . . , G, j = 1, . . . , ni, j ′ = 1, . . . , Ni. We denote the vector of values ỹij′ and yij′ within block i by ỹ i = (ỹ ni ij)j=1 and y i = (yij′) Ni j′=1, respectively. Similarly, we denote the matrix of x values in block i by X = (xT )Nii ij′ j′=1. Then the regression model shown in (2.1) can be rewritten as: E(y i|X i) = µi(X i;β), V ar(y i|X i) = V i(X i;β,τ ), (2.8) for i = 1, . . . , G. Here, we assume that y i are independent across blocks given X . We assume that there is zero probability that one record from Fy and another record from Fx represent the same population unit if they are from different blocks. Therefore, only records within the same blocks need to be compared, and linkage errors can only occur within blocks. Let lijj′ denote the matching status of record pair (j, j ′) in block i, and Li = (l i )ni,Nijj′ j=1,j′=1 denote its corresponding matrix. The linkage error model in (2.3) 30 can be simplified into: ỹ i = Liy i, i = 1, . . . , G. (2.9) In other words, L = diag(L1, . . . ,LG). Let cijj′ be value of the comparison vector c derived from values of matching fields w̃ ′ij and wij′ for record pair (j, j ) in block i. The two-class mixture model in (2.4) can be re-written as ∏K Kci ′ (1−ci ∏ i) c ′ (1−ci ) P (ci ) = π m jj kjj′ k (1−m ) jj ′k k + (1− π) u jj kk (1− uk) jj ′k . (2.10) k=1 k=1 Let qijj′ = P (l i i i ni,Ni jj′ = 1|cjj′) and Qi = (qjj′)j=1,j′=1, then E(Li|C i) = Qi. Therefore, in case of blocking, when y i’s are independent across blocks, the estimat- ing equations for β (and τ ) given known ψ can be generally written are ∑G fi(β,τ ,ψ) = 0t, (2.11) i=1 where t is equal to p for estimating β or (p + h) for estimating β and τ . In particular, fi(β,τ ,ψ) = H i(β,τ ,ψ) [ỹ i −Qi(ψ)µi(β)] for (2.6). 2.5 Variance Estimation As mentioned above, when the mixture model parameter ψ is known, estimate of β (and τ ) can be obtained by solving the following system of estimating equations. That is, ∑G fi(β,τ ,ψ) = 0t, (2.12) i=1 In order to estimate the bias, variance, and mean squared error of an estimate β̂(ψ), the unified jackknife theory proposed by Jiang, Lahiri and Wan (2002), henceforth referred to as JLW, can be used. Jackknife replicate i is obtained by deleting data from block i in 31 both files Fx and Fy, (i = 1, . . . , G). The delete-i estimates of β , β̂−i(ψ), and the delete-i estimate of τ , τ−i(ψ), are the solutions of ∑G β̂−i(ψ), τ−i(ψ) : fi′(β,τ ,ψ) = 0t, (2.13) i′=6 i for i = 1, . . . , G. The jackknife estimate of bias, variance and mean squared error of β̂ , when ψ is known, are then given by ( ) ( ) ¯̂ biasJ (β̂(ψ)) = (G− 1) β(ψ)− β̂(ψ) ,− ∑GG 1 ( )( )− ¯̂ ¯̂ TvarJ β̂(ψ) = β̂−i(ψ) β(ψ) β̂−i(ψ)− β(ψ) , ( ) G ∑i=1− GG 1 ( )( )T mseJ β̂(ψ) = β̂−i(ψ)− β̂(ψ) β̂−i(ψ)− β̂(ψ) . G ∑ i=1¯̂ where β(ψ) = 1 GG i=1 β̂−i(ψ) is the average of the replicate estimates of β . The bias, variance and mean squared error of τ̂ (ψ) can also be estimated similarly. In practice, however, the mixture model parameter ψ is unknown. The maximum likelihood estimate (MLE) of ψ, say ψ̂, can be obtained using the EM algorithm. The MLE ψ̂ can also be treated as the solution of a system of estimating equations derived from the log-likelihood function based on the distribution of comparison vectors cijj′ . In order to account for uncertainty of ψ̂, ψ should be replaced by ψ̂ and ψ̂−i in (2.12) and (2.13), respectively, where ψ̂−i is the delete-i estimate of ψ by removing values of comparison vectors in block i, i = 1, . . . , G. Then the bias, variance, and mean squared error of β̂(ψ̂) can be estimated. The properties of β̂(ψ̂) are expected to be similar to those of β̂(ψ) if ψ̂ is assumed to be independent of the response variable y; that is, the distribution of the matching variables (e.g., last name, phone number) is assumed to be independent of the response variable y (e.g., income) and hence of ỹ. This is true in many applications. The bias, variance and mean squared error of any smooth function of β can be proposed in a 32 straightforward way. For large G, under regularity conditions, asymptotic properties of β̂(ψ̂) and the jackknife estimators proposed in this section can be obtained from the unified theory on jackknife given in Jiang et al. (2002). To reduce the computational burden, a simplified jackknife method can be used by replacing the delete-i estimate ψ̂−i by its full sample estimate ψ̂. Our simulation results show that the accuracy of the variance estimate would not be jeopardized much even though the uncertainty of ψ̂ is ignored. 2.6 Summary In chapter 2, we introduce a general methodology for regression analysis when data values are from two different files. Rather than separating regression analysis from record linkage as most existing methods do, we connect the regression model and the record linkage model through our proposed new linkage error model. The general integrated model can be implemented when the sample units in one file are a subset of those in the other file. The y values observed in Fy are related to the x values observed in Fx through the integrated model, and standard statistical analysis methods can be applied for parameter estimation. For the purpose of parameter estimation, there is no need to generate a linked file in the middle of the process. Information about linkage errors carried by all record pairs (links and non-links) can all be passed into the estimation process and used to correct for linkage bias. This is where our model is different from the secondary data analysis where only the designated links are considered. Essentially,, parameter estimation starts with deriving the conditional distribution of the observed y values in file Fy given the observed x values in Fx and comparison vectors. Based on their relationship, estimators can be obtained by solving a system of estimating equations. In case of blocking, if data values are independent across blocks, the 33 jackknife resampling method proposed by Jiang, Lahiri, and Wan (2005) can then be used to estimate the bias, variance, and mean squared errors of the estimators, taking account of both estimation errors and linkage errors. In the following chapter, we will give two specific examples to illustrate our methodology. 2.7 Proofs 2.7.1 Proof of (2.5) Based on the assumption that P (L|ỹ ,X,C) = P (L|C), it can be proved from the mixture model that ljj′ are conditionally independent given C with P (ljj′ = 1|y,X,C) = P (ljj′ = 1|C) = P (ljj′ = 1|cjj′) = qjj′ . Therefore, for j = 1, . . . , n, j′ = 1, . . . , N , t = 1,. . . , n, t ′ = 1, . . . , N , we have [ ] q ′ ′jj′ if j = t and j = t E(ljj′ |y,X,C) = qjj′ , E ljj′ ltt′ |y,X,C =  . (2.14)qjj′qtt′ otherwise Let Q = (qjj′) n,N j=1,j′=1, then E(L|y,X,C) = Q (2.15) Now, we consider the first-order and second-order conditional expectation of ỹ given y, X and C . Combined result (2.15) with the linkage error model ỹ = Ly, we can get E [ỹ|y,X,C ] = E [Ly|y,X,C ] = E [L|y,X,C ]y = Qy. ∑ Since ỹ = Nj j′=1 ljj′yj′ under the linkage error model, the (j, t)th entry of the matrix ỹỹ T is equal to ∑ (∑ )N N (ỹỹT )jt =  ljj′y ′j ltt′yt′ , j = 1, . . . , n, t = 1, . . . , n. j′=1 t′=1 34 By applying the results in (2.14), we can calculate the (j, j) diagonal entries and (j, t) off-diagonal entries of E[ỹỹT |y,X,C ].T(hat is, for j)= 1, . . . ,n and t = 1, . . . , n,∑N  ∑NE[ỹỹT |y,X,C ]jj = E ljj′yj′ ljt′yt′ |y,X,C ∑∑j′=1 [ t ′=1 N N ] = yj′yt′E ljj′ ljt′ |y,X,C j∑′=1 t′=1N ∑ [ ] ∑N [ ] = yj′yt′E ljj′ ljt′ |y,X,C + yj′yj′E ljj′ ljj′ |y,X,C j∑′=1 t∑′=6 j′ ∑ j′=1N N = yj′yt′q 2 jj′qjt′ + yj′qjj′ j∑′=1 t∑′ 6=j′N N ∑j′=1N = y ′y ′q 2j t jj′qjt′ + yj′qjj′(1− qjj′), (2.16) j′=1 t′=1 j′=1 ∑ (∑ ) N NE[ỹỹT |y,X,C ]jt = E ljj′yj′ l tt′yt′ |y,X,C ∑N ∑j′=1 ′N [ t =1 ] = yj′yt′E ljj′ ltt′ |y,X,C j∑′=1 tN ∑′=1N = yj′yt′qjj′qtt′ , (j =6 t). j′=1 t′=1 Assuming that the response variable y is conditionally independent of comparison vector c given x, we can derive the first-order and second-order expectation of y given x and c from the regression model (2.1). That is, E(yj′ |X,C) = E(yj′ |X ) = µ ′ ,[ ] [ ] j E yj′yt′ |X,C = E yj′yt′ |X = vj′t′ + µj′µt′ (2.17) for j′ = 1, . . . , N , t′ = 1, . . . , N . Based on results (2.16) and (2.17), the first-order and second-order of ỹ given X 35 and C can be derived by applying law of total expectations. That is, E[ỹ|X,C ] = E (E[ỹ|y,X,C ]|X,C) = E (Qy|X,C) = QE (y|X,C) = Qu, ( ) E[ỹỹT |X,C ]jj = EE[ỹỹT |y,X,C ]j,j |X,C∑N ∑N ∑ N= E yj′yt′qjj′q 2jt′ + yj′qjj′(1− qjj′)|X,C ∑ j′=1 t′=1 j′=1N ∑N ( ) ∑N ( ) = qjj′qjt′E yj′yt′ |X,C + qjj′(1− qjj′)E y2j′ |X,C j∑′=1 t′=1N ∑N [ ] ∑j′=1N [ ] = qjj′qjt′ vj′t′ + µj′µt′ + qjj′(1− qjj′) vj′j′ + µ2j′ , j′=(1 t′=1 ) j′=1 E[ỹỹT |X,C ]jt = EE[ỹỹT |y,X,C ]j,t|X,C∑N ∑ N= E yj′yt′q jj′qtt′ |X,C ∑∑j′=1 t′=1N N ( ) = qjj′qtt′E yj′yt′ |X,C j∑′=1 t∑′=1N N ( ) = qjj′qtt′ vj′t′ + µj′µt′ , (t 6= j). j′=1 t′=1 for j = 1, . . . , n, t = 1, . . . , n. By applying the identity V ar(ỹ|X,C) = E[ỹỹT |X,C ] − E[ỹ|X,C ]E[ỹ|X,C ]T , the diagonal and off-diagonal entries of V ar(ỹ|X,C) are given by: ( ) V ar(ỹ|X,C) T Tjj = E[ỹỹ |X,C ] − E[ỹ|X,C ]E[ỹ|X,C ]∑N ∑ jj [ ] ∑ jjN N [ ] = qjj′q 2 jt′ vj′t′ + µj′µt′ + qjj′(1− qjj′) vj′j′ + µj′ ∑j′=1N ∑t′=1 j′=1N − qjj′qjt′µj′µt′ j∑′=1 t∑′=1N N ∑N [ ] = qjj′qjt′vj′t′ + qjj′(1− qjj′) vj′j′ + µ2j′ := σjj , j′=1 t′=1 j′=1 36 ( ) V ar(ỹ|X,C) Tjt = E[ỹỹ |X,C ]jt − E[ỹ|X,C ]E[ỹ|X,C ]T∑N ∑N ( ) ∑∑ jtN N = qjj′qtt′ vj′t′ + µj′µt′ − qjj′qtt′µj′µt′ j∑′=1 tN ∑′=1 j′=1 t′=1N = qjj′qtt′vj′t′ := σjt, (t =6 j), j′=1 t′=1 for j = 1, . . . , n and j = 1, . . . , n. Let Σ ≡ Σ(X,C ;ψ,β,τ ) = (σ n,njt)j=1,t=1, then the conditional mean and variance of ỹ given X and C can be written in the following matrix form: E(ỹ|X,C) = Q(C ;ψ)u(X ;β), V ar(ỹ|X,C) = Σ(X,C ;ψ,β,τ ). 2.7.2 Proof of (2.7) By using Talor expansion, the estimating function can be approximated by ≈ ∂f(β,τ ,ψ) ( ) f(β̂ , τ ,ψ) f(β,τ ,ψ) + β̂ − β . ∂β Thus, when the matrix ∂f(β,τ ,ψ)∂β is invertible at the true value of β , we can have[ ] − ≈ ∂f(β,τ ,ψ) −1 [ ] β̂ β f(β̂ , τ ,ψ)− f(β,τ ,ψ) . ∂β Based on the fact that E [ỹ|X,C ] = Q(ψ)µ(β) and V ar (ỹ|X,C) = Σ (ψ,β,τ ), we 37 can get E [f(β,τ ,ψ)|X,C ] = E [H (β,τ ,ψ) (ỹ −Q(ψ)µ(β)) |X,C ] = H (β,τ ,ψ) (E [ỹ|X,C ]−Q(ψ)µ(β)) = 0p, V ar (f(β,τ ,ψ)|X,C) = V ar (H (β,τ ,ψ) [ỹ −Q(ψ)µ(β)] |X,C) = H (β,τ ,ψ)V ar (ỹ|X,C)HT (β,τ ,ψ) = H (β,τ ,ψ)Σ (ψ,β,τ )HT (β,τ ,ψ). Combing the(above results with the fact f(β̂ , τ ,ψ) = 0p, we can get[ ] ∂f(β,τ ,ψ) −1 [ ] ) E(β̂ |X,C) ≈ E f(β̂ , τ ,ψ)− f(β,τ ,ψ) + β |X,C [ ∂β ] ∂f(β,τ ,ψ) −1 = [0p − E (f(β,τ ,ψ|X,C))] + β ∂β = β, V ar(β̂ |X,C) = V[ ar(β̂ − β |X],C) ( ) ∂f(β,τ ,ψ) −1 ( ) [ ] T≈ − ∂f(β,τ ,ψ) −1V ar f(β̂ , τ ,ψ) f(β(, τ ,ψ)|X,C[ ∂β ] ) ∂β ∂f(β,τ ,ψ) −1 [ ] T ∂f(β,τ ,ψ) −1 = V ar (f(β,τ ,ψ)|X,C) [ ∂β ] ∂β ([ ] )T ∂f(β,τ ,ψ) −1 ∂f(β,τ ,ψ) −1 = H (β,τ ,ψ)Σ (ψ,β,τ )HT (β,τ ,ψ) . ∂β ∂β 38 Chapter 3: Applications to Linear and Logistic Regression To illustrate our general methodology for regression analysis using data from two files as described in Section 2.2, we consider two special situations where regression parameters in linear and logistic models are of interest. Here, we use the same notation as Chapter 2. 3.1 Linear Regression using Data from Two Files Assume that the values of y and x for all sampled units in block i of Sx satisfy the following model: E(y i|X i) = X iβ, V ar(y i|X i) = σ2eIni , i = 1, . . . , G, (3.1) where σ2e is an unknown constant parameter. Note that values yij in block i are uncorre- lated and have the same variance σ2e (homoscedasticity). When data (y i,X i) is available for each block, the Ordinary Least Squares (OLS) estimator is the best linear unbia(sed estimat)or (B(LUE), and∑ −1 ∑ ) it is given by: G G β̂ = XT Ti X i X i y i (3.2) i=1 i=1 However, the above estimator cannot be used to estimate β in our case since the y i’s are not observed. If a linked data file is generated based on the decision rule described in 2.3.4 during the record linkage process and data (ỹ i, X̃ ? i ) is available, one may simply assume the linkage 39 is perfect, replace X ?i and y i in (3.2) by X̃ i and ỹ i, and obtain a naive OLS estimator β̂N (ψ). That is, (∑ ) ( )G −1 ∑G β̂ (ψ) = X̃ ?TX̃ ? X̃ ?TN i i i ỹ i , (3.3) i=1 i=1 which can be treated as the solution of the following system of estimating equations: ∑G ( ) β̂N (ψ) : X̃ ?T ỹ − X̃ ?i i iβ = 0p. i=1 In Section 3.3.2, we prove that the mean and variance of β̂N (ψ) under the general inte- grated model are ( ) (∑ )G −1(∑ )G E β̂N (ψ)|X,C,ψ = X̃ ?Ti X̃ ?i X̃ ?Ti QiX i β, (3.4) ( ) [∑i=1 ] i=1G −1 [∑ ][ ]G ∑G −1 V ar β̂N (ψ)|X,C,ψ = X̃ ?TX̃ ? ?T ? ?Ti i X̃ i ΣiX̃ i X̃ i X̃ ?i . i=1 i=1 i=1 Based on the result, we can see that β̂N (ψ) is an biased estimator of β . In order to correct the linkage bias and obtain a more robust estimator of β , we exploit the relationship between ỹ i and X i, C i under the general integrated model. By using the linear model (3.1) as the first component of the general integrated model, the conditional mean and variance of ỹ i given X i and C i can be derived under the assumption that y i is independent of C i given X i . That is, for i = 1, . . . , G, E(ỹ i|X i,C i) = QiX iβ, V ar(ỹi |X i,C i) = Σi. (3.5) Here, Σ = (σi )ni,nii jt j=1,t=1 is the ni × ni variance-covariance matrix with diagonal element σijj and off-diagonal element σ i jt (j 6= t) equal to ∑Ni ∑Ni ∑Ni σi = qi σ2 + qi (1− qi )(xT β)2, σi = qi qi 2jj jj′ e jj′ jj′ ij′ jt jj′ tj′σe . j′=1 j′=1 j′=1 The detailed proof for (3.5) is given in Section 3.3.1. It is consistent with the general result shown in (2.5). Note that Σi depends on β , σ 2 e and ψ. When compared to the 40 distribution of y i given X i in (3.1), ỹ i also follows a linear regression model of β , but the design matrix changes fromX i toQiX i and the variance matrix changes from σ 2 eIni to Σi. The unequal diagonal entries and the non-zero off-diagonal entries of Σi imply that values ỹij in Fy have different variances and are correlated within blocks (heteroscedasticity). When ψ is known, several different estimators of β can be developed based on the relationship between ỹ and X , C . These estimators include the bias-corrected estimator β̂C(ψ), the ordinary least squares estimator β̂OLS(ψ), the weighted least squares estimator β̂ 2WLS(ψ, σe), and the maximum likelihood estimator β̂MLE(ψ). The development of the bias-corrected estimator β̂C(ψ) starts with investigating the bias of the naive OLS estimator β̂N (ψ) conditional on values X and C . By using the fact that E(ỹ i|X i,C i) = QiX iβ , we prove(in Section 3.3.2 that( ) ∑ )−1(∑ )G G E β̂N (ψ)|X,C = X̃ ?Ti X̃ ? X̃ ?Ti i QiX i β ∑ i=1 i=1 If the matrix G X̃ ?Ti=1 i QiX i is invertible, an unbiased linear estimator β̂C(ψ) of β can be obtained by adjusting the bia(s of β̂N (ψ). Th∑ ) at is(, ) G −1 ∑G β̂ (ψ) = X̃ ?TQ X X̃ ?TC i i i i ỹ i , i=1 i=1 which can be treated as the solution of the following system of estimating equations: ∑G β̂C(ψ) : X̃ ?T i (ỹ i −QiX iβ) = 0p. i=1 In Section 3.3.2, we prove that the mean and variance of β̂C(ψ) are equal to: E(β̂C(ψ)|X,C) = β[,∑ ]−1 [∑ ][∑ ] (3.6) G G G −1 V ar(β̂C(ψ)|X,C) = X̃ ?Ti QiX X̃ ?Ti i ΣiX̃ ?i XTi QTi X̃ ?i . i=1 i=1 i=1 Moreover, by directly utilizing this linear regression relationship between ỹ i and X i, C i as shown in (3.5), we can also use the ordinary least squares method to obtain an linear 41 unbiased estimator β̂OLS(ψ) of β . That is, ∑G β̂OLS(ψ) = arg min (ỹ i −QiX iβ)T (ỹ −Q X β)( i i iβ∑ i=1 )G −1(∑ )G = XTi Q TQ X XTQTi i i i i ỹ i , i=1 i=1 which can be treated as the solution of the following system of estimating equations: ∑G β̂ (ψ) : XTQTOLS i i (ỹ i −QiX iβ) = 0p. i=1 The mean and variance of β̂OLS(ψ) are equal to E(β̂OLS(ψ)|X,C) = β[,∑ ] (3.7) G −1 [∑ ][ ]G ∑G −1 V ar(β̂OLS(ψ)|X,C) = XT T T T T Ti Qi QiX i X i Qi ΣiQiX i X i Qi QiX i . i=1 i=1 i=1 We do realize that the OLS estimator β̂OLS(ψ) is not the best linear unbiased estimator of β since ỹij are correlated within blocks under the general integrated model. One may consider to use the weighted least squares method to obtain the best linear unbiased estimator of β by using the inverse of the variance-covariance matrix Σi as the weight matrix. That is, ∑G β̂ 2WLS(ψ, σe) = arg min (ỹ i −QiX iβ)TΣ−1i (ỹ i −QiX iβ) (3.8)( β∑ i=1 ) ( )G −1 ∑G =6 XTQTi i Σ−1i QiX i X T T −1 i Qi Σi ỹ i (3.9) i=1 i=1 To obtain β̂WLS(ψ, σ 2 e), we take partial derivatives of the sum of weighted squares with respect to βk (k = 1, . . . , p), and set the partial derivatives to zeros. Then the weighted least squares (WLS) estimator β̂ 2WLS(ψ, σe) of β is obtained by solving the following set of estim{ating equations:∑G } β̂WLS(ψ, σ 2 e) : 2δ T T T T −1 −1 kX i Qi + (ỹ i −QiX iβ) Σi Di,k Σi (ỹ i −QiX iβ) = 0, (3.10) i=1 42 for k = 1, . . . , p. The detailed proof is given in Section 3.3.3. Here, δ = ∂βk ∂β is the kthk column of the identity matrix I of dimension p× p, and D = ∂Σip i,k ∂β is an ni × ni matrixk ∂σi with diagonal entries jj∂β andk ∂σi ∑Nijj ∂σijt = 2 qi ijj′(1− qjj′)(xTij′β)xij′kδk, = 0,∂βk j′ ∂βk =1 We can see that the WLS estimator β̂WLS(ψ, σ 2 e) obtained from optimizing (3.8) is not the best linear unbiased estimator that we expect as shown in (3.9). This is mainly because the variance-covariance matrix Σi is not free of β . For the same reason, the resulting β̂ 2WLS(ψ, σe) may not possess the nice properties of the weighted least squares estimator of β obtained in the case where the variance-covariance matrix of the linear regression model is free of β . For example, (1) there is no close-form expressions for β̂WLS(ψ, σ 2 e), and β̂WLS(ψ, σ 2 2 e) is not a linear estimator; (2) β̂WLS(ψ, σe) is not identical to the MLE estimator, which can be seen by comparing their estimating equations. In addition, β̂ (ψ, σ2) depends on parameter σ2WLS e e , which is usually unknown and need to be estimated. Otherwise, β̂WLS(ψ, σ 2 e) cannot be evaluated even if ψ is known. Under the assumption of normality, ỹ i|X i,C i ∼ N(QiX iβ,Σi), we can also derive the maximum likelihood estimator (MLE) of β and σ2e simultaneously when ψ is known. The log-likelihood function of β and σ2e based on data {ỹ i,X i,C i, i = 1, . . . , G} is given by: ∑m1 { − }l(β , σ2e) = − ni ln(2π) + ln |Σi |+ (ỹ i −QiX iβ)TΣ 1i (ỹ i −QiX iβ) .2 i=1 The MLE estimators β̂MLE(ψ) and σ̂ 2 e(ψ) can be treated as solutions of the following 43 system {of estimating equations:∑G ( − ) [ ] }tr Σ 1i Di,k − 2δTXTk i QTi + (ỹ i −QiX iβ)TΣ−1 −1i Di,k Σi (ỹ i −QiX iβ) = 0 ∑i=1G { ( tr Σ− ) } 1 i Di,σ − (ỹ i −QiX iβ) TΣ−1i Di,σΣ −1 i (ỹ i −QiX iβ) = 0 (3.11) i=1 (for k )= 1, . . . , p. The detailed proof of (3.11) is in Section 3.3.4. Here, D = ∂Σii,σ =∂σ2e i ni,ni∂σjt 2 denotes the partial derivative of Σi with respect to σ 2 ∂σ e with e j=1,t=1 ∂σi ∑Ni ∂σi ∑Nijj jt = qijj′ , = q i qi 2 2 jj′ tj′ , (t =6 j). ∂σe ′ ∂σj =1 e j′=1 Remark 1: Here, based on the linear relationship between ỹ and X , C under the general integrated model, we derive four different estimators for the regression coefficient β in a multivariate linear regression model: β̂C(ψ), β̂OLS(ψ), β̂ 2 WLS(ψ, σe), and β̂MLE(ψ). Note that all these four estimators depend on ψ. When ψ is unknown, one can estimate β by substituting ψ with its maximum likelihood estimate ψ̂, which can be obtained by the expectation-maximization algorithm. Besides ψ, β̂ (ψ, σ2) also depends on σ2WLS e e . One may consider to use the linked data (ỹ , X̃ ) to obtain an estimate of σ2e . An estimator of σ2e under the exchangeable linkage error model is given by Chambers (2009). Remark 2: In order to estimate the variances of estimators β̂N (ψ̂), β̂C(ψ̂), β̂OLS(ψ̂), β̂ 2WLS(ψ̂, σ̂e), and β̂MLE(ψ̂), we can simply replacing the unknown parameters β , σ 2 e , and ψ with β̂ , σ̂2e , and ψ̂ in the formula of their corresponding theoretical variances of β̂N (ψ), β̂C(ψ), β̂OLS(ψ), β̂ 2 WLS(ψ, σe), and β̂MLE(ψ). The expression for the theoretical variances of β̂N (ψ), β̂C(ψ) and β̂OLS(ψ) are given in (3.4), (3.6), and (3.7), respectively. However, the variability of σ̂2e , and ψ̂ would be ignored in this way. Since these estimators of β and the maximum likelihood estimator of ψ can all be treated as solutions to a system of estimating equations, the jackknife method proposed by Jiang, Lahiri, and Wan (2005) can then be used to estimate their bias, variance, and mean squared error. 44 Remark 3: As shown above, there is no closed-from expressions for the WLS estimator β̂WLS and the MLE estimator β̂MLE . So numerical algorithms, such as Newton- Raphson method and Fisher scoring algorithm, are needed to find the solutions to the estimating equations. Initial values are required for these numerical algorithms. The naive OLS estimate β̂N (ψ̂), bias-corrected estimate β̂C(ψ̂), and OLS estimate β̂OLS(ψ̂) can be chosen as the initial values. 3.2 Logistic Regression using Data from Two Files Assuming there is no sampling bias, the logistic regression model also holds for all samples units in Sx. That is, yij′ are indepen(dent )and identically distributed with exp xT ′β P (y ′ = 1|xT ij ij ij′ ;β) = g(x T ′ ij′β) = ( ) , i = 1, . . . , G, j = 1, . . . , Ni. ( ) 1 + exp x T ij′β Ni Here, let g(X iβ) = g(x T ij′β) denote the Ni × 1 vector of means. j′=1 When the joint observations (yij′ ,xij′) are available, we can estimate β by using the maximum likelihood method. The MLE estimate β̂ of β is the solution of the following estimating equations: ∑G β̂ : XTi (y i − g(X iβ)) = 0p. i=1 When data values are from two different files and a linked data set with values (ỹ i, X̃ ? i ) (i = 1, . . . , G) is produced by any record linkage process, we can simply ignore the linkage errors in the linked file and obtain a naive MLE estimator β̂N (ψ) of β : ∑G ( ) β̂N (ψ) : X̃ ?T ỹ − g(X̃ ?i i iβ) = 0p. (3.12) i=1 However, the existence of linkage errors can lead to significant bias of the estimators, this is probably because they weaken the relationship between y and x. Similarly to the 45 linear regression case, we can correct the linkage bias by utilizing the relationship between ỹ and X , C under the general integrated model. Applying the fact that E(yij′ |xT Tij′) = g(xij′β),[ ] V ar(yij′ |xTij′) = g(xTij′β) 1− g(xTij′β) , Cov(yij′ , yit′ |X i) = 0, for i = 1, . . . , N , j′ = 1, . . . , N , t′ = 1, . . . , N , t′i i i 6= j′, we follow the steps in Section 3.3.1 and obtain the conditional mean and variance of ỹ i given X i, C i under the general integrated model: E(ỹ i|X i,C i) = Qig (X iβ) , V ar(ỹi |X i,C i) = Σi. (3.13) Here, Σ = (σi ni,nii jt)j=1,t=1 is the ni×ni variance-covariance matrix depending on parameters β and ψ with diagonal element σi and off-diagonal element σijj jt (j 6= t) equal to: ∑Ni [ ] ∑N σi i T i T i i i T Tjj = qjj′g(xij′β) 1− qjj′g(xij′β) , σjt = qjj′qtj′g(xij′β)[1− g(xij′β)]. j′=1 j′=1 Note that under the general integrated model, ỹ i does not follow a generalized linear model anymore given X i and C i, and the ỹij are correlated within each blocks. Based on the fact that E(ỹ i|X i,C i) = Qig (X iβ), we can obtain a set of unbiased estimating equations by adjusting the bias of the estimating function used for the naive MLE estima[tor, as shown in (3.12). Noting]that∑G ( ) ∑G ( ) E X̃ ?T ỹ − g(X̃ ?i i iβ) |X i,C ?T ?i = X̃ i Qig(X iβ)− g(X̃ iβ) , i=1 i=1 the unbiased estimating equations for the bias-corrected estimator β̂C(β) are given by: ∑G β̂C(β) : X̃ ?T i (ỹ i −Qig(X iβ)) = 0p. (3.14) i=1 46 More generally, when ψ is known, we can focus on the following system of p unbiased estimating equations: ∑G H i [ỹ i −Qig(X iβ)] = 0p, (3.15) i=1 where H i ≡H i(C i,X i;β,ψ) is a given p× n matrix. The possible choices for H i includes X̃ ?Ti , X̃ ?T T i Qi , and X̃ ?TQTΣ−1i i i . 3.3 Proofs 3.3.1 Proof of (3.5) ∑ Under the linkage error model ỹ i = Liy i, we have ỹ = Ni ij j′=1 l i jj′yij′ . Assum- ing the linkage is at random (LAR), it is true that P (li ijj′ = 1|y i,X i,C i) = P (ljj′ = 1|C i i i i ii) = P (ljj′ = 1|cjj′) = qjj′ under the mixture model. Thus, E(ljj′ |y i,X i,C i) = qjj′ , Cov(li , li |y ,X ,C ) = qi (1− qi ) if j = t and j′ ′ i ijj′ tt′ i i i jj′ jj′ = t , and Cov(ljj′ , ltt′ |y i,X i,C i) = 0 otherwise. Here, i = 1, . . . , G, j = 1, . . . , n , j′i = 1, . . . , Ni, t = 1, . . . , ni, t ′ = 1, . . . , Ni. By using these facts, we can get: ∑ Ni E(ỹij |y i,X i,C i) = E lijj′yij′ |y i,X i,C i ∑ j′=1Ni = E(lijj′ |y i,X i,C i)yij′ j∑′=1Ni = qijj′yij′ , j∑′=1Ni ∑Ni ( ) Cov (ỹij , ỹit|y i,X i,C i ii) = yij′yit′Cov ljj′ , ltt′ |y i,X i,C i j′=1 t′=1 = 0, 47 V ar(ỹij |y i,X i,C i) = Cov (ỹij , ỹij |y i,X i,C i)∑ Ni ∑Ni= Cov lijj′yij′ , lijt′yit′ |y i,X i,C i ∑ j′=1 t′=1Ni ∑Ni ( ) = yij′y i i it′Cov ljj′ , ljt′ |y i,X i,C i j∑′=1 t′=1Ni ( ) = yij′yij′Cov l i i jj′ , ljj′ |y i,X i,C i ∑j′=1Ni ∑ ( ) + yij′yit′Cov l i , lijj′ jt′ |y i,X i,C i j∑′=1 t′=6 j′Ni = y2 i iij′qjj′(1− qjj′), j′=1 Based on the linear regression model (3.1) and the assumption that the response variable y is conditionally independent of comparison vector c given x, we have E(y Tij′ |X i,C i) = E(yij′ |X i) = xij′β, Cov(y 2ij′ , yij′ |X i,C i) = Cov(yij′ , yij′ |X i) = σe , Cov(yij′ , yit′ |X i,C i) = Cov(yij′ , yij′ |X i) = 0, (j′ 6= t′) for i = 1, . . . , G, j′ = 1, . . . , Ni, t ′ = 1, . . . , Ni. By applying the law of total expectation, the law of total variance, and the law of total covariance, we can get E(ỹij |X i,C i) = E[E(ỹij |y i,X∑ i ,C i)|X i,C i] N = qijj′E[yij′ |X i] j∑′=1N = qi xTjj′ ij′β, j′=1 48 V ar(ỹij |X i,C i) = E [V ar(ỹij |y i,X i,C i)|X i,C i] +V ar (E[ỹij |y i,X i,C i]|X i,C i)∑Ni ∑Ni ∑ Ni= E y2 qiij′ jj′(1− qi ′)|X i,C ijj + Cov qi i jj′yij′ , qjt′yit′ |X i,C i ∑ j′=1 j′=1 t′=1Ni [ ] ∑Ni ∑Ni ( ) = qijj′(1− qi 2jj′)E yij′ |X i ii,C i + qjj′qjt′Cov yij′ , yit′ |X i,C i j∑′=1 ( j ′=1 t′=1 Ni ) = qi i 2jj′(1− qjj′) V ar(yij′ |X i) + (E[yij′ |X i]) ∑j′=1Ni ( ) ∑Ni ∑ ( ) + qi qijj′ jj′Cov yij′ , yij′ |X i ii + qjj′qjt′Cov yij′ , yit′ |X i j∑′=1 j′=1 t′ 6=j′Ni ∑Ni ∑Ni = qijj′(1− qi 2 i i T 2 i 2 2jj′)σe + qjj′(1− qjj′)(xij′β) + (qjj′) σe + 0 j∑′=1 ∑ j′=1 j′=1Ni Ni = qi σ2 + qijj′ e jj′(1− qi T 2 ijj′)(xij′β) := σjj , j′=1 j′=1 Cov (ỹij , ỹit|X i,C i) = E [Cov (ỹij , ỹit|y i,X i,C i) |X i,C i] + Cov (E[ỹij |y i,X i,C i], E[ỹit|y i,X i,C i]|X i,C i)∑ Ni ∑Ni= 0 + Cov qijj′yij′ , qi tt′yit′ |X i,C i ∑∑ j′=1 t′=1Ni Ni ( ) = qi ijj′qtt′Cov yij′ , yit′ |X i,C i j∑′=1 t′=1Ni ( ) ∑Ni ∑ ( ) = qi qijj′ tj′Cov y i i ij′ , yij′ |X i + qjj′qtt′Cov yij′ , yit′ |X i j∑′=1 j′=1 t′=6 j′N = qi i 2 ijj′qtj′σe := σjt, (t =6 j). j′=1 Let Σi = (σ i jt) ni,ni j=1,t=1, then the above result can be written in the following matrix form: E(ỹ i|X i,C i) = QiX iβ, V ar(ỹ i|X i,C i) = Σi, i = 1, . . . , G. 49 3.3.2 Proof of (3.4), (3.6), and (3.7) In case of blocking, the x values in the linked data set are related to the x values in Fx through the following identity: X̃ ?i = AiX i, i = 1, . . . , G. where X̃ ?i is the ni × p matrix of x values linked to ỹ i in Fy, X i is the Ni × p matrix of x values in Fx, and Ai = (a i ni,Ni jj′)j=1,j′=1 is the ni ×Ni matrix of linkage status indicators, where ai ijj′ = 1 if qjj′ is the largest among probabilities {qijt′ : t′ = 1, . . . , Ni}, and aijj′ = 0 otherwise. Note that Ai is derived from Qi, which is a function of C i and ψ. Thus, Ai is fixed when C i and ψ are known, and X̃ ? i is fixed whenX i, C i and ψ are known. Also recall that E[ỹ i|X i,C i] = QiX iβ , V ar(ỹ i|X i,C i) = Σi under the general integrated model. Based on these facts, when ψ is known, the(mean and va)rianc(e of β̂N (ψ) are given by:( ) ∑ ) G −1 ∑G E β̂N (ψ)|X,C = E  X̃ ?Ti X̃ ?i X̃ ?T i ỹ i |X,C (∑ i=1 ) (∑ i=1−1 )G G = X̃ ?TX̃ ? X̃ ?T ( i i i E[ỹ i|X,C ] ∑i=1 )−1(G ∑i=1 )G = X̃ ?Ti X̃ ? i X̃ ?T i E[ỹ |X( i i ,C i] ∑i=1 ) i=1G −1(∑ )G = X̃ ?T ? ?T ( i X̃ i X̃ Q X β ∑i=1 ) ( i i i ∑i=1−1 )G G = X̃ ?Ti X̃ ? i X̃ ?T i QiX i β, i=1 i=1 50 ( ) [∑ ]−1 [G ∑ ] G V ar β̂ (ψ)|X,C = V ar X̃ ?T ? ?TN i X̃ i X̃ i ỹ i |X,C [∑ i=1 ] (∑i=1 )[ ]G −1 G ∑G −1 = X̃ ?TX̃ ?i i V ar X̃ ?T i ỹ i|X,C X̃ ?T ?[ ] [ ][ i X̃ i ∑i=1 ∑ i=1 i=1 ]G −1 G ∑G −1 = X̃ ?TX̃ ? X̃ ?TV ar (ỹ |X,C)X̃ ? X̃ ?T ? [ i i] i i i i X̃ i ∑i=1 i=1 i=1G −1 [∑ ][ ]G ∑G −1 = X̃ ?TX̃ ?i i X̃ ?T i ΣiX̃ ? X̃ ?T ?i i X̃ i . i=1 i=1 i=1 Similarly, the mean and variance of the bias-corrected estimator β̂C(ψ) are given by: [∑ ]G −1 [∑ ] GE(β̂ (ψ)|X,C) = E X̃ ?TQ ?TC i iX i X̃ ỹ i |X,Ci [∑ i=1 ]G −1∑ i=1G = X̃ ?TQ X X̃ ?Ti i i i E (ỹ i|X,C)[∑i=1 ]G −1∑i=1G = X̃ ?TQ ?Ti iX i X̃ i QiXβ i=1 i=1 = β, [∑ ]−1 [∑ ] G G V ar(β̂ (ψ)|X,C) = V ar X̃ ?TQ X X̃ ?TC i i i i ỹ i |X,C [∑ i=1 ] i=1−1 (∑ )[∑ ]G G G −1 = X̃ ?Ti QiX i V ar X̃ ?T i ỹ i|X,C XTQT ?[ ] [ i i X̃ i ∑i=1 ∑ i=1 i=1−1 ][ ]G G ∑G −1 = X̃ ?TQ X ?Ti i i X̃ i V ar (ỹ i|X,C)X̃ ? XTQTX̃ ?[ ] i i i i∑i=1 i=1 i=1G −1 [∑ ][ ]G ∑G −1 = X̃ ?Ti Q ?T iX i X̃ i ΣiX̃ ? i X T i Q T i X̃ ? i . i=1 i=1 i=1 51 Similarly, the mean and variance of the OLS estimator β̂OLS(ψ) are given by: E(β̂OLS [ (ψ)|X,C) ∑ ]−1 [∑ ] G G = E XTQT T T i i QiX i X i Qi ỹ i |X,C [∑ i=1 ] ∑ i=1G −1 G = XTQTQ T Ti i iX i X i Qi E (ỹ i|X,C)[∑i=1 ] i=1G −1∑G = XTQTQ X XTQTi i i i i i QiX iβ i=1 i=1 = β, V ar(β̂OLS[(ψ)|X,C) ∑ ]− [∑ ] G 1 G= V ar XTQTi i Q T TiX i X Q i i ỹ i |X,C [∑ i=1 ] ( i=1 )[ ]G −1 ∑G ∑G −1 = XTQTi i QiX i V ar X TQT T T [ ] [ i i ỹ i|X,C X][i Qi QiX i ∑i=1 i=1 i=1−1 ∑ ∑ ]G G G −1 = XT T T Ti Qi QiX i X i Qi V ar (ỹ i|X,C)Q X XTQTi i i i QiX[ ] [ ][ ] i∑i=1 ∑i=1 ∑ i=1G −1 G G −1 = XT Ti Qi QiX X T i i Q T i Σ Q T T i iX i X i Qi QiX i . i=1 i=1 i=1 3.3.3 Proof of (3.10) Recall that the diagonal entries σijj and off-diagonal entries σ i jt of the variance- covariance matrix Σi are equal to: ∑Ni ∑Ni ∑Ni σi = qi σ2 + qi (1− qi )(xT 2 i i i 2jj jj′ e jj′ jj′ ij′β) , σjt = qjj′qtj′σe , (t =6 j). j′=1 j′=1 j′=1 By taking the first partial derivative with respect to βk, we can get: ∂σi ∑Ni ∑Ni ijj ∂σ = 2 qi (1− qi ∂β jt′ ′)(xT i i T′β)xij′k = 2 q ′(1− q ′)(x ′β)xij′kδk, = 0, ∂β jj jj ij ∂β jj jj ijk ′ k ∂βkj =1 j′=1 52 for k = 1, . . . , p, where δk is the kth column of the identity matrix I p of dimension p× p. ∂σi Let D = ∂Σi = ( jt )ni,nii,k ∂β ∂β j=1,t=1 denote the first partial derivative of Σi with respect to βk.k k When σ2e andψ are known, the weighted least squares (WLS) estimator β̂ 2 WLS(σe ,ψ) is defined to be the value of β that minimizes the weighted sum of squares (WSS) with Σi as its weight matrix. That is, ∑G β̂WLS = arg min (ỹ i −QiX iβ)TΣ−1 2i (ỹ i −QiX iβ) := arg min fwss(β, σe ,ψ). β β i=1 Equivalently, β̂WLS can be obtained by solving the following estimating equations:( )p ∂fwss(β, σ 2 e ,ψ) ∂fwss(β, σ 2 e ,ψ)= = 0p, ∂β ∂βk k=1 where ∂f 2wss(β, σe ,ψ) ∑∂{βkG − ∂β T T T ∂Σi= X i Qi Σ −1 i (ỹ i −Q T −1 −1 iX iβ)− (ỹ i −QiX iβ) Σi Σ (ỹ i −QiX iβ)∂β ∂β i i=1 k } k − ∂β T (ỹ i −Q{iX β) TΣ−1i i QiX i ∑ ∂βkG } = − 2δTXT T −1k i Qi Σi (ỹ i −QiX iβ) + (ỹ i −QiX iβ) TΣ−1i Di,kΣ −1 i (ỹ i −QiX iβ) ∑i=1G { } = − 2δTkXTi QTi + (ỹ i −QiX iβ)TΣ−1i Di,k Σ −1 i (ỹ i −QiX iβ). i=1 3.3.4 Proof of (3.11) Under the assumption of normality, ỹ i|X i,C i ∼ N(QiX iβ,Σi). When ψ is known, the log-likelihood function of β and σ2e based on data {ỹ i,X i,C i, i = 1, . . . , G} is given by: 1 ∑G { } l(β , σ2e) = − ni ln(2π) + ln |Σ T −1i|+ (ỹ i −QiX iβ) Σi (ỹ i −QiX iβ) .2 i=1 53 ( ) i ni,n ( ) i ni,ni ∂Σ ∂σ i Let D i jt ∂Σ ∂σ i i,k = ∂β = ∂β and Di,σ = 2 = jt 2 denote the partial k k ∂σe ∂σe j=1,t=1 j=1,t=1 derivatives of Σi with respect to βk and σ 2 e , respectively, where ∂σi ∑Nijj ∂σi = 2 qi i T jt ∂β jj ′(1− qjj′)(xij′β)xij′kδk, = 0, (j 6= t) k ∂βk j′=1 ∂σi ∑Nijj ∂σi ∑Nii jt= q i ijj′ , = qjj′qtj′ , (j 6= t)∂σ2e ∂σ2j′=1 e j′=1 for i = 1, . . . , G, j = 1, . . . , ni, t = 1, . . . , ni, and k = 1, . . . , p. The first derivatives of the log-likelihood function l(β, σ2e) with respect to βk and σ 2 e are: G { ∂l(β, σ2) 1 ∑ Te − ∂ ln |Σi|= − ∂β XTQTΣ−1i i i (ỹ i −QiX iβ)∂βk 2 ∂βi=1 k ∂βk } T − − T −1∂Σi(ỹ Q −1 ∂βi i{X iβ() Σi )Σi (ỹ i −QiX iβ)− (ỹ −Q X β) TΣ−1i i i i QiX i ∑ ∂βk ∂βkG1 ∂Σi = − tr Σ−1i − 2δ TXTQTΣ−1(ỹ i −QiX iβ) 2 ∂β k i i i i=1 k } − (ỹ −Q X β)TΣ−1i i{ i i D −1 i,kΣi (ỹ i −QiX iβ) ∑G1 ( ) = − tr Σ−1D T T T −1i,k − 2δ X Q Σ (ỹ i −QiX iβ) 2 i k i i i i=1 } − (ỹ i −QiX Tiβ) Σ−1D −1i i,kΣi (ỹ i −QiX iβ) , { } ∂l(β , σ2 ∑Ge) −1 ∂ ln |Σ −1i| ∂Σ= + (ỹ i −Q T i2 2 iX iβ) (ỹ i −QiX iβ)∂σe 2 { (∂σe ) ∂σ2∑i=1 eG }1 ∂Σi ∂Σ−1 = − tr Σ−1i − (ỹ i −Q T −1 i −1 2 i X iβ) Σ 2 ∂σ i Σ ∂σ2 i (ỹ i −QiX iβ) 1 ∑i=1G { ( e) e } = − tr Σ−1i Di,σ − (ỹ i −QiX iβ) TΣ−1D −1i i,σΣi (ỹ i −QiX iβ) .2 i=1 54 3.3.5 Proof of (3.13) ∑ Under the linkage error model ỹ i = Liy i, we have ỹ = Ni i ij j′=1 ljj′yij′ . Assum- ing the linkage is at random (LAR), it is true that P (lijj′ = 1|y i,X ii,C i) = P (ljj′ = 1|C ) = P (lii jj′ = 1|ci i ijj′) = qjj′ under the mixture model. Thus, E(ljj′ |y i,X i,C i) = qijj′ , V ar(lijj′ |y i,X i,C i) = qijj′(1 − qijj′), and Cov(li i ′ ′jj′ , ltt′ |y i,X i,C i) = 0 if j 6= t or j 6= t . Here, i = 1, . . . , G, j = 1, . . . , ni, j ′ = 1, . . . , Ni, t = 1, . . . , ni, t ′ = 1, . . . , Ni. By using these facts, we can get: ∑Ni E(ỹij |y i,X ii,C i) = qjj′yij′ , j∑′=1Ni V ar(ỹij |y i,X 2 i ii,C i) = yij′qjj′(1− qjj′), j′=1 Cov (ỹij , ỹit|y i,X i,C i) = 0. Based on the logistic regression model and the assumption that the response variable y is conditionally independent of comparison vector c given x, we have E(yij′ |X i,C i) = g(xTij′β), V ar(yij′ |X i,C i) = g(xTij′β)[1 − g(xTij′β)], and Cov(yij′ , yit′ |X i,C i) = 0 for i = 1, . . . , G, j′ = 1, . . . , Ni, t ′ = 1, . . . , Ni, and t ′ 6= j′. By applying law of total expectation, law of total variance, and law of total covariance, we can get the following result for i = 1, . . . , G, j = 1, . . . , ni, t = 1, . . . , ni and t 6= j: ∑N E(ỹij |X i,C i) = qi g(xTjj′ ij′β), j∑′=1Ni ∑Ni V ar(ỹij |X i,C i) = qijj′g(xTij′β)[1− g(xTij′β)] + qijj′(1− qi 2 Tjj′)g (xij′β), j∑′=1 j′=1Ni = qi T i Tjj′g(xij′β)[1− qjj′g(xij′β)], j∑′=1N Cov (ỹij , ỹ i it|X i,C i) = qjj′qitj′g(xTij′β)[1− g(xTij′β)], (t =6 j). j′=1 55 Let Σ = (σ2i)ni,nii jt j=1,t=1, then the above result can be written in the following matrix form: E(ỹ|X i,C i) = QiX iβ, V ar(ỹ|X i,C i) = Σi, i = 1, . . . , G. 56 Chapter 4: Small Area Estimation with Data from Two Files 4.1 Introduction In this chapter, we focus our research on a specific area: small area estimation. Specifically, we are interested in predicting an area-specific parameter, which can be ex- pressed as a function of fixed effects and random effects related to the conditional dis- tribution of the variable of interest given the auxiliary variables. We provide a general methodology for small area estimation using data from two files. Similar to the regres- sion analysis using data from two files, we propose a general integrated model for small area estimation, where a new linkage error model is developed to connect the unit-level small area model and the record linkage model. The empirical best prediction estimation of the area-specific parameter is considered under the general integrated model. To es- timate its mean squared error, two jackknife methods are provided. Application of the general methodology is not limited to the mutual independence of measurements. It can be applied to measurements that are correlated within small areas but independent across small areas. Unit-level models such as the general linear model with correlated sampling errors within small areas, the general linear mixed model with nested errors can all be considered. 57 4.2 Small Area Estimation Small area estimation refers to the methodology and techniques used to improve estimation precision for sub-populations (small areas, domains), where the sample sizes for some sub-populations are not large enough to provide a reliable direct estimate (such as a sample mean) with adequate precision. Small area estimation is widely used to provide small area estimates to support policy making, regional planning, and fund allocation at government agencies. For exam- ple, the Small Area Income and Poverty Estimates (SAIPE) program established by the U.S Census Bureau provides estimates of median household income and poverty rate of school-aged children at state and county levels. The government utilizes these small area estimates to allocate federal funds to states and domains within each state. The basic idea of small area estimation is to increase effective sample size by bor- rowing strength from values of the variable of interest from other related areas. Models are used to link related small areas through the use of auxiliary information related to the variable of interest, such as census and administrative records. The small area models can be generally classified into two types: (1) area-level models that relate direct estimates to area-specific and/or area-specific auxiliary variables, and (2) unit-level models that relate the unit values of the study variable to unit-specific auxiliary variables. In the dissertation, unit-level models are of the focus of the research. Most of the existing unit-level-model-based small area estimation methods rely on the joint observations on the variable of interest y and the auxiliary variable x. The corresponding data layout for each small area is shown in Table 4.1. This type of data can be obtained in a couple of ways: (1) A sample of y is obtained by sampling from a 58 population frame with known x; (2) Two separate files (one containing the observations of y and the other containing the observations of x) are linked without any errors. For example, a survey data set can be perfectly linked to an administrative data set if there exists a unique and error-free identifier. Under this data layout, a huge literature on small area estimation is available. We refer reader to Rao and Molina (2015). However, the auxiliary information may not be recorded in the same file as the variable of interest, but is available in an administrative data set. Data integration could be a potential approach to cut down costs in data collection by preventing the need to collect new survey data with all necessary information. In this chapter, we provide a general methodology for small area estimation in the case where observations on y and x are recorded in two different files, and the matching status between records is unknown. Table 4.1: Data layout for the joint observations on (y,x): Values of y are available for a sample of ni units in small area i, values of x are available for a finite population of size Ni in small area i, and the values are aligned, i = 1, . . . ,m. Label y xT 1 yi1 x T i1 · · · · · · · · · ni y T in xi ini · · · · · · N Ti xiNi 4.3 General Integrated Model for Small Area Estimation 4.3.1 Problem Description and Data Availability Suppose that the population of interest U can be partitioned into m subpop- ulations (small areas) Ui, i = 1, · · · ,m. Let y and x denote a scalar variable of 59 interest and a vector random variable of dimension p, respectively. Our goal is to estimate an area-specific parameter θi, which can be expressed as a function of fixed effects β and random effects v related to the conditional distribution of y given x, say θi = h(β,v), where h(·) is known. However, the joint observations on (y,x) are not available. Instead, observations on y and observations on x are separately recorded in two files Fy and Fx, respectively, and the matching status between any record from Fy and any record from Fx is unknown. To be specific, Fy (Fx) contains the observed values of y (x) for a sample Sy (Sx) of size n (N) selected from U . There is no duplicate in either file and Sy ⊂ Sx. We assume that the records in both files can be partitioned into small areas without error. Therefore, there is zero probability that two records (one from Fy and the other from Fy) represent the same unit if they are in different small areas. The data layout for files Fy and Fx is shown in Table 4.2. Let ỹij denote the observed value of y for record j in small area i from Fy, let x̃ij denote the unobserved value of x corresponding to ỹij, and let xij′ represent the observed value of x for record j′ in small area i from Fx. Since x̃ij exists but is not observed in Fy, hence its corresponding column in Fy is shaded in gray. Note that the records in Fy are not aligned to those in Fx, so ỹij and yij′ may not represent the y values for the same population unit even if j′ = j. Other than y and x, there also exists a vector of K matching fields, denoted by w, whose observations are available in both files. Let w̃ij and wij′ represent values of matching fields w for record j and record j ′ in small area i from Fx and Fy, respectively. It is also sufficient to assume that only the value of comparison vector cijj′ is available for each record pair (j, j ′) in small area 60 i. Here, i = 1, · · · ,m, j = 1, · · · , ni, j′ = 1, · · · , Ni. Let ỹ i denote the ni × 1 vector of observed y values in small area i from Fy, let X̃ i denote the ni × p matrix of unobserved x values corresponding to ỹ i, let X i denote the Ni× p matrix of observed x values in small area i from Fx, let y i denote the unobserved Ni × 1 vector of y values associated with X i, let W̃ i denote the ni×K matrix of w values in small area i from Fy, let W i denote the Ni×K matrix of w values in small area i from Fx, and C i denote Nini ×K matrix of comparison vectors derived from comparing W̃ i and W i. In summary, our observed data are {ỹ i,X i, W̃ i,W i : i = 1, · · · ,m}, or equivalently, {ỹ i,X i,C i : i = 1, · · · ,m}. Table 4.2: Data layout for observations on y and x for each small area in Fy and Fx: Fy contains observed values of variables w and y for a sample Sy of ni units in small area i, Fx contains observed values of variables w and x for a sample Sx of Ni units in small area i, and Sy ⊂ Sx. Note that the true x values corresponding to y exists but are not observed in Fy (shaded in gray). Label wT y xT Label wT xT 1 w̃T ỹ T T Ti1 i1 x̃i1 1 wi1 xi1 · · · · · · · · · · · · · · · · · · · · · Fy Fj w̃T ỹ x̃T xij ij ij j ′ wTij′ x T ij′ · · · · · · · · · · · · · · · · · · · · · n T T T Ti w̃in ỹin x̃i i in Ni i wiN xi iNi 4.3.2 General Integrated Model for Small Are Estimation The general integrated model for small area estimation contains three compo- nents, similar to that for the regression analysis. The record linkage model (mixture model) stays the same, but the regression model has been replaced by a unit-level small area model and a new linkage error model is used. 61 Unit-Level Small Area Model: In our dissertation, we focus our research on those unit-level small area models for which the values of y are assumed to be independent across small areas. Suppose that values of (y,x) for units in the population U follow a unit-level small area model, which can be generally written in the following form: y ij = g(xij, v i, eij;φ), i = 1, · · · ,m, j ∈ Ui, (4.1) where g(·) is a known function, v i is a vector of area-specific random effects, eij is a sampling error, and φ is a vector of unknown parameters. Note that yij = g(xij, eij;φ) is a special case of the above general small area model without random effects. Two examples of the unit-level small area models are given below: Example 1: Linear Regression Model with Common Regression Coefficients: y = x′ij ijβ + eij, i = 1, · · · ,m, j ∈ Ui, iid where β is a vector of unknown regression coefficients, and eij ∼ N(0, σ2e). In this case, φ = (β, σ2e) T . W∑hen the population size N p i of small area i is large, the small area mean ȳ = 1i p j∈U yij can be trea∑ted as a linear function of β under theNi i model; that is, ȳ ≈ x̄Ti i β , where x̄ 1i = Np j∈U xi ij is the population mean of x ini small area i. Example 2: Nested Error Linear Regression Model: yij = x T ijβ + v i + eij, i = 1, · · · ,m, j ∈ Ui, (4.2) where v i is a vector of area-specific random effects and eij is the error term, which takes account of any unexplained variation not taken care of by the other terms of 62 the above mixed model. It is often assumed that vi’s and eij’s are independently iid iid distributed with v 2 2i ∼ N(0, σv) and eij ∼ N(0, σe). Here, φ = (βT , σ2, σ2v e). When Npi is large, ȳi can be treated as a linear function of fixed effects β and random effect vi under the model; that is, ȳi ≈ x̄Ti β + vi. Battese et al. (1998) implemented the model to estimate areas covered by corn (or soybeans) for m = 12 counties in north central Iowa using the farm-interview data as y and the LANDSAT satellite data (the number of pixels classified as corn and soybeans) as x. The standard method of estimating small area mean ȳi is described in Section 5.3.1 for the situation where joint observations (yij,xij) is available for all sampled units and the population mean for each small area x̄i is known. Here, we further assume that the model holds for the sampled units in Sy. The assumption is satisfied if the sample design is non-informative, that is, the sample selection probabilities do not depend on values of y but may depends on values of x. Then we have ỹ ij = g(x̃ij, v i, eij;φ), i = 1, · · · ,m, j = 1, · · · , ni. Linkage Error Model: Recall that the linkage error model used in Chapter 2 is developed by exploiting the relationship between the observed y values in Fy and the unobserved y values corresponding to x values in Fx. Here, we develop another linkage error model based on the same idea but with a focus on the relationship between the observed and unobserved x values. Under the assumption that (1) there are no duplications in both files, (2) Sy ⊂ 63 Sx, and (3) there is no error in partitioning records into small areas, the unobserved x value corresponding to ỹij in Fy, x̃ij, must be one of those observed x values {xi1, · · · ,xiN } in the same small area i from F ix. Let ljj′ be the unknown binaryi matching status indicator for record pair (j, j′) in small area i; that is, lijj′ = 1 if record j from Fy and record j ′ from Fx in small area i represent the same population unit, and lijj′ = 0 otherwise. Then the relationship between x̃ij and {xi1, · · · ,xiN }i can be modeled via the following identity: ∑ x̃ij = l i jj′xij′ , i = 1, · · · ,m, j = 1, · · · , ni. j′=1 Let L = (li )niNii jj′ j=1,j′=1 be the ni×Ni matrix of matching status indicators, then the above linkage error model can be written in the following matrix form: X̃ i = LiX i, i = 1, · · · ,m. (4.3) In other words, the unobserved x values for sampled units of Sy in small area i, is a permutation of ni observed x values for sampled units of Sx in the same small area. More generally, the following linkage error model can be used: X̃ = LX. (4.4) where X̃ is the n× p matrix of unobserved x values for units in Sy, X is the N × p matrix of observed x values for units in Sx, and L = (l nN jj′)j=1,j′=1 is the n × N matrix of matching status indicators, where ljj′ represents the matching status for record pair (j, j′), with j ∈ S ′y, j ∈ Sx. This general model (4.4) can be used when the records are partitioned into small areas with errors, and all the possible record pairs (Nn in total) need to be considered for the purpose of record linkage. 64 The linkage error model (4.3) is a special case of the general model (4.4) with L = diag(L1, · · · ,Lm). In most situations, both the linkage error model built on y values and the one built on x values can be used for the purpose of regression analysis or small area estimation. However, in some situations, it is easier to implement our proposed general methodology when the former model is used than when the latter model is used. The opposite may be true in other situations. Below are two examples. For the purpose of comparison, we follow the notation in this chapter. Example 1: Consider the situation where a logistic regression model is used as the first component of the general integrated model. It is easier to calculate the con- ditional expectation of the observed y values in Fy given x values in Fx and compari- son vector c when the linkage error model built on y values is used. When the linkage error model built on y values is used, calculation of the condition(al expec)tation isn quite simple. (That)is, E(ỹ i|X( i,C)i) = Qig(X iβ), where g(X iβ) = g(x Tβ) iij andj=1 g(xTijβ) = exp x T ijβ /[1+exp x T ijβ ]. However, when the linkage error model built on x values is used, E(ỹ i|X i,C i) = E [g(L(iX∑iβ)|X i,C i], w)he[re g(LiX(i∑β) is a ni×1 v)ec]- tor with the jth element equal to exp Ni li T Ni i Tj′=1 jj′xij′β / 1 + exp j′=1 ljj′xij′β , whose conditional expectation is difficult to calculate. Example 2: Consider the situation where a linear model with common re- gression coefficient and equal variance is used as the first component of the general integrated model. The conditional covariance of observed y values in Fy given x values in Fx and comparison vectors c has a simpler format when the linkage model built on x values is used. When the linkage error model built on y values is used, 65 under certain assumptions, the (j, j)th diagonal and the (j, t)th off-diagonal entries of V ar(ỹ i|X i,C i) are given by ∑Ni ∑Ni ∑Ni σi i 2 i i T 2 i i i 2jj = qjj′σe + qjj′(1− qjj′)(xij′β) , and σjt = qjj′qtj′σe . j′=1 j′=1 j′=1 In contrast, when the linkage error model built on x values is used, under certain assumptions, ∑Ni σi = qijj jj′(1− qi Tjj′)(xij′β)2 + σ2e , and σijt = 0. j′=1 When it comes to small area estimation, general linear models and general linear mixed models are widely used as unit-level small area models. So we choose to use the linkage error model built on X̃ , that is, X̃ i = LiX i, in order to obtain a simple format of the conditional covariance matrix, since X̃ i = LiX i can only affect the fixed effects term but not the random effects and mixed error terms. 4.4 Empirical Best Prediction Estimator Under the squared error loss, the best prediction (BP) estimator θ̂BPi of θi is equal to the conditional expectation of θi given the data under the general integrated model when both φ in the small area model and ψ in the mixture model are known; that is θ̂BPi = E(θi|ỹ ,X ,C ). The resulting BP estimator θ̂BPi can be expressed as a function of ỹ ,X ,C,φ and ψ, say θ̂BPi = π(ỹ ,X ,C ;φ,ψ) ≡ π(φ,ψ). To obtain the BP estimator of θi, it is important to derive the conditional distribution of ỹ (and v) given X and C under the general integrated model if the 66 small area model is a fixed (mixed) effect model. The empirical best prediction (EBP) estimator is obtained from BP by substituting suitable estimators φ̂, ψ̂ of model parameters φ, ψ: θ̂EBPi = π(φ̂, ψ̂). Therefore, once the expression of the BP estimator is obtained, the estimation of θi reduces to the estimation of unknown parameters φ and ψ. Then the general methodology given in Chapter 2 can be used. Recall that when ψ is known, we can estimate φ by solving a system of es- timating equations derived from the conditional distribution of ỹ given X and C . The estimating equations for φ can be generally written as ∑m ′ ′ φ̂(ψ) : fi (ỹ i,X i,C i;φ,ψ) + a (φ,ψ) = 0. (4.5) i=1 ′ ′ where fi (ỹ i,X i,C i;φ,ψ) are vector-valued functions such thatE[fi (ỹ i,X i,C i;φ,ψ)] = 0 with respect to the general integrated model for true values of φ and ψ, a′(φ,ψ) is a vector-valued functions which may depend on the joint distribution of {ỹ1, · · · , ỹm}, ′ and 0 is a vector of zeros with the same order as φ. Note that fi (ỹ i,X i,C i;φ,ψ) has the same order as φ. When a(φ,ψ) 6= 0, it plays the role of a modifier or penalizer. In practice, however, value of ψ is unknown. The MLE estimate ψ̂ of ψ can also be treated as the solution of a system of unbiased estimating equations. ∑m ′′ ψ̂ : fi (C i;ψ) = 0 (4.6) i=1 ′′ ′′ where where fi (C i;ψ) are vector-valued functions such that E[fi (C i;ψ)] = 0 for true values of ψ, and 0 is a vector of zeros of the same order as ψ. 67 Based on (4.5) and (4.6), φ̂ ≡ φ̂(ψ̂) and ψ̂ are solutions to the following estimating equations: ∑m φ̂, ψ̂ : F (φ,ψ) = fi(ỹ i,X i,C i;φ,ψ) + a(φ,ψ) = 0 (4.7) i=1 where ( )T ( )′ ′′ ′ T fi(ỹ i,X i,C i;φ,ψ) = fi (ỹ i,X i,C i;φ,ψ) T , f Ti (C i;ψ) , a(φ,ψ) = a (φ,ψ) T ,0T , for i = 1, · · · ,m. EBP 4.5 Estimation of the Mean Squared Error of θ̂i In order to estimate the mean squared error (MSE) of the empirical best prediction estimator θ̂EBP , the unified jackknife method proposed by Jiang, Lahiri and Wan (2002, JLW) and its alternative proposed by Lohr and Rao (2009, LR) can be used. The jackknife replicate j is constructed by deleting (ỹj,X j,C j) from the data. The delete-j estimates φ̂−j, φ̂−j of φ, ψ are the solutions of ∑ φ̂−j, ψ̂−j : fi(ỹ i,X i,C i;φ,ψ) + a−j(φ,ψ) = 0, i 6=j and the delete-j EBP estimate of θi is given by θ̂ EBP i(−j) = π(φ̂−j, ψ̂−j), for j = 1, · · · ,m. The JLW jackknife estimate of bias, variance, matrix MSE and scale MSE of 68 φ̂ are defined as ¯̂ biasJ(φ̂) = (m− 1∑)(φ − φ̂),− mm 1 − ¯̂ ¯̂varJ(φ̂) = (φ̂−j φ)(φ̂ − φ)T−j , m m− 1 ∑j=1m mseJ(φ̂) = (φ̂ T (−j) − φ̂)(φ̂(−j) − φ̂) , m j=1 ¯̂ ∑ where φ = 1 m m j=1 φ̂−j is the average of the replicate estimate of φ. The jackknife estimates for ψ̂ are defined similarly. In terms of estimating the mean squared error of θ̂EBPi , both the JLW and LR methods are based on the following decomposition of MSE(θ̂EBPi ): MSE(θ̂EBP ) = MSAE(θ̂EBPi i ) +MSE(θ̂ BP i ), (4.8) [ ] [ ] where MSAE(θ̂EBP ) = E (θ̂EBPi i − θ̂BP )2 and MSE(θ̂BPi i ) = E (θ̂BPi − θ )2i . The jackknife estimate of the first term on the right-hand side of (4.8) is the same for both methods. That is, m m− 1 ∑ msae(θ̂EBPi ) = (θ̂ EBP − θ̂EBP )2i(−j) i . (4.9)m i=1 The jackknife estimate for the second term is different for the two methods. The method proposed by Jiang, Lahiri and Wan(2005) requires a closed-form expres- sion for the mean squared error of θ̂BPi , while the method proposed by Lohr and Rao (2009) requires a closed-form expression for the conditional mean squared err[or (CMSE) of θ̂BPi ]. Suppose MSE(θ̂BPi ) = bi(φ,ψ) and CMSE(θ̂BPi ) = E (θ̂BPi − θ )2i |ỹ ,X ,C = V ar(θi|ỹ ,X ,C ) = b′i(φ,ψ). Then the jackknife estimates 69 of the second term are given by ∑m [ ] mse (θ̂BP m− 1 J i ) = bi(φ̂, ψ̂)− bi(φ̂−j, ψ̂−j)− bi(φ̂, ψ̂) , (4.10) ∑m[ j=1 ] mseL(θ̂ BP ) = b′i(φ̂, ψ̂)− b′i(φ̂ ′−j, ψ̂−j)− bi(φ̂, ψ̂) . (4.11) j 6=i Both jackknife estimators are nearly unbiased for MSE(θ̂EBPi ) in the sense of having unconditional bias of order o(1/m) under regularity conditions. Compared with the method proposed by Jiang, Lahiri, and Wan (2005), Lohr and Rao’s ap- proach is less computationally intensive, because it does not require the calculation of bi(φ,ψ), which usually involves integration with respect to ỹ i, X i and C i. 4.6 Summary In this Chapter, a new linkage error model is developed based on the relation- ship between observed and unobserved x values. It provides a connection between the unit-level small area model and the mixture model in the general integrated model for small area estimation. Under the general integrated model, we provide a general methodology for obtaining an empirical best prediction (EBP) estimator of a area-specific mixed parameter θ. A jackknife resampling method for estimating the mean squared error of the EBP estimator is also provided. Application of the general methodology is not limited to the mutual independence of measurements. It can be applied to measurements that are correlated within small areas but independent across small areas. Unit-level models such as general linear model with correlated sampling errors within small areas, general linear mixed model with nested errors can all be considered. 70 Chapter 5: Application to the General Linear Mixed Model 5.1 Introduction To illustrate our general methodology for small area estimation, we consider the situation where the general linear mixed model with block diagonal covariance structure is used as the unit-level small area model. The Empirical Best Predic- tion (EBP) estimator for a mixed parameter is derived under the general integrated model. The closed-form expression for the conditional mean squared error of its cor- responding Best Prediction (BP) Estimator is also provided for estimating its mean squared error using the jackknife method provided by Lohr and Rao (2009). As a special example, we consider the estimation of small area means when a nested error linear model is used. We provide two methods for estimating the unknown parame- ters: the Maximum Likelihood (ML) method and the Pseudo Maximum Likelihood (PML) method. We also discuss the use of numerical algorithms in approximat- ing the maximum likelihood estimates (MLE), including Newton-Raphson method and Fish scoring algorithm, and further propose a quasi-scoring algorithm in order to reduce the computational burden. In this chapter, we follow the notation from Chapter 4. 71 5.2 General Linear Mixed Model with Block Diagonal Covariance In this section, we consider a specific case where the first component of the general integrated model is a general linear mixed model with block diagonal co- variance structure, which may be expressed as: ỹ i = X̃ iβ + Ũ iv i + ei, i = 1, . . . ,m. (5.1) Here, ỹ = (ỹ )nii ij j=1 represents the ni × 1 vector of observed y values in small area i from file Fy, X̃ i = (x̃ T ni ij)j=1 is the ni×p matrix of unobserved x values corresponding to ỹ i, β is an p× 1 vector of fixed effects, Ũ i is a known matrix of dimension ni×h, v i is an h× 1 vector of area-specific random effects, ei is an ni× 1 vector of random errors, and the v is and eis are independent with v i ∼ N(0,Gi) and ei ∼ N(0,Ri), where Gi and Ri depend on variance parameters δ . We are interested in estimating a linear combination of fixed effects β and mixed effects v i, say θ T T i = ai β + bi v i, for specified known vectors ai of order p and bi of order h, i = 1, . . . ,m. Under the assumption that vi and ei are independent of X i and C i given X̃ i, the conditional distribution of ỹ i andv i given X i and C i is     ỹ i QiX βind i   Σi Ũ iGi|X ,C ∼ N   , i i  . (5.2) v 0 G Ti iŨ i Gi Here, Q i ni Nii = (qjj′)j=1,j′=1 is the matrix of matching probabilities, and Σi = K i + Ũ iG T iŨ i + Ri, where K i = (k i jt) ni,ni j=1,t=1 is a ni × ni diagonal matrix with diagonal 72 entry kijj equal to∑Ni kijj = q i i T T jj′(1− qjj′)xij′ββ xij′ , i = 1, . . . ,m, j = 1, . . . , ni. j′=1 The detailed proof of (5.2) is given in Appendix 5.4.1. Therefore, the conditional distribution of v ( i given ỹ i , X i and C)i is v i| ind ỹ i,X i,C i ∼ N G T −1 T −1iŨ i Σi (ỹ i −QiX iβ),Gi −GiŨ i Σi Ũ iGi . (5.3) Based on results from (5.3), the BP estimator of v i and θi are given by v̂BP = v̂BPi i (β,δ,ψ) = E(v i|ỹ i,X i,C i) = GiŨ T −1i Σi (ỹ i −QiX iβ), (5.4) θ̂BP = θ̂BPi i (β,δ,ψ) = E(θi|ỹ i,X ,C ) = aTi i i β + bTi v̂ BP i , and the conditional mean squared error of θ̂BPi is given by( ) CMSE(θ̂BP ) = bTV ar(v |ỹ ,X ,C )b = bT G −G Ũ TΣ−1i i i i i i i i i i i i Ũ iGi bi. Note that the estimator θ̂BPi depends on the model parameters β , δ from the small area model and ψ from the mixture model. When ψ is known, the estimates β̂(ψ) and δ̂(ψ) of β and δ can be obtained by using the maximum likelihood method based the conditional distribution of ỹ i given X i and C i: ỹ i| ind X i,C i ∼ N(QiX iβ,Σi). When ψ is unknown, the MLE estimate ψ̂ of ψ can be obtained by using the Expectation-Maximization algorithm, and β̂ = β̂(ψ̂), δ̂ = δ̂(ψ̂). By substituting β̂ for β , δ̂ for δ , and ψ̂ for ψ in (5.4), we can obtain the EBP estimator of θ: θ̂EBPi = θ̂ BP i (β̂ , δ̂ , ψ̂) = a T i β̂ + b T BP i v̂ i (β̂ , δ̂ , ψ̂) (5.5) 73 The jackknife methods are applicable to estimate MSE(θ̂BPi ), because the data {ỹ i,X i,C i} are independent across small areas and the unknown parameters {β,δ,ψ} are estimated using estimating equations. We first calculate the delete-j estimators β̂−j, δ̂−j, ψ̂−j for each jackknife replicate j constructed by deleting data {yj,X j,C j} in small area j from the full data, j = 1, . . . ,m, and then obtain the delete-j estimators θ̂EBPi(−j) by replacing β , δ and ψ by β̂−j, δ̂−j and ψ̂−j, respectively. The jackknife MSE estimator proposed by Lohr and Rao (2009) is recommended to use since the closed-form expression of CMSE(θ̂BPi ) is available. The above results can be easily adjusted for the case where a general linear model is used as the first component of the general integrated model by letting Ũ i be a matrix of zeros. 5.3 Nested Error Linear Model In this section, we consider a specific example of the general linear mixed model with block diagonal covariance structure. Here, a nested error linear model is used as a unit-level small area model to characterize the relationshi∑p between y and x. We are interested in estimating population means θi = ȳ 1 i = Np j∈U y fori i ij each small area. The small area mean ȳi can be treated as a linear combination of the fixed effect β and the mixed effect vi under the nested error linear model if the population sizes Npi are sufficiently large. To be exact, θi = ȳi ≈ x̄Ti β + vi, i = 1, . . . ,m. ∑ where x̄ = 1i p j∈U xij is the population mean of x for small area i.Ni i 74 5.3.1 Estimation of Small Area Mean Using Data From a Single File First, we consider the ideal case where data for small area estimation are from a single file as shown in Table 4.1. Or equivalently, the joint observations (yij,xij) are available for each unit in a sample of ni units in small area i, and the population means x̄i are known, the best prediction (BP) estimator of θi is given by θ̂BP = x̄Ti i β + (1−Bi)(ȳi,s − x̄Ti,sβ) (5.6) ∑ ∑ where ȳi,s = 1 ni 1 ni ni j=1 yij and x̄i,s = ni j=1 xij are sample means of y and x in small area i, respectively, and Bi = σ 2/(σ2e e + n 2 iσv). An empirical best prediction (EBP) estimator of small area mean ȳi can be obtained by substituting the unknown parameters φ = (βT , σ2 2v , σe) with their suitable estimates φ̂ = (β̂ T , σ̂2v , σ̂ 2 e) in the above formula for BP estimator. Typically, the weighted least squares methods is used to estimate β , and methods of moments (MOM), maximum likelihood method (ML), or restricted maximum likelihood (REML) method is used to estimate the variance components σ2 and σ2v e . Under regularity conditions, the mean squared error (MLE) of the EBP estimator of small area mean θ = ȳi can be estimated by M̂SE(θ̂EBPi ) ≈ g1i(σ̂2, σ̂2 2 2 2 2v e) + g2i(σ̂v , σ̂e) + g3i(σ̂v , σ̂e) where the functions g1i(·), g2i(·), g3i(·) are given in Rao (2003), Chapter 7. 5.3.2 Estimation of Small Area Mean Using Data From Two Files Now, we consider the case where observations on y and x are separately recorded in two different files (Fy and Fx) and the matching status between records 75 from Fy and Fx is unknown. The data layout is shown in Table 4.2. Assuming that the nested error linear model also holds for all sampled units in file Fy, the general integrated model then becomes: iid iid (a) ỹij = x̃ T ijβ + vi + eij, vi ∼ N(0, σ2), e ∼ N(0, σ2);∑ v ij eNi (b) x̃ iij = ljj′xij′ ; (5.7) j′=1 i (c) lijj′ ∼ id ind ind Binom(1, π), ci ijj′k|ljj′ = 1 ∼ Binom(1,mk), cijj′k|lijj′ = 0 ∼ Binom(1, uk); where vi are independent of eij, i = 1, . . . ,m, j = 1, . . . , ni. As discussed above, the small area mean can be expressed as ȳ Ti ≈ x̄i β + vi under the nested error linear model, where x̄i is known for each small area i, i = 1, . . . ,m. Then, derivation of the best prediction estimator of ȳi reduces to the derivation of the best prediction estimator of random effect vi. This requires us to obtain the conditional distribution of vi given observed data under the general integrated model. Under the nested error linear model as described in (a), it is not difficult to obtain that E(ỹij|x̃ Tij) = x̃ijβ, Cov(ỹij, ỹit|x̃ij, x̃it) = σ2v , V ar(ỹ 2 2ij|x̃ij) = σv + σe , Cov(ỹ , v |x̃ ) = σ2ij i ij v . (5.8) for integers i ≤ i ≤ m, 1 ≤ j, t ≤ ni, 1 ≤ j′ ≤ Ni. Based on the linkage error model as described in (b), for any value of β , we can get ∑Ni x̃Tijβ = l i jj′x T ij′β. (5.9) j′=1 76 Under the mixture model as described in (c), by using Bayes’ rule and the LAR assumption, we can obtain that P (lijj′ = 1|cijj′) = qijj′ . Thus, for any 1 ≤ j, j′ ≤ ni, 1 ≤ t, t′ ≤ Ni, and j 6= t, we have E(li i i ijj′ |C i) = qjj′ , V ar(ljj′ |C i) = qjj′(1− qijj′), Cov(li ijj′ , ltt′ |C i) = 0. (5.10) By combining results from (5.9) and (5.10), we can get ∑Ni E(x̃T i Tijβ |X i,C i) = qjj′xij′β, j∑′=1Ni V ar(x̃Tβ |X ,C ) = qi (1− qi )(xT 2ij i i jj′ jj′ ij′β) , (5.11) j′=1 Cov(x̃T Tijβ, x̃itβ |X i,C i) = 0, for integers 1 ≤ i ≤ m, 1 ≤ j, t ≤ ni, (j 6= t). Under the assumption that vi and ei are independent of X i and C i given X̃ i, by combining results from (5.8) and (5.11), we can prove that ∑Ni E(ỹ |X ,C i Tij i i) = qjj′xij′β, j∑′=1Ni V ar(ỹ i i T 2 2 2ij|X i,C i) = qjj′(1− qjj′)(xij′β) + σv + σe , (5.12) j′=1 Cov(ỹij, ỹ 2 it|X i,C i) = σv , Cov(ỹij, vi|X i,C i) = σ2v , for integers 1 ≤ i ≤ m, 1 ≤ j, t ≤ ni, (j =6 t). The detailed proof of (5.11) and (5.12) is given in Section 5.4.2 and Section 5.4.3, respectively. 77 The above results can also be written in the following matrix form: E(ỹ i|X i,C i) = QiX iβ, V ar(ỹ |X ,C ) = K + σ21 1T + σ2i i i i v n n eIn := Σi i i i Cov(ỹ i, vi|X i,C i) = σ2v1n .i where 1n is the ni i × 1 vector of ones, In is the ni × ni i identity matrix, and K i = (ki )nini is an n × n diagonal matrix with diagonal entry kijj′ j=1j′=1 i i jj equal to ∑Ni kijj = q i jj′(1− qijj′)(xT 2ij′β) . j′=1 Here, K i can also be written in a matrix form, that is, {[ ] } K i = diag Qi ◦ (1n 1TN −Qi) [(X iβ) ◦ (X iβ)] .i i where M 1 ◦M 2 is the Hadamard product of any two matrix M 1 and M 2 of the same dimension, and diag{v} is a diagonal matrix with diagonal entries same as entries of a vector v. Note that both Σi and K i depends on unknown parameter φ = (βT , σ2v , σ 2 e) from the nested error linear model and unknown parameter ψ = (π,m1, . . . ,mK , u1, . . . , uK) from the mixture model. Based on the above analysis, the conditional distribution of (ỹ i, vi) given X i and C i is then given by      ỹ i |X i,C i ∼ N Q 2iX iβ  Σi 1n σi v,  . vi 0 1 T 2 2 n σ σi v v 78 Therefore, for known φ and ψ, the BP estimate of vi and θi are given by v̂BP 2 T −1i = E(vi|ỹ i,X i,C i) = σv1n Σi (ỹ i −QiX iβ),i θ̂BP = x̄Ti i β + σ 2 v1 T n Σ −1 i i (ỹ i −QiX iβ) (5.13) The EBP estimate θ̂EBPi is obtained by replacing φ and ψ by their estimates φ̂ and ψ̂ in formula (5.13). That is, θ̂EBP −1 i = x̄ T i β̂ + σ̂ 2 T v1n Σ̂i i (ỹ i − Q̂iX iβ̂), (5.14) where Σ̂i = Σi(φ̂, ψ̂) is an estimate of Σi. The methods for estimating unknown parameters φ and ψ will be introduced in the subsequent part. The jackknife methods described in this dissertation can be used to the mean squared error of θ̂EBPi . The choice of the jackknife methods depends on whether a closed-form expression for MSE(θ̂BPi ) or CMSE(θ̂ BP i ) is available. We prove in Section 5.4.4 that CMSE(θ̂BPi ) = σ 2 − σ21Tv v n Σ−1 2i i 1n σi v , MSE(θ̂BP 2 2 Ti ) = σv − σv1n E(Σ−1i )1n σ2v . (5.15)i i Note that there exists a closed-form expression for CMSE(θ̂BPi ), but not for MSE(θ̂BPi ) due to the difficulty in calculating the component E(Σ −1 i ). Therefore, the estimate of MSE(θ̂EBPi ) can be obtained by using the jackknife method pro- posed by Lohr and Rao (2009) rather than the one proposed by Jiang, Lahiri and Wan (2002). 79 5.3.3 Estimation of φ: Maximum Likelihood Method When ψ is known, we can use the maximum likelihood method to estimate φ. Based on the facts that f(ỹ i,X i,C i;φ,ψ) = f(ỹ i|X i,C i;φ,ψ)f(X i,C i;ψ) and f(X i,C i;ψ) does not depends on φ, the log-likelihood function of φ based on data {y i,X i,C i, i = 1, . . . ,m} can be expressed as ∑m l(φ; ỹ ,X ,C ) ∝ ln (f(ỹ i|X i,C i;φ,ψ)) i=1 ∑m { } = −1 ni ln(2π) + ln |Σi |+ (ỹ T −1i −QiX iβ) Σi (ỹ i −QiX iβ) .2 i=1 The first derivatives of l(φ; ỹ ,X ,C ) are given by: m { ∂l(φ; ỹ ,X ,C ) 1 ∑ ( ) = − tr Σ−1i Di,k − 2δTkXT T −1i Qi Σi (ỹ i −QiX iβ)∂βk 2 i=1 } − (ỹ i −Qi{X iβ) TΣ−1i D −1 i,kΣ ∑ i (ỹ i −QiX iβ) , (5.16) m ∂l(φ; ỹ ,X ,C ) −1 ( = tr Σ− ) 1 i 1n 1 T ∂σ2 2 i niv i=1 } − (ỹ −Q X β)TΣ−11 1Ti i{ i i n n Σ −1 i (ỹ i −QiX iβ) ,i i m } ∂l(φ; ỹ ,X ,C ) 1 ∑ ( ) = − tr Σ−1 − (ỹ −Q X β)TΣ−1Σ−1(ỹ −Q X β) , ∂σ2 i i i i i i i i i e 2 i=1 where tr(M ) is the trace of any square matrixM , δk is the kth column of the identity matrix I p∑, andDi,k = (d i nini jj′,k)j=1j′=1 is a ni×ni diagonal matrix with diagonal entries di = 2 Ni qi (1− qijj,k j′=1 jj′ jj′)(xTij′β)xij′k for i = 1, . . . ,m, j = 1, . . . , ni, k = 1, . . . , p. 80 The second derivatives of l(φ; ỹ ,X ,C ) are given by: m { ∂2l(φ; ỹ ,X ,C ) 1 ∑ ( ) = − tr −Σ−1i D −1 −1i,tΣ∂β i Di,k + Σi Di,kt t∂βk 2 i=1 + 2δTXT T −1 −1k i Qi Σi Di,tΣi (ỹ i −QiX iβ) + 2δTXT T −1 T T T −1 −1k i Qi Σi QiX iδ t + 2δ t X i Qi Σi Di,kΣi (ỹ i −QiX iβ) + 2(ỹ i −QiX iβ)TΣ−1i D Σ−1i,t i D −1i,kΣi (ỹ i −}QiX iβ) − (ỹ −Q X β)TΣ−1i i{ i i D Σ −1 i,kt i (ỹ i −QiX iβ) ∂2 m l(φ; ỹ ,X ,C ) 1 ∑ ( ) = − − tr Σ−1i D Σ−11 T2 i,k i n 1∂β ∂σ 2 i nik v i=1 + 2δT T T −1 T −1kX i Qi Σi 1n 1 Σi ni i (ỹ i −QiX iβ) } + 2(ỹ −Q X β)TΣ−1D −1i {i i i i,kΣi 1 1 T Σ−1n n i (ỹ i −QiX iβ) ,i i ∂2 m l(φ; ỹ ,X ,C ) 1 ∑ ( ) = − − tr Σ−1i D −1i,kΣi + 2δTkXTi QTΣ−1Σ−1i i i (ỹ i −Q2 } i X iβ) ∂βk∂σe 2 i=1 + 2(ỹ i −Q X β)TΣ−1 −1 −1{i i i Di,kΣi Σi (ỹ i −QiX iβ) , (5.17)m ∂2l(φ; ỹ ,X ,C ) ∑ ( ) = −1 tr −Σ−11 1T −1 T ∂σ2 2 n ∂σ 2 i i n Σi 1n 1i i ni v v i=1 } + 2(ỹ i −QiX iβ)TΣ−1i 1n 1T Σ−11 T −1i ni i n 1n Σi (ỹ i −QiX iβ) , 2 ∑ i im {∂ l(φ; ỹ ,X ,C ) −1 (= − tr Σ− )11 1T Σ−1n ∂σ2v∂σ 2 e 2 i i ni i i=1 } + 2(ỹ −Q X T −1 T −1 −1i {i iβ) Σi 1n 1n Σi Σi (ỹ i −QiX iβ) ,i i ∂2 m l(φ; ỹ ,X ,C ) 1 ∑ ( ) = − − tr Σ−1Σ−1 ∂σ2∂σ2 2 i ie e i=1 } + 2(ỹ i −QiX iβ)TΣ−1 −1 −1i Σi Σi (ỹ i −QiX iβ) , where D i nini ii,kt = (djj′,kt)j=1j′=1 is an ni × ni matrix with diagonal entries djj,kt = 81 ∑ 2 m i ii=1 qjj′(1−qjj′)xij′txij′k, i = 1, . . . ,m, j = 1, . . . , ni, k = 1, . . . , p and t = 1, . . . , p. Recall that the maximum likelihood estimation (MLE) of φ is obtained by setting the score function S(φ) = ∂l(φ;ỹ,X,C) to zero. Here, S(φ) is our estimating ∂φ functions. The components of S(φ) are given by equations in (5.16). Based on the above analysis, we can see that there is no closed-form expression for the solutions of S(φ) = 0. In this case, the equations S(φ) = 0 must simultaneously be solved numerically by using some iterative algorithms. In the following part, we discuss the advantages and drawbacks of two existing algorithms for optimizing l(φ; ỹ ,X ,C ) (Newton-Raphson method and Fisher scoring algorithm), and propose a new algo- rithm which is expected to be computationally friendly and robust to poor starting values. 5.3.4 Numerical Algorithms The Newton-Raphson method starts with an initial guess φ(0) for the root of S(φ), and then takes the following iteration step until a convergence criterion is reached: φ(t+1) = φ(t) + J−1(φ(t))S(φ(t)) (5.18) where φ(t) is the approximation of φ at iteration t, and J (φ) is the negative value 2 of the gradient of S(φ), that is, J (φ) = −∂S(φ) = −∂ l(φ;ỹ,X,C)T . The components∂φ ∂φ∂φ of J (φ) can be obtained by taking the negative values of the second derivatives of l(φ; ỹ ,X ,C ) given by equations in (5.17). We can see that J (φ) is a very complicated matrix depending on data {ỹ i,C i,X i, i = 1, . . . ,m} and parameters φ and ψ. 82 The Newton-Raphson method is easy to understand and can be implemented here since the matrix J (φ) is available. It is expected that the Newton- Raphson’s algorithm would converge rapidly because of its quadratic converge rate. However, it can still take long time to converge due to the difficulty in calculating the com- plicated matrix J (φ(t)) in each iteration. Also, the Newton-Raphson method may not be effective if the iterations begin with a poor starting values φ(0). One alternative to Newton-Raphson method is the Fisher scoring algorithm, which is commonly used for optimizing log-likelihood functions. The Fisher scoring algorithm is obtained by replacing the matrix J (φ) in the iteration equation (5.22) by its expectation, i.e., the Fisher information matrix I (φ). The iteration equation then becomes φ(t+1) = φ(t) + I−1(φ(t))S(φ(t)). The Fisher scoring algorithm can reduce the computations in Newton-Raphson method by simplifying the matrix J (φ). Jennrich and Sampson (1976) demonstrated its robustness to poor staring values. However, the Fisher scoring algorithm requires a closed form expression of the Fisher information matrix I (φ), which is not available in our case. By the law of total expectations, [ ] [ ( )] ∂2− l(φ; ỹ ,X ,C ) ∂ 2l(φ; ỹ ,X ,C ) I (φ) = E = E E − |X,C . ∂φ∂φT ∂φ∂φT Therefore, in order to derive I (φ), we need to take expectation of both sides of the equations in (5.19) with respect to X and C . The calculation involves taking expectations of complicated terms QTΣ−1Q −1 T −1 −1 −1i i i, Di,tDi,k, Σi 1n 1n Σi i i , Σi Σi and 83 Σ −1D −1i i,kΣi , which is not practical. Therefore, the Fisher scoring algorithm cannot be implemented here despite of its advantages over the Newton-Raphson method. Based on the above analysis, both the Newton-Raphson method and the Fisher scoring algorithm are not ideal for approximating the MLE estimate of φ. Here, inspired by the derivation of the Fisher scoring algorithm, we propose a new iterative algorithm: the quasi-scoring algorithm. The matrix J (φ) in the Newton-Raphson method is replaced by its conditional expectation given the observed comparison vectors X and C , I c(φ) = E[J (φ)|X,C ], rather than its expected value I (φ). Then the interaction equation used for the quasi-scoring algorithm is given by: φ(t+1) = φ(t) + I−1c (φ (t))S(φ(t)). ( The components of I)c(φ) are given by: 2 ∑m { ( )} −∂ l(φ; ỹ ,X ,C ) 1E |X,C = 2δTXTQTΣ−1Q −1 −1 ( ) k i i i i X iδ t + tr Σi Di,tΣi Di,k ,∂βt∂βk 2 i=1 2 m −∂ l(φ; ỹ ,C ) 1 ∑ E |X,C = 1T Σ−1D Σ−1n i i,k i 1 ,( ∂β ∂σ2k v ) n2 i ii=1 2 m −∂ l(φ; ỹ ,X ,C ) | 1 ∑ ( ) E X,C = tr Σ−1D −1i,kΣ , (5.19)( ∂βk∂σ2 i ie ) 2 i=1 2 m −∂ l(φ; ỹ ,X ,C ) | 1 ∑ E X,C = 1T Σ−11 1T Σ−11 , ( ∂σ2∂σ2 ) 2 ni i ni ni i niv v i=1 ∂2 m − l(φ; ỹ ,X ,C ) | 1 ∑ E X,C = 1T Σ−1Σ−11 ( n , ∂σ2∂σ2 ni i i iv e ) 2 ∑i=12 m ( ) E −∂ l(φ; ỹ ,X ,C ) | 1X,C = tr Σ−1 −1 ∂σ2∂σ2 2 i Σi . e e i=1 ind Th[e derivation of (5.19) is based on] the fact that ỹ i|X i,C i ∼ N(QiX iβ,Σi) and E (ỹ i −QiX β)Ti M (ỹ i −QiX iβ)|C = tr (MΣi) for any ni × ni matrix M , which may depends on X and C but not on ỹ. 84 We can see that the format of I c(φ) is much simpler than that of J (φ) by comparing their components shown in (5.19) and (5.17). So theoretically the quasi- scoring algorithm can reduce computations at each iteration when compared to the Newton-Raphson method, while the Newton-Raphson method can take less number of iterations compared to the quasi-scoring algorithm. Also, the quasi- scoring algorithm may preserve the property of robustness to poor starting values of Fisher scoring, while the Newton-Raphson method may diverge if the starting point is too far from the root of S(φ). Here, it is highly recommended to start an iterative procedure with the quasi-scoring algorithms in the first few steps and then change to Newton-Raphson method later. In this way, the quasi-scoring algorithm can reduce computations and generate a find a good starting value for iterations using the Newton-Raphson method. All iterative methods do have their drawbacks. First, they may not converge well due to poor starting values. The common approach is to use different starting values and hope the algorithm will converge. Second, the resulting approximated solutions φ̃ = (β̃ , σ̃2v , σ̃ 2 T e) of S(φ) = 0 may not be within the parameter space of φ = (β, σ2v , σ 2 e) T . For example, σ̃2 2v ≤ 0 or σ̃e ≤ 0. This inaccuracy may be caused by the different ranges of the parameters. One may improve the accuracy by scaling the parameters σ2v and σ 2 e so that they all have approximate the same range as β . 85 5.3.5 Estimation of φ: Pseudo Maximum Likelihood Method Pseudo maximum likelihood method was first introduced by Samart and Chambers (2014) as an appropriate modifications to the maximum likelihood method of variance components estimation in presence of linkage errors. Based on the above analysis shown in 5.3.3, we can see that the maximum likelihood method may not be easy to use in practice. This is mainly caused by the difficulty of calcu- lating the derivatives of l(φ; ỹ i,X i,C i) due to the complexity of Σi. The matrix Σi depends on not only the variance components σ2, σ2v e but also the regression coeffi- cient β . The basic idea of the pseudo maximum likelihood method is to reduce the computation complexity by assuming that Σi is fixed in β , and depends on σ 2 v and σ2e only. Whenψ is known, the log-likelihood function of φ based on data {ỹ i,X i,C i, i = 1, . . . ,m} is given by m −1 ∑{ l(φ; ỹ ,X ,C ) = n ln(2π) + ln |Σ |+ (ỹ −Q X β)TΣ − } 1 i i i i i i (ỹ i −QiX iβ) . 2 i=1 Under the assumption that Σi is fixed in β , the first derivatives of l(φ; ỹ ,X ,C ) with respect to β , σ2v and σ 2 e are given by m ∂l(φ; ỹ ,X ,C ) ∑ = XT T −1i Qi Σi (ỹ i −QiX iβ),∂β i=1 m ∂l(φ; ỹ ,X ,C ) −1 ∑{ } = 1T Σ−1n i 1n − (ỹ i −Q X β)TΣ−1 T −12 i i i i i 1n 1i n Σ (ỹ −Qi i i iX iβ) ,∂σv 2 i=1 m ∂l(φ; ỹ ,X ,C ) 1 ∑{ ( ) } = − tr Σ−1i − (ỹ −Q X β)TΣ−1i i i i Σ−1i (ỹ i −QiX iβ) . (5.20)∂σ2e 2 i=1 The pseudo maximum likelihood estimate (PMLE) of φ is obtained by solving 86 the following estimating functio(ns: )T ∂l(φ) ∂l(φ) ∂l(φ) S̃(φ) = , , = 0 ∂β ∂σ2 ∂σ2v e Note that the resulting solution of S̃(φ) = 0 is referred as the PMLE estimator of ψ because the estimating equations S̃(φ) is derived under the assumption Σi is fixed in β . However, Σi is actually a function of β , σ 2 v and σ 2 e . Therefore, the estimating equations S̃(φ) = 0 cannot be solved analytically. Again, we can use either Newton-Raphson algorithm or quasi-scoring algorithm to solve the estimating equations iteratively. From (5.20), we can obtain the second derivatives of l(φ): ∂2 m l(φ; ỹ ,X ,C ) ∑ = − XTQTΣ−1QiX i, ∂β∂βT i i i i=1 m ∂2l(φ; ỹ ,X ,C ) ∑ = − XTi QT −1 T −12 i Σi 1n 1n Σ (ỹi i i i −QiX iβ),∂β∂σv i=1 m ∂2l(φ; ỹ ,X ,C ) ∑ = − XTQTΣ−1 −1i i i Σi (ỹ i −QiX iβ),∂β∂σ2e i=1 m { ∂2l(φ; ỹ ,X ,C ) 1 ∑ = − − 1T Σ−11 1T Σ−11 2 2 ni i ni ni i n (5.21) ∂σv∂σv 2 i i=1 } + 2(ỹ T −1 T −1 T −1i −Q{iX iβ) Σi 1n 1n Σi 1n 1n Σi (ỹ i −QiX iβ) , ∂2l(φ; ỹ ,X ,C ) 1 ∑ i i i i m = − − 1T Σ−1Σ−11 ∂σ2∂σ2 2 ni i i ni v e i=1 } + 2(ỹ −Q X β)TΣ−1Σ−1i {i i i i 1 1 T Σ−1ni ni i (ỹ i −QiX iβ) , 2 m∂ l(φ; ỹ ,X ,C ) ∑ ( ) = −1 − tr Σ−1Σ−1 ∂σ2e∂σ 2 e 2 i i i=1 } + 2(ỹ i −QiX β)TΣ−1 −1i i Σi Σ−1i (ỹ i −QiX iβ) . The Newton-Raphson for approximating the pseudo maximum likelihood es- 87 timator of φ = (β, σ2v , σ 2 e) has the following iteration equations: φ(t+1) = φ(t) + J̃−1(φ(t))S̃(φ(t)) (5.22) where φ(t) is the approximation of φ at iteration t, and J̃ (φ) is the negative value of the gradient of S̃(φ). The components of J̃ (φ) can be obtained by taking the negative values of the second derivatives o[f l(φ; ỹ ,X ,C ) given by equations in](5.21). By using the fact that E (ỹ i −Q X β)Ti i M (ỹ i −QiX iβ)|X,C = tr (MΣi) and that E(ỹ i|X i,C i) = QiX iβ , we can obtain the following results from (5.21):[ ] m −∂ 2l(φ; ỹ ,X ,C ) ∑ E |X,C = XTQTΣ−1i i i QiX[ T ] i , ∂β∂β i=1 ∂2− l(φ; ỹ ,X ,C )E [ |X,C = 0 ,∂β∂σ2 pv ] ∂2− l(φ; ỹ ,X ,C )E [ |X,C = 0p, (5.23)∂β∂σ2e ] 2 ∑m −∂ l(φ; ỹ ,X ,C ) | 1E X,C = 1T Σ−11 1T Σ−1 [ ∂σ2∂σ2 ] 2 ni i ni n 1 , i i ni v v i=1 ∂2 m − l(φ; ỹ ,X ,C ) ∑ E | 1X,C = 1T Σ−1Σ−11 , [ ∂σ2 nv∂σ2 ni i i ie ] 2 i=1 2 m −∂ l(φ; ỹ ,X ,C ) | 1 ∑ ( ) E X,C = tr Σ−1Σ−1 , ∂σ2∂σ2 i ie e 2 i=1 where 0p is a p-order vector of zeros. Thus, the quasi-scoring algorithm for approximating the pseudo maximum likelihood estimator of φ = (β, σ2v , σ 2 e) has the following iteration equations: φ(t+1) = φ(t) + {Ĩc(φ(t))}−1S̃(φ(t)) 88 where φ(t) is the estimate of φ at iteration t, and Ĩc(φ) is given by ∑  X −1iQiΣi QiX i 0p 0p  m 1  Ĩc(φ) =  0T 1T −1 T −1 T −1 −1  . (5.24)2 p n Σi i 1n 1 Σi ni i 1n 1i n Σii=1 − − ( i Σi 1ni 0T 1T Σ 1Σ 11 tr Σ− ) 1Σ−1p ni i i ni i i 5.4 Proofs 5.4.1 Proof of (5.2) Theorem: When the linear mixed model (5.1), the linkage error model (4.3) and the mixture model (2.4) are used as the three components of the general integrated model, the conditional distribution of ỹ i and v i given X i and C i can be derived under the assumption that v i and ei are independent of X i and C i given X̃ i. That is,      ỹ i QiX iβ| indX  i,C i ∼ N    Σi Ũ iGi,  . (5.25) v i 0 G Ũ T i i Gi Here, Q = (qi ni Nii jj′)j=1,j′=1 is the matrix of matching probabilities, and Σi = K i + Ũ TiGiŨ i + Ri, where K i = (k i ni,ni jt)j=1,t=1 is a ni × ni diagonal matrix with diagonal entry kijj equal to ∑Ni kijj = q i jj′(1− qi T Tjj′)xij′ββ xij′ , j = 1, . . . , ni, i = 1, . . . ,m. j′=1 Proof: Under the LAR assumption that P (L|ỹ ,X ,C ) = P (L|C ), we can obtain the following from the mixture model described in (2.4): E[Li|ỹ i,X i,C i] = E[Li|C i] = Qi. (5.26) 89 Recall that Qi depends on C i but not X i. Therefore, E(Li|X i,C i) = E[E(Li|ỹ i,X i,C i)|X i,C i] = E[Qi|X i,C i] = Qi. (5.27) Under the linkage error model that X̃ i = LiX i, by applying the result in (5.27), it is not difficult to obtain that [ ] E [X̃ iβ |X i,C i] = E [[LiX iβ |X i,C i] = E] [Li|X[i,C i]X iβ =]Q( iX[iβ, ])T V ar X̃ iβ |X i,C i = E [X̃ iββ TX̃Ti |X i,C i]− E X̃ iβ |X i,C i E X̃ iβ |X i,C i = E X̃ ββTX̃Ti i |X i,C i −QiX iββTXT Ti Qi . Now, we partition X̃ T T iT iTi and Qi into ni row vectors x̃i1, . . . , x̃in andi ∑q1 , . . . , qn ,i respectively, where q iT ij = (qj1, . . . , q i jN ). By∑using the fact that x̃ = Ni i T i ij j′=1 ljj′xij′ under the lin[kage model ]and that q iTX = Ni i Tj i j′=1 qjj′xij′ , we can achieve the (j, t) entry of V ar X̃ iβ |X i,C i , that is, [ ] [ ] ( ) V ar X̃ β |X ,C = E X̃ ββT T T T Ti i i i X̃ |X i,C i − QiX iββ X Q j,t [ i ] i ij,t j,t = E x̃(T T iTijββ x̃it|X i),C i −(qj X iββTX ∑ ∑ ) T i q i t  Ni N T ∣ i ∣ = E li T T i Tjj′xij′ ββ l ∣ tt′xit′ ∣X i,C i( j′=1 ) ( t′∑ =1 )N Ti ∑Ni − qi xT ββT i T[ jj′ ij′ qjt′xit′j′=∑1 t′=1N ∣ ]i ∑Ni ∣ = E li T T i ∣jj′xij′ββ xit′ltt′∣X i,C i ∑ j′=1 t′=1Ni ∑Ni − qi xT ββT ijj′ ij′ X it′qtt′ j∑′=1 t∑′=1Ni Ni ( [ ] ) = E li li i i T Tjj′ tt′|C i − qjj′qtt′ xij′ββ xit′ , j′=1 t′=1 90 for j = 1, . . . , ni and t = 1, . . . , ni. Since lijj′s are conditionally independent given C i with P (l i jj′ = 1|C i) = qijj′ under the mixture model for j = 1, . . . , ni and j ′ = 1, . . . , Ni, then [ ]   i ′qjj′ if j = j and t = t ′ E li ijj′ltt′ |C i =  .qijj′qitt′ otherwise Therefo[re, the (j, t) ]off-diagonal entry (i.e., when j 6= t) and the (j, j) diagonal entry of V ar X̃ iβ |X i,C i are given by [ ] ∑Ni ∑Ni ( ) V ar X̃ β |X ,C = qi i i i T Ti i i jj′qtt′ − qjj′qtt′ xij′ββ xit′ = 0 (j 6= t), [ ]j,t j∑′=1 (t′=1Ni [ ] ) V ar X̃ β |X ,C = E li i i i T Ti i i jj′ljj′ |C i − qjj′qjj′ xij′ββ xij′ j,j ∑j′=1Ni ∑( [ ] ) + E li ijj′ljt′|C i − qi i T Tjj′qjt′ xij′ββ xit′ j∑′=1 t(′ 6=j′Ni ) = qijj′ − qi i T Tjj′qjj′ xij′ββ xij′ ∑j′=1Ni ∑( ) + qi qi − qi i T Tjj′ jt′ jj′qjt′ xij′ββ xit′ j∑′=1 t′ 6=j′Ni ( ) = qi i Tjj′ 1− qjj′ xij′ββTxij′ j′=1 = kijj. Based on the above analysis, [ ] [ ] E X̃ iβ |X i,C i = QiX iβ, V ar X̃ iβ |X i,C i = K i. (5.28) From the general linear mixed model, we have [ ] ( ) ( ) E ỹ i|X̃ i = X̃ iβ, V ar ỹ Ti|X̃ i = Ũ iGiŨ i +Ri, Cov ỹ i, v i|X̃ i = Ũ iGi (5.29) 91 Under the assumption that v i and ei is independent of X i and C i given X̃ i, ỹ i is independent of X i and C i given X̃ i. By combining (5.28) and (5.29), we have [ ( ) ] [ ( ) ] E [ỹ i|X i,C i] = E [E ỹ i|X̃]i,X i,C i |X i,C i = E E ỹ i|X̃ i |X i,C i = E X̃(iβ |C[ i = QiX iβ,] ) V ar (ỹ i|X i,C i) = V a[r E(ỹ i|X̃ i,X i,C i) |X i,C i] + E V(ar [ỹ i|X̃ i,X] i,C i |X) i,C i[ ( ) ] = V ar (E ỹ i|X̃ i, )|X i,C i[ + E V ar ỹ i|X̃ i |]X i,C i = V ar X̃ iβ |X i,C i + E Ũ iGiŨ Ti +Ri|X i,C i = K i + Ũ iG Ũ T i i +Ri = Σi, ( [ ] [ ] ) Cov (ỹ i, v i|X i,C i) = Co[v E(ỹ i|X̃ i,X i,C i , E) v i|X̃ i,]X i,C i |X i,C i + E C(ov [ỹ i, v i|X̃] i,X[i,C i ]|X i,C i ) = Co[v E(ỹ i|X̃ i , E) v i|X̃ i]|X i,C i + E C(ov ỹ i, v i|X̃ i |)X i,C i[ ] = Cov X̃ iβ, 0|X i,C i + E Ũ iGi|X i,C i = Ũ iGi. Therefore, the conditional distribution of ỹ i andv i given X i and   C i is ỹ i QiX βind i   Σi Ũ iGi|X i,C i ∼ N ,  . v i 0 GiŨ T i Gi 92 5.4.2 Proof of (5.11) (∑ )Ni E(x̃Tijβ |X ,C ) = E li xTi i jj′ ij′β |X i,C i ∑ j′=(1Ni ) = E lijj′|X i,C i xTij′β j∑′=1Ni = E(li ijj′ |cjj′)xij′β j∑′=1Ni = qi T(jj′xij′β,j′=1 ∑ )Ni V ar(x̃Tijβ |X i,C i) = V ar li Tjj′xij′β |X i,C i ∑ j′=1Ni = V ar(lijj′|X i,C i)(xTij′β)2 j∑′=1Ni = V ar(li i T 2jj′|cjj)(xij′β) j∑′=1Ni = qi(jj′(1− q i T 2 jj′)(xij′β) . j′=1 ∑ )Ni ∑Ni Cov(x̃Tijβ, x̃ T itβ |X i,C i) = Cov lijj′xT i Tij′β, ltt′xit′β |X i,C i ∑∑j′=1 t′=1Ni Ni ( ) = Cov li T i Tjj′xij′β, ltt′xit′β |X i,C i j∑′=1 t′=1Ni ∑Ni = Cov(li i T Tjj′ , ltt′|C i)(xij′β)(xit′β) j′=1 t′=1 = 0. 93 5.4.3 Proof of (5.12) Assuming vi and eij is conditionally independent of X i and C i given X̃ i, then we have E(ỹij|x̃ij,X i,C i) = E(ỹij|x̃ij), V ar(ỹij|x̃ij,X i,C i) = V ar(ỹij|x̃ij), Cov(ỹij, ỹit|x̃ij, x̃it,X i,C i) = Cov(ỹij, ỹit|x̃ij, x̃it), and E(vi|x̃ij,X i,C i) = E(vi|x̃ij). By applying the law of total conditional expectation, we have ∑Ni E(ỹij|X i,C i) = E [E(ỹij|x̃ij,X i,C i)|X ,C Ti i] = E[x̃ijβ |X i,C i] = qijj′xTij′β. j′=1 By applying the law of total conditional variance, we have V ar(ỹij|X i,C i) = V ar (E(ỹij|x̃ij,X i,C i)|X i,C i) + E (V ar(ỹij|x̃ij,X i,C i)|X i,C i) = V ar ((E(ỹij|x̃ij)|X)i,C i)(+ E (V ar(ỹij|x̃)ij)|X i,C i) = V ar x̃Tijβ |X i,C i + E σ2v + σ2e |X i,C∑ iNi = qi (1− qi T 2 2 2jj′ jj′)(xij′β) + σv + σe . j′=1 By applying the law of total conditional covariance, we have Cov(ỹij, ỹit|X i,C i) = Cov (E(ỹij|x̃ij, x̃it,X i,C i), E(ỹit|x̃ij, x̃it,X i,C i)|X i,C i) + E [Cov(ỹij, ỹit|x̃ij, x̃it,X i,C i)|X i,C i] = Cov (E(ỹij|x̃ij), E(ỹit|x̃it)|X i,C i) + E [Co(v(ỹij, ỹit|x̃ij, x̃it)|)X i,C i[] ] = Cov x̃Tβ, x̃Tij itβ |X i,C i + E σ2v |X i,C i = 0 + σ2v = σ2v , 94 Cov(ỹij, vi|X i,C i) = E[Cov(ỹij, vi|x̃ij,X i,C i)|X i,C i] + Cov (E(ỹij|x̃ij,X i,C i), E(vi|x̃ij,X i,C i)|X i,C i) = E[Cov(ỹij, vi|x̃ij)|X(i,C i] + Cov (E)(ỹij|x̃ij), E(vi|x̃ij)|X i,C i) = E[σ2v |X i,C i] + Cov x̃Tijβ, 0|X i,C i = σ2v . 5.4.4 Proof of (5.15) Based on the general integrated model, the conditional distribution of (y i, vi) given X i and C iis  ỹ i   QiX iβ    Σi 1n σ2i v|X ,C ∼ N , i i  . vi 0 1 T σ2n v σ 2 i v It is also not difficult to obtain that V ar(vi|ỹ i,C i) = σ2v − σ21T −1 2v n Σi i 1n σv .i The conditional mean squared error (CMSE) of θ̂BPi is equal to [ ] CMSE(θ̂BP BP 2i ) = E (θ̂(i − θi) |ỹ i,X i,C i) ( [ ])2 = V ar θ̂BPi − θi|ỹ i,X i,(C i + E θ̂ BP i − θi|ỹ i),X i,C i2 = V ar (θ |ỹ ,X ,C ) + θ̂BPi i i i i − E [θi|ỹ i,X i,C i] = V ar (θi|ỹ i,X i,C i) + 0 = V ar(x̄iβ + vi|ỹ i,X i,C i) = V ar(vi|ỹ i,X i,C i) = σ2v − σ21T −1 2v n Σi i 1n σi v . 95 The mean squared error (MSE) of θ̂BPi is equal to[ ] MSE(θ̂BPi ) = E (θ̂ BP { i[ − θ 2 i) ]} = E E (θ̂BP[ i − θ 2 i)] |ỹ i,X i,C i = E [CMSE(θ̂ BP i ) ] = E σ2 − σ21T Σ−1v v n i 1n σ2i i v = σ2v − σ2 Tv1n E(Σ−1i )1 2n σv .i i 5.4.5 Proof of (5.16) and (5.17) When the mixture model parameter ψ is known, the log-likelihood function of φ based on observed data {ỹ i,X i,C i, i = 1, . . . ,m} is given by ∑m1 { } l(φ;y,X,C ) = − ni ln(2π) + ln |Σ T −1i|+ (ỹ i −QiX iβ) Σi (ỹ i −QiX iβ) .2 i=1 Since the first derivative of kijj with respective to the kth element of β , βk, is equal to ∂ki ∑Nijj = 2 qi i Tjj′(1− qjj′)(xij′β)x iij′k := djj,k, (5.30)∂βk j′=1 for k = 1, . . . , p, where xij′k is the kth element of xij′ . Define Di,k as a ni × ni matrix with diagonal entries di ∂K ijj,k. Then = Di,k, the first derivatives of Σ∂β i =k K + σ21 1T + σ2i v n Ii ni e n arei ∂Σi ∂K ∂Σ ∂σ 21 1T 2i i v ni n T ∂Σi i ∂σeIn= = D , = = 1 1 , = ii,k n n = In , (5.31)∂βk ∂β ∂σ2 ∂σ2 i i 2 2 i k v v ∂σe ∂σe and ∂β = δk. (5.32) ∂βk 96 where δk is defined as the kth column of identity matrix I p. By using results from (5.31) and (5.32), the first partial derivatives of the likelihood function l(φ; ỹ ,X ,C ) with respect to parameters βk, σ 2 v and σ 2 e are given by: ∂l(φ; ỹ ,X ,C ) ∂βk m { −1 ∑ ∂ ln |Σ Ti | − ∂β= 0 + XT T −1i Qi Σi (ỹ i −QiX iβ)2 ∂β ∂β i=1 k k −1 }∂Σ + (ỹ −Q X β)T ii i{ i ( (ỹ)i − ∂β QiX iβ)− (ỹ −Q T −1i iX iβ) Σ QiX i 1 ∑ ∂β i k ∂βk m − ∂Σi= tr Σ−1 − 2δTXTQTΣ−1i k i i i (ỹ i −QiX iβ)2 ∂β i=1 k } − ∂Σi(ỹ i −Qi{X iβ) TΣ−1 Σ−1i i (ỹ i −QiX iβ)∑ ( ∂β)km1 = − tr Σ−1i D T T T −1i,k − 2δ2 k X i Qi Σi (ỹ i −QiX iβ) i=1 } − (ỹ −Q X β)TΣ−1D Σ−1i i i i i,k i (ỹ i −QiX iβ) ; (5.33) ∂l(φ; ỹ ,X ,C ) ∂σ∑2vm { } −1 ∂ ln |Σ | ∂Σ −1 i = 0 + + (ỹ i −QiX iβ)T i (ỹ −Q X β) 2 { ( ∂σ2 ) ∂σ2 i i i∑i=1 v vm }1 ∂Σi ∂Σi = − tr Σ−1 − (ỹ −Q T −1 −1i iX iβ) Σ Σ (ỹ i −QiX iβ) 2 i 2 i∑{ ( ∂σv ) ∂σ 2 i i=1 v m } = −1 tr Σ−1i 1n 1Tn − (ỹ i −QiX iβ)TΣ−11 T −1i i i n 12 i n Σ i i (ỹ i −QiX iβ) ; (5.34) i=1 97 ∂l(φ; ỹ ,X ,C ) ∂σ∑2em { } −1 ∂ ln |Σi | − T ∂Σ −1 = 0 + + (ỹ i QiX β) i (ỹ −Q X { ( 2 ) i 2 i i i β) 2 ∑ ∂σ ∂σi=1 e em } = −1 ∂Σi ∂Σitr Σ−1i − (ỹ i −QiX iβ)TΣ−1 −12 i Σi (ỹ i −Q2 iX iβ)2 ∑{ ( )∂σi=1 e ∂σem } = −1 tr Σ−1i − (ỹ i −Q T −1 −1iX iβ) Σi Σi (ỹ i −QiX iβ) . (5.35)2 i=1 From (5.30), we can get ∂di mjj′ ∑,k = 2 qijj′(1− qijj′)z iij′tzij′k := djj,kt, for integers 1 ≤ k, t ≤ p.∂βt i=1 Define Di,kt as a diagonal matrix with diagonal entries d i jj,kt, then ∂Di,k = Di,kt. (5.36) ∂βt Again, by using (5.31)-(5.35), and (5.36), we can obtain the second derivatives of the log-likelihood function l(φ; ỹ ,X ,C ). That is, ∂2l(φ; ỹ ,X ,C ) ∂β∑∂σ2k vm { ( ) −1 ∂Σ −1 T = tr i ∂β 1 1T + 2 XTQTΣ−1n n i i i 1 1 T Σ−1n n i (ỹ i −QiX iβ)2 ∂β i i i i i=1 k ∂βk −1 } − ∂Σ2(ỹ i −Q{iX β) T i T −1 i ( 1n 1n Σi (ỹ i∑ ∂β i ikm ) −QiX iβ) −1 − −1∂Σi= tr Σ −1i Σi 1 1T + 2δTXTn QTΣ−11 1T Σ−1(ỹ −Q X β)2 ∂β i ni k i i i} ni ni i i i ii=1 k T −1∂Σi+ 2(ỹ −1i −Q{iX iβ) Σi Σi 1 1 T −1 n n Σi (ỹ i −QiX iβ)∑ ( ∂β i i k m 1 ) = − − tr Σ−1i D −1i,kΣi 1 Tn 1n + 2δTkXT T −1 T −12 i i i Qi Σi 1n 1n Σi (ỹ i −QiX} i i i β) i=1 + 2(ỹ −Q X β)TΣ−1 −1 T −1i i i i Di,kΣi 1n 1n Σi (ỹ i −QiX iβ)i i 98 ∂2l(φ; ỹ ,X ,C ) ∂βt∑∂βkm { ( )1 ∂ tr Σ−1D ∂Σ−1 = − i i,k − 2δT T T i 2 ∂β k X i Qi (ỹ i −QiX iβ) i=1 t ∂βt ∂β ∂βT + 2δTXTQTΣ−1Q X + XTQTΣ−1 −1k i i i i i i i i Di,kΣi (ỹ i −QiX iβ)∂βt ∂βt −1 − ∂Σ(ỹ i −QiX β)T i D Σ−1i i,k i (ỹ i −QiX iβ)∂βt − − T −1∂Di,k(ỹ −1i QiX iβ) Σi Σi (ỹ i −QiX iβ)∂βt −1 } − (ỹ −Q X β)T(Σ−1 ∂Σi − − T −1 −1 ∂βi i{ i i Di,k ) (ỹ i QiX iβ) + (ỹ i QiX iβ) Σi Di,kΣ QiX i∑ ∂β it ∂βtm ∂ tr Σ−1D −1 = −1 i i,k − T T T ∂Σ2δ ikX i Qi (ỹ i −QiX iβ)2 ∂βt ∂βi=1 t T + 2δTXTQTΣ−1 ∂β ∂β k i i i QiX + 2 X T i i Q T −1 −1 i Σi Di,kΣi (ỹ i −QiX iβ)∂βt ∂βt ∂Σ−1− 2(ỹ −Q X β)T i D −1i i i i,kΣi (ỹ i −QiX iβ})∂βt − (ỹ −Q X β)TΣ−1∂Di,k −1i i{ i ( i Σi (ỹ i −QiX iβ) 1 ∑ ∂βtm )− ∂Σ−1i −1∂Di,k T T T −1∂Σi= tr Di,k + Σi + 2δkX −12 ∂β ∂β i Qi Σi Σi (ỹ i −QiX iβ)∂β i=1 t t t + 2δTXTQT −1 T T T −1 −1k i i Σi QiX iδ t + 2δ t X i Qi Σi Di,kΣi (ỹ i −QiX iβ) − T −1∂Σi+ 2(ỹ Q X β) Σ Σ−1D Σ−1i i i i i i,k i (ỹ i −Q}iX iβ)∂βt − ∂Di,k(ỹ −Q T −1i i{X iβ) Σi Σ −1 i (ỹ i −QiX iβ) 1 ∑ ( ∂βtm ) = − tr −Σ−1D Σ−1 −1i i,t i Di,k + Σi Di,kt + 2δTXTQTΣ−1 −1k i i i Di,tΣi (ỹ i −QiX iβ)2 i=1 + 2δTXTQT −1k i i Σi QiX iδ t + 2δ T t X T i Q T −1 −1 i Σi Di,kΣi (ỹ i −QiX iβ) + 2(ỹ i −QiX iβ)TΣ−1 −1i Di,tΣi Di,kΣ−1i (ỹ i −}QiX iβ) − (ỹ i −QiX iβ)TΣ−1i D −1i,ktΣi (ỹ i −QiX iβ) 99 ∂2l(φ; ỹ ,X ,C ) ∂β 2k∑∂σem { ( −1) T = −1 ∂Σi ∂βtr + 2 XT T −1 −1i Qi Σi Σi (ỹ −Q X} i i i β) 2 ∂β i=1 k ∂βk − ∂Σ −1 2(ỹ i −Q{iX β) T i i ( Σ −1 i (ỹ i)−QiX iβ)∂βkm −1 ∑ ∂Σi = − tr Σ−1 Σ−1 + 2δTXTQTΣ−1 −1i i k i i i Σi (ỹ i −Q} i X iβ) 2 ∂β i=1 k ∂Σi + 2(ỹ T −1 −1 −1i −Q{iX iβ) Σi Σi Σi (ỹ i −QiX iβ)∑ ( ∂βkm ) = −1 − tr Σ−1D Σ−1 + 2δTXTQTΣ−1Σ−1i i,k i k i i i i (ỹ i −Q} i X iβ) 2 i=1 + 2(ỹ T −1 −1 −1i −QiX iβ) Σi Di,kΣi Σi (ỹ i −QiX iβ) ∂2l(φ; ỹ ,X ,C ) ∂σ2 2v∑∂σvm { ( ) }∂ tr Σ−11 1T −1 = −1 i ni ni − 2(ỹ −Q X β)T ∂Σi 1 1T Σ−1(ỹ −Q X β) 2 { ( i i i n i i i∂σ2 2 i ni ii=1 v ) ∂σvm −1 ∑ − −1∂Σi= tr Σi Σ−11n 1T2 ∂σ2 i i ni i=1 v } ∂Σi + 2(ỹ i −Q{iX iβ) TΣ−1 −1 T −1 ∑ i Σi 1n 1n Σi (ỹ i −QiX iβ) ( ∂σ2 i i v m 1 − )= − tr −Σ 11 1T Σ−1i n n i 1n 1T2 i i i ni i=1 } + 2(ỹ T −1 T −1 T −1i −QiX iβ) Σi 1n 1i n Σi i 1n 1n Σi (ỹ i −QiX iβ)i i ∂2l(φ; ỹ ,X ,C ) ∂σ2v∑∂σ2em { ( ) }1 ∂ tr Σ−1 −1 = − i − 2(ỹ −Q X β)T ∂Σii i i Σ−1i (ỹ{ (2 ) 2 i −QiX iβ) 2 ∑ ∂σv ∂σi=1 vm } −1 − −1∂Σi −1 − T −1∂Σi= tr Σi Σ + 2(ỹ i Q −1 −1{ i X iβ) Σ Σ Σ (ỹ i −QiX iβ) 2 ∑ ( ∂σ 2 i i 2 i i i=1 v ) ∂σvm } = −1 − tr Σ−11 1T Σ−1 + 2(ỹ −Q X β)TΣ−1 Ti n n i i i i i 1n 1n Σ−1i Σ−1i (ỹ i −QiXi i i i iβ)2 i=1 100 ∂2l(φ; ỹ ,X ,C ) ∂σ2e∑∂σ2em { ( ) }1 ∂ tr Σ−1− i − ∂Σ−1= 2(ỹ −Q X β)T i Σ−1(ỹ −Q X β) 2 ∑i=1m { ∂σ( 2 i e ) i i 2 i i i i∂σe } −1= − tr Σ−1∂ΣiΣ−1 − T ∂Σii i + 2(ỹ i QiX iβ) Σ−1 −1 −1i Σi Σi (ỹ i −QiX{ 2 2 } i β) 2 ∑ ( ∂σe) ∂σi=1 em −1= − tr Σ−1 −1 T −1 −1 −1 2 i Σi + 2(ỹ i −QiX iβ) Σi Σi Σi (ỹ i −QiX iβ) i=1 5.4.6 Proof of (5.19) ind Since ỹ i|X i,C i ∼ N(QiX iβ,Σi), then it is not difficult to obtain that [ ] E(ỹ i −QiZ iβ |X i,C i) = 0, E (ỹ i −QiX iβ)(ỹ −Q X β)Ti i i |X i,C i = Σi. (5.37) By using basic properties of trace, we can prove that for any ni × ni matrix M = M (X,C,φ,ψ), which may depends on X and C but not on ỹ, [ ] E (ỹ i −Q X β)Ti i M (ỹ i −QiX iβ)|X,C = tr (MΣi) , (5.38) since [ ] E (ỹ i −[Q T iX( iβ) M (ỹ i −QiX iβ)|X,C ) ] = E [tr ((ỹ −Q X β) T i i i M (ỹ i −QiX iβ)) |X,C] = E (tr [M (ỹ i −QiX iβ)(ỹ −Q X β) T i i i |X,C]) = tr (E M T [(ỹ i −QiX iβ)(ỹ i −QiX iβ) |X,C]) = tr ME (ỹ Ti −QiX iβ)(ỹ i −QiX iβ) |X,C = tr (MΣi) . 101 By using the fact (5.37) and (5.38), we can obtain the following results: ( ) −∂ 2l(φ;y,X,C ) E |X,C ∑{∂βt∂βkm1 ( − − − )= tr −Σ 1 1 1i Di,tΣi Di,k + Σi Di,kt2 i=1 + 2δTXTQTΣ−1k i i i Di,tΣ −1 i E[ỹ i −QiX iβ |X,C ] + 2δTk[X TQT −1 T T T −1 −1i i Σi QiX iδ t + 2δ t X i Qi Σi Di,kΣi E[ỹ i −QiX]iβ |X,C ] + 2E[ (ỹ −Q X β) TΣ−1D −1 −1i i i i i,tΣi Di,kΣi (ỹ i −QiX i]β})|X,C − E (ỹ i{−QiX iβ) TΣ−1D −1i i,ktΣi (ỹ i −QiX iβ)|X,C 1 ∑m ( = tr −Σ− ) 1D Σ−1D + Σ−1D + 2δTXTQTΣ−1i i,t i i,k i }i,kt2 k i i i QiX iδ t i=(1 ) ( ) + 2 tr Σ−1D Σ−1{i i,t i Di,k − tr Σ −1 i Di,kt m 1 ∑ − ( − )}= 2δTkXT T 1 1i Qi Σi QiX iδ t + tr Σi Di,tΣ−1( ) i Di,k , 2 i=1 −∂ 2l(φ;y,C ) E |X,C ∑∂{β 2k∂σvm1 ( − − )= − tr Σ 1 1 T 2 i Di,kΣi 1n 1i ni i=1 + 2δT T Tk[X i Qi Σ −11 1T Σ−1i n n E[ỹ −Q X β |X,C ]i i i i i i ]} + 2E (ỹ{i −Q X β) T i i Σ −1D Σ−1 T −1i i,k i 1n 1n Σi (ỹ i −QiX iβ)|X,C∑ im1 ( − − ) i ( } = − tr Σ 1D Σ 11 1T + 2 tr Σ−1D Σ− ) 1 T i i,k i ni ni i i,k i 1n 1 2 i ni i=1 m 1 ∑ ( = tr Σ− ) 1 i D Σ −1 T i,k 2 i 1n 1i ni ∑i=1m1 = 1T Σ−1 −1 2 ni i Di,kΣi 1n ,i i=1 102 ( ) −∂ 2l(φ;y,X,C ) E |X,C ∑{∂βk∂σm1 ( 2 e ) = − tr Σ−1D Σ−1 + 2δTXTQTΣ−1 −1i i,k i k i i i Σi E[ỹ i −QiX i|X,C ]2 i=[1 ]} + 2E (ỹ{i −QiX T −1 −1 −1 iβ) Σi Di,kΣi Σi (ỹ i −QiX iβ)∑m ( ) ( )} |X,C 1 = − tr Σ−1D Σ−1 + 2 tr Σ−1 −1i i,k2 i i Di,kΣi 1 ∑i=1m ( = tr Σ−1D − ) 1 ( i i,k Σi ,2 i=1 ) 2 −∂ l(φ;y,X,C )E { |X,C 1 ∑ ∂σ 2 2 v∂σ(vm ) = − tr Σ−1i 1n 1Tn Σ−1i 1n 1T2 i i i ni i=[1 ]} + 2E (ỹ{i −Q X β) TΣ−1i i i 1n 1 T −1 T −1 ∑ i n Σi 1n 1n Σi (ỹ i −QiX iβ)|X,Ci i m ( ) i ( )}1 = − tr Σ−1i 1n 1Tn Σ−1 T −1 Ti 1n 1n + 2 tr Σi 1n 1n Σ−1 T2 i i i i i i i 1n 1i ni 1 ∑i=1m ( − )= tr Σ 11 1Ti n n Σ−11 1T2 i i i ni ni∑i=1m1 = 1T Σ−1 T −1 (2 ni i 1n 1i n Σi i 1n ,i i=1 ) 2 −∂ l(φ;y,X,C )E { |X,C∑ ∂σ2∂σ2v em1 ( = − tr Σ−11 T − ) 1 i n 1 Σ2 i ni i i=[1 ]} + 2E (ỹ{i −QiX iβ) TΣ−1i 1n 1 T n Σ −1 Σ−1i i (ỹ −Q X β)|X,C∑ i i i i i m 1 ( } = − tr Σ− ) ( ) 11 1T −1 −1 T −1 2 i ni n Σi i + 2 tr Σi 1n 1i n Σi i 1 ∑i=1m ( ) = tr Σ−1i 1 T −1 n 1 Σ 2 i ni i∑i=1m1 = 1T −1 −1 2 n Σi Σi i 1n ,i i=1 103 ( ) −∂ 2l(φ;y,X,C ) E |X,C ∑{∂σ2e∂σm1 ( 2 e ) = − tr Σ−1Σ−1 2 i i i=[1 ]} + 2E (ỹ −Q X T −1 −1 −1{i i iβ) Σi Σi Σi (ỹ i −QiX iβ)|X,Cm 1 ∑ ( − − ) ( )}= − tr Σ 1Σ 1 + 2 tr Σ−1Σ−1 2 i i i i 1 ∑i=1m ( ) = tr Σ−1Σ−1 2 i i . i=1 5.4.7 Proof of (5.20) and (5.21) When ψ is known, the log-likelihood function of φ based on data {ỹ i,C i, i = 1, . . . ,m} is given by m 1 ∑{ } l(φ; ỹ ,X ,C ) = − ni ln(2π) + ln |Σ |+ (ỹ −Q X β)T −1i i i i Σi (ỹ i −QiX iβ) . 2 i=1 Assuming that the matrix Σi = K + σ 2 i v1n 1 T n + σ 2 i i e In is fixed in β , we havei ∂Σi ∂Σi T ∂Σi= 0, = 1n 1n , = In ,∂β ∂σ2 i i 2 iv ∂σe and ∂Σ−1i = −Σ−1∂Σi −1 −1 T −1 ∂σ2 (i Σ = −Σ∂σ2 i) i 1n 1n Σi i i ,v v ∂ ln |Σi| −1∂Σi ( − )= tr Σi = tr Σ 11n 1T ,∂σ2v ∂σ2 i i niv ∂Σ−1i = − ∂ΣiΣ−1(i Σ −1 = −Σ−1Σ−1, ∂σ2 ∂σ2 i i ie e ) ∂ ln |Σi| − ∂Σi ( − )= tr Σ 1 = tr Σ 1i i .∂σ2 2e ∂σe Using the above results, the first derivative of l(φ; ỹ ,X ,C ) with respect to 104 parameters β , σ2 and σ2v v are given by ∂l(φ; ỹ ,X ,C ) ∂β 1 ∑m { } = − −XTQTi i Σ−1 T −1i (ỹ i −QiX iβ)− (ỹ i −QiX iβ) Σi QiX i ∑2 i=1m = XTQTi i Σ −1 i (ỹ i −QiX iβ), i=1 ∂l(φ; ỹ ,X ,C ) ∂σ∑2vm { } −1 ∂ ln |Σi| ∂Σ −1 = + (ỹ i −QiX iβ)T i (ỹ −Q X β) 2 ∂σ2 ∂σ2 i i i ∑i=1m1 { ( v ) v } = − tr Σ−1i 1 Tn 1n − (ỹ i −QiX β)T −1i Σi 1n 1T −12 i i i n Σ (ỹ −Q X β) i i i i i 1 ∑i=1m { = − 1T −1 T − } 1 T −1 n Σi 1n − (ỹ i −Q Xi i i iβ) Σi 1n 1n Σi (ỹ −Q Xi i i i iβ) ,2 i=1 ∂l(φ; ỹ ,X ,C ) ∂σ 1 ∑ 2 e m { } − ∂ ln |Σ −1 i| ∂Σ = + (ỹ i −QiX iβ)T i (ỹ i −QiX iβ) 2 2 2∑ ∂σi=1 { ( em1 − ) ∂σe } = − tr Σ 1i − (ỹ i −QiX iβ)TΣ−1Σ−12 i i (ỹ i −QiX iβ) . i=1 The second derivatives of l(φ; ỹ ,X ,C ) with respect to parameters β , σ2v and σ2v are given by ∂2 m l(φ; ỹ ,X ,C ) ∑ = − XTQTΣ−1i i i QiX i,∂β∂βT ∑i=12 m∂ l(φ; ỹ ,X ,C ) T T ∂Σ−1= X Q i (ỹ i −QiX iβ) ∂β∂σ2 i iv ∑ ∂σ 2 i=1 v m = − XTQTΣ−11 1T Σ−1i i i n n i (ỹ i −Qi i iX iβ), i=1 105 ∂2l(φ; ỹ ,X ,C ) ∑∂β∂σ2em ∂Σ−1 = XT T ii Qi (ỹ i −QiX2 iβ) i=1∑ ∂σem = − XTQTΣ−1i i i Σ−1i (ỹ i −QiX iβ), i=1 ∂2l(φ; ỹ ,X ,C ) ∂σ2v∂σ 2 v m { −1 ∑ ∂Σ−1T i − − T ∂Σ−1= 1n 1n (ỹ i QiX iβ) i 1 1T Σ−1n (ỹ −Q X β)2 i ∂σ2 i ∂σ2} i ni i i i ii=1 v v − ∂Σ −1 (ỹ i −Qi{X iβ) TΣ−1 T ii 1n 1n (ỹ i −Q X β)∑ i i i i ∂σ2v m −1= − 1T Σ−1n i 1n 1T Σ−112 i i ni i ni i=1 + (ỹ −Q X β)TΣ−1 T −1i i i i 1n 1n Σi 1 Tn 1n Σ−1i i i i i (ỹ i −QiX iβ)} + (ỹ T −1 T −1 T −1i −Qi{X iβ) Σi 1n 1∑ i n Σi 1n 1n Σi (ỹ i −QiX iβ)i i i m = −1 − 1T Σ−11 1T Σ−1ni i n 12 i ni i ni i=1 } + 2(ỹ −Q X β)TΣ−11 1T −1i i i i n n Σi 1 1T −1i i ni n Σi i (ỹ i −QiX iβ) , ∂2l(φ; ỹ ,X ,C ) ∂σ2v∑∂σ2em {1 ∂Σ−1 −1 = − 1T in 1n − − ∂Σ (ỹ i QiX T i T −1 iβ) 1n 1 Σ (ỹ i −QiX iβ) 2 i ∂σ2 i ∂σ2 i ni i i=1 e e −1 } − (ỹ −Q X β)TΣ−1 T ∂Σii i{ i i 1n 1n (ỹ i −QiXi i 2 iβ)∑ ∂σem1 = − − 1T −1 −1 T −1 −1n Σi Σi 1n + (ỹ i −QiX iβ) Σi Σi 1 1T −1n n Σi (ỹ i −QiXi i } i i i β) 2 i=1 + (ỹ i −Q X β)TΣ−11 1T Σ−1 −1i i n Σ (ỹ −Q X β) ∑{ i i ni i i i i i m } −1= − 1T Σ−1n i Σ−1i 1n + 2(ỹ i −QiX β)TΣ−1Σ−11 1T −1i i i i i n2 i n Σ i i (ỹ i −QiX iβ) , i=1 106 ∂2l(φ; ỹ ,X ,C ) ∂σ2e∑∂σ2em { ( )1 ∂Σ−1 ∂Σ−1 = − tr i − (ỹ T ii −QiX iβ) Σ−1(ỹ2 } 2 i i −QiX iβ) 2 ∂σe ∂σi=1 e −1 − − ∂Σ(ỹ i Q X β)TΣ−1 ii{ i i ∂σ2em −1 ∑ ( ) = − tr Σ−1Σ−1i i + (ỹ −Q X β)TΣ−1Σ−1 −1i i i i i Σi (ỹ i −QiX iβ)2 i=1 } + (ỹ i −Qi{X iβ) TΣ−1i Σ −1Σ−1i i (ỹ i −QiX iβ)∑m ( ) } = −1 − tr Σ−1Σ−1 + 2(ỹ −Q T −1 −1 −1i i i iX iβ) Σi Σi Σi (ỹ i −QiX iβ) .2 i=1 5.4.8 Proof of (5.23) | i∼ndSince ỹ i X i,C i N(QiX iβ,Σi), then for any ni × ni matrix M which does not depend on ỹ, we have [ ] E(ỹ i −QiX iβ |X i,C i) = 0n , E (ỹ i −QiX iβ)TM (ỹ i −QiX iβ)|X i,C i = tr (MΣi i) , where 0n is a ni × 1 vector of zeros.i Using the above results, we can get [ ] ∂2 m − l(φ; ỹ ,C ) ∑ E |X T T −1 [ T i ,C i = X Q Σ ∂β∂β ] i i i QiX i, i=1 m −∂ 2l(φ; ỹ ,C ) ∑ E |X ,C = XTQTΣ−11 1T Σ−1i i i i i n n i E [ỹ i −QiX iβ |X i,C[ 2 ] i i i ] = 0p, ∂β∂σv i=1 ∂2 m − l(φ; ỹ ,C ) ∑ E |X i,C i = XTQTΣ−1Σ−1E [ỹ i −QiX iβ |X i,C i] = 0p, ∂β∂σ2 i i i ie i=1 107 [ ] 2 −∂ l(φ; ỹ ,C )E |X i,C i ∑∂σ{2v∂σ2vm1 = − 1Tn Σ−1 T −12 i i 1n 1i n Σi 1i ni i=[1 ]} + 2E (ỹ −Q X β)TΣ−1{i i i i 1n 1 T Σ−11 1T −1 i ni i ni n Σ (ỹ i i i −QiX iβ)|X i ∑m ( )} ,C i , 1 = − 1T Σ−11 1T −1 −1 T −1 T { ni i ni n Σi 1n + 2 tr Σi i i 1n 1 Σ2 i ni i 1n 1i ni ∑i=1m }1 = − 1T −1 T −1 T −1 T −1 2 n Σi 1 1i ni n Σi i 1n + 21i n Σi 1n 1i i n Σ 1i i ni ∑i=1m1 = 1T Σ−11 1T Σ−11 , [ ni i n2 i ni i nii=1 ] −∂ 2l(φ; ỹ ,C ) E |X i,C i ∑∂σ{2 2v∂σem1 = − 1T Σ−1Σ−1 2 ni i i 1ni i=[1 ]} + 2E (ỹ −Q X β)TΣ−1Σ−1{i i i i i 1 T −1 n 1n Σi (ỹ i −Q Xi i i iβ)|X i,C i ,∑m1 − − ( )}= − 1Tn Σ 1i Σ 1i 1n + 2 tr Σ−1Σ−1 T2 { i i i i 1n 1}i ni 1 ∑i=1m = − 1T Σ−1n i Σ−11 + 21T Σ−1Σ−1i n n i i 12 i i i ni i=1 m 1 ∑ = 1T Σ−1Σ−1n i i 1 ,[ n2 i ii=1 ] 2 −∂ l(φ; ỹ ,C )E |X ,C ∑∂σ2∂σ2 i i e e m { 1 ( ) = − tr Σ−1i Σ−12 i i=[1 } + 2E (ỹ −Q X β)TΣ−1Σ−1Σ− ] 1 {i i i i i i (ỹ i −QiX}iβ)|X i,C i∑m1 ( = − tr Σ−1Σ− ) ( ) 1 i i + 2 tr Σ −1Σ−1 2 i i i=1 m 1 ∑ ( ) = tr Σ−1Σ−1 . 2 i i i=1 108 Chapter 6: Monte Carlo Simulation Study 6.1 Introduction In this section, we design a Monte Carlo simulation study to compare finite sample performances of different estimators of regression coefficient β in simple linear and logistic models in the presence of linkage errors. Four different estimators are evaluated: naive estimator β̂N that ignores linkage errors, proposed estimator β̂F that incorporates linkage errors, and two of its computational simpler versions β̂M and β̂M2. These estimators can be derived by solving a set of corresponding estimating equations ( See Table 6.1 ), where QMi and Q M2 i are simplified versions of design matrix Qi = (q i ni,Ni jj′)j=1,j′=1, i = 1, . . . G. All the entries except the largest one are set to zero on each row in QMi , while all the entries except the first two largest are set to zero on each row inQM2i . In our simulation, we assume that linkage errors only exist within blocks, but not across blocks. The conditional independence assumption is made. That is, the agreement on one matching field is independent from that on others. Simulation results for both equal and unequal scenarios are presented. Files Fy and Fx are of the same size (i.e., the number of units) within each block in the equal scenario while they are different in the unequal scenario. However, even in 109 the equal scenario, we allow block sizes to vary across blocks. In each scenario, two sets of simulation conditions are considered in order to compare the performances of different estimators under different levels of linkage errors. Results for these two different scenarios are shown in section 4.1 and 4.2, respectively. We consider G = 100 blocks and R = 100 independent simulation replications throughout each section. In section 6.4, we conduct another Monte Carlo simulation to investigate the difference between the standard and simplified jackknife methods in estimating the variance of the estimators of β. The simulation is performed under two different scenarios: the equal scenario and the unequal scenario. In the equal scenario, sim- ulation is done for a simple logistic model. In the unequal scenario, simulation is done for a simple linear model. Table 6.1: Estimating equations used for four different estimators of regression coefficient β in simple linear and logistic models. Notation is followed from Chapter 2. Est. L∑inear Model L∑ogistic Model β̂N ∑G Ti=1X i {ỹ i −X β} = 0 ∑G Ti i=1X i (ỹ i − g(X i, β)) = 0 β̂ ∑GF i=1(Q X T G T∑ i i ) (ỹ i −QiX iβ) = 0 i=1(QiX i) (ỹ i −Qig(X i, β)) = 0 β̂ G ∑ ( ) (QM TM i=1 i X i) (ỹ M G M i −Qi X iβ) = 0 ∑i=1(Qi X i)T (ỹ Mi −Qi g(X i, β) = 0 β̂ G ) (QM2X )TM2 i=1 i i (ỹ i −QM2i X iβ) = 0 G M2 T M2 i=1(Qi X i) ỹ i −Qi g(X i, β) = 0 exp(x β) Note: g(X , β) = col ij . Q = (qi )ni,Nii 1≤j≤Ni 1+exp(x β) i jj′ij j=1,j′=1, i = 1, . . . G is the design matrix, and QM and QM2i i are its simplified versions. All the entries except the largest one are set to zero on each row in QMi , while all the entries except the first two largest are set to zero on each row in QM2i . 110 6.2 The Equal Scenario Simulation Conditions: The number of records in each block i, ni, across different simulation rep∑lications varies from 10 to 40 in case 1, and from 20 to 40 in case 2. Then there are G 2 2i=1 ni potential links in total with ni potential links in block i ranging from 100 to 1600 in case 1 and from 400 to 1600 in case 2. The number of matching fields, K, across different simulation replications varies between 8 and 10 in case 1, and between 6 and 10 in case 2. Across different replications, probability of agreement on matching field k among matches, mk, and among mismatches, uk, take values from interval (0.55, 0.95) and (0.10, 0.50), respectively, in case 1, whereas they take values from interval (0.55, 0.85) and (0.20, 0.50) in case 2. In general, linkage errors are less likely to occur under case 1 than under case 2 when combining files Fy and Fx, since it has smaller block sizes, more matching fields, and larger probabilities of agreement among matches and smaller probabilities of agreement among mismatches. Hence, we would expect to have better estimates in case 1 than those in case 2. A summary of simulation conditions for case 1 and case 2 is shown in Table 6.2. Simulation Steps: (1) Data Generation: N values of x in Fx and ỹ in Fy are generated based on the selected model and simulated regression coefficient β. A comparison vector c can be generated for each record pair based on their true matching status, the number of matching fields K, probabilities of agreement on matching fields among matches {mk, k = 1, . . . , K}, and among mismatches {uk, k = 1, . . . , K}. Note that only the records within the same blocks are compared. 111 Table 6.2: Simulation conditions for case 1 and case 2 under the equal scenario. Symbol Description Case 1 Case 2 lower upper lower upper limit limit limit limit G number of blocks 100 100 R number of simulations 100 100 x univariate covariate x ∼ N(0, 1) x ∼ N(0, 1) β regression coefficient 0 1 0 1 σ2e regression variance σ 2 = 2 2 2e ni size of block i 10 ∑1− β σe = 1− β40 20 ∑ 40 N size of file Fy and Fx N = G i=1 n G i N = i=1 ni K number of matching fields 8 12 6 10 mk probability of agreement on 0.55 0.95 0.55 0.85 field k for a match uk probability of agreement on 0.10 0.50 0.20 0.50 field k for a mismatch Note: the condition for σ2 is used for simulation of linear regression on linked data only. The observed data for the following linkage step and statistical analysis includes X in Fx, ỹ in Fy, and comparison vector C . (2) Record Linkage: A two class mixture model is fitted to observed com- parison vectors c using the expectation maximization (EM) algorithm. All the parameters in the mixture model are estimated. These parameters consist of weights of class, π, probabilities of agreement on matching fields among matches, {mk, k = 1, . . . , K}, and among mismatches, {uk, k = 1, . . . , K}. The probability of a record pair (j, j′) within block i being a link, qijj′ , is the same as the probability of its corresponding vector cijj′ belonging to class M . It can be estimated by applying Bayes’ Theorem and can be used to partition the record pairs into designated links and non-links. (3) Parameter Estimation: For the naive estimator, it is essential to determine 112 designated links. The designate link to a record j within block i in Fy is a record j′ within the same block in F ix whose corresponding linkage probability qjj′ is the largest among {qijt, t = 1, . . . , ni}. In our case, it is possible a record j′ in Fy is linked to two or more records in Fx since one-to-one assignment is not enforced. For our proposed estimators, the design matrix Q Mi, Qi and Q M2 i for block i need to be constructed based on the estimated linkage probabilities {qijj′ , i = 1, ..., G, j = 1, ..., ni, j = 1, ..., ni}. Qi = (qi )ni,ni M M2jj′ j=1,j′=1, i = 1, . . . m, Qi and Qi are simplified versions of Qi. All the entries except the largest one are set to zero on each row in QMi , while all the entries except the first two largest are set to zero on each row in QM2i . Then the four estimators β̂N , β̂F , β̂M and β̂M2 can be estimated by solving the estimating equations shown in Table 6.1. (4)Variance Estimation: Jackknife is used to estimate bias, variance and mean squared error of each estimator of β. Jackknife replicates are generated by leaving out data of one block from the two files at a time. Hence, there are G = 100 jackknife replicate in total. For each jackknife replicate, step 2 and step 3 are performed and estimates of β are re-evaluated. The jackknife estimates of the bias, variance and mean squared error of an estimator can be obtained by aggregating G replicate estimates of β. And a 95% confidence interval can be obtained for each estimate of β. Step (1) to (4) are performed for R = 100 simulation runs. Performance Evaluation: The performance of the four estimators can be evaluated by the average absolute deviation (AAD) and average squared deviation (ASD) over all the simulation runs. The formulas for AAD and ASD of an estimator 113 β̂ are shown below. ∑R r=1 |β̂(r) − β|AAD(β̂) = ∑ ,RR (β̂(r) − β)2 ASD(β̂) = r=1 , R where β̂(r) is value of β̂ calculated based on simulation r. We can also measures improvement of an estimator β̂ over β̂N with respect to AAD and ASD by relative percent improvement (RPI). The formulas are shown below. AAD(β̂N)− AAD(β̂) RPIAAD(β̂) = × 100%, AAD(β̂N) ASD(β̂N)− ASD(β̂) RPIASD(β̂) = × 100.% ASD(β̂N) To learn more about the properties of these estimators, Monte Carlo estimates of sampling mean, bias, relative bias, variance, relative variance, mean square error, relative mean square error of each estimator are obtained. Suppose R independent replicates are generated, and β̂(r) is the estimate of β computed based on replicate r, r = 1, . . . R. Then the Monte Carlo estimate of sampling mean, variance, and mean square error of β̂ are given by ∑R1 ∑R Ê(β̂) = β̂(r) 1 ,( ) B̂ias(β̂) = β̂ (r) − β, R R r=1 r=1 R R 2 R 1 ∑ 1 ∑ 1 ∑ V̂ (β̂) = β̂(r) − β̂(r) , M̂SE(β̂) = (β̂(r) − β)2. R− 1 R R r=1 r=1 r=1 And the Monte Carlo standard deviations of the estimated mean, bias, mean square error are given by sd(β̂) SE(Ê(β̂)) = SE((B̂ias(β̂)))= √ ,R sd (β̂ − β)2 SE(M̂SE(β̂)) = √ . R 114 6.2.1 Linear regression with linked data In each simulation run, values of a scalar independent variable X are randomly and independently generated from N(0, 1), and the corresponding values of Y are given by yij = xijβ + e, i = 1, . . . , G, j = 1, . . . , ni. where the regression coefficient β is randomly selected from a uniform distribution in [0, 1], and the random errors  are randomly and independently sampled from N(0, σ2e) with σ 2 = 1− β2e . Scatter plots of β̂N , β̂F , β̂M , β̂M2 estimates versus true values of β are shown in Fig 6.1. The true values of β is on the x-axis and the estimated value of β is on the y-axis. A 45 degree straight line is plotted in red, and a fitted straight line is plotted in blue. If an estimator performs well, the data points should gather closely around red line and the blue line should be close to the red line. Based on the results, we can see that naive estimator β̂N usually underestimate β under both of the two cases. This phenomenon is even more obvious under case 2. This is because the linkage errors in the linked data weakened the correlation between y and x, which introduce a bias toward zero when estimating the slope of the regression line. All of our proposed estimators correct this bias, with full estimator and max2 estimator being the most efficient, and max estimator being the least efficient. The max estimator β̂M seems to overestimate β a little bit when the true value of β increases, but it still behaves much better than β̂N especially in Case 2. In general, our proposed estimators behave better than the naive estimator based the visual results. This is 115 Figure 6.1: Simulation results for a simple linear regression under case 1 and case 2 in the equal scenario: Scatter plots of naive, full, max and max2 estimates versus true values of regression coefficient β. Diagonal lines with slop 1 are plotted in red. Fitted lines are plotted in blue. probably because they take account of the linkage errors in the linked data. Fig 6.2 shows heat maps of 100 absolute deviations and squared deviations of each estimators from true values of β in a linear model under two cases. The darker the color is, the smaller the absolute deviation is. We can clearly see that our proposed estimators performs much better than the naive estimator, especially in case 2. Table 6.3 shows the average absolute deviations (AAD) and average squared deviations (ASD) of our proposed estimators for β, as well as the relative percent improvement (RPI) over the naive estimator under Case 1 and Case 2 in the equal scenario. Values of AAD and ASD are shown in black and values of RPI are shown in blue. Under both cases, β̂N has the smallest values of AAD and ASD among the four estimators, implying it performs the worst in estimating the regression coefficient in 116 Figure 6.2: Simulation results for linear regression under case 1 and case 2 in the equal scenario: Heat map of absolute deviations (top 2) and squared deviations (bottom 2) for estimates of regression coefficient β. a simple linear model; β̂M2 performs better, but not as well as β̂F and β̂M2. We can also see that values of AAD and ASD increase under Case 2 when compared to Case 1. This is as expected because Case 2 has more difficult simulation conditions (less matching fields, larger block sizes, small probabilities of agreement among matches and larger probabilities of agreement among mismatches) and then linkage errors are more likely to occur. However, our proposed estimators improved more over the naive estimator in Case 2 then in Case 1, indicated by the larger values of RPI under Case 2. It shows that our proposed estimator would be especially useful when linkage errors are more likely to occur. 117 Table 6.3: Simulation results for linear regression under case 1 and case 2 in the equal scenario: Average absolute deviations (AAD) and average squared deviations (ASD) of naive, full, max and max2 estimators of regression coefficient β. The percent relative improvement (PRI) of the proposed estimators over naive estimator is shown in blue. Estimator Case 1 Case 2 AAD ASD AAD ASD β̂N 0.0601 0.0070 0.2211 0.0670 β̂F 0.0166 0.0005 0.0286 0.0015 72.43% 92.33% 87.08% 97.82% β̂M 0.0435 0.0027 0.0664 0.0060 27.63% 61.92% 69.95% 91.00% β̂M2 0.0176 0.0006 0.0308 0.0017 70.65% 91.85% 86.05% 97.48% 6.2.2 Logistic regression with linked data In this section, we want to compare the performances of different estimators of regression coefficient β in a simple logistic model. Simulation for logistic regression basically follows the same steps as those for linear regression. The only difference is in the generation of values of x and y. In each simulation run, values of a scalar independent variable x are randomly and independently selected from N(0, 1), and the corresponding values of y are given by exp(βxij) P (yij = 1|xij) = g(xij) = , i = 1, . . . ,m, j = 1, . . . , ni, 1 + exp(βxij) where the regression coefficient β is randomly selected from a uniform distribution in [0, 1]. Fig 6.3 shows scatter plots of naive, full, max and max2 estimates and true values of β in a simple logistic regression on 100 simulation runs under Case 1 118 and Case 2 in the equal scenario. Heat map for absolute deviations and squared deviations for each estimator on 100 simulation runs is in Fig 6.4, and values of average absolute deviations (AAD) and average squared deviations (ASD) of our proposed estimators and their relative percent improvement (RPI) are displayed in Table 6.4. Results shown in Fig 6.3, Fig 6.4 and Table 6.4 for logistic regression on linked data are similar to those for linear regression. Again, a bias toward zero is introduced to the naive estimator. The proposed estimators correct this bias, with max2 estimator and full estimator being the most efficient, and max estimator being the least efficient. The increased relative percent improvements from Case 1 to Case 2 implied again that our proposed estimators improve more over the naive estimator when the linkage error are more likely to occur. Table 6.4: Simulation results for logistic regression under case 1 and case 2 in the equal scenario: Average absolute deviations (AAD) and average squared deviations (ASD) of naive, full, max and max2 estimators of regression coefficient β. The percent relative improvement (PRI) of the proposed estimator over naive estimator is shown in blue. Estimator Case 1 Case 2 AAD ASD AAD ASD β̂N 0.0811 0.0112 0.2320 0.0811 β̂F 0.0527 0.0045 0.0755 0.0098 35.01% 59.78% 66.61% 87.90% β̂M 0.0681 0.0080 0.1215 0.0269 16.02% 28.60% 47.65% 66.78% β̂M2 0.0517 0.0043 0.0796 0.0106 36.24% 61.28% 65.70% 86.99% In order to further evaluate the performances of these four estimators, another set of 100 simulation runs for logistic regression under case 1 in equal scenario is 119 Figure 6.3: Simulation results for logistic regression under case 1 and case 2 in the equal scenario: Scatter plot of naive, full, max and max2 estimates versus true values of regression coefficient β. Diagonal lines with slop 1 are plotted in red. Fitted lines are plotted in blue. performed with β fixed at 0.5. Box plot of deviations and relative deviations of different estimators from the true value of β is shown in Fig 6.5. Table 6.5 gives the Monte Carlo estimate of bias, relative bias, mean square error (MSE), relative mean square error, length and coverage of nominal 95% confidence intervals of β for each method. The standard errors of these estimates are shown in blue. The negative values of bias and relative bias of naive estimator implies it underestimate values of β, and the other three estimators correct this bias, with max2 estimator and full estimator being the most efficient. The correctness of bias and relative bias also lead to the decrease of mean square error and relative mean square error. In terms of mean square error and relative mean square error, the max2 estimator β̂M2 performs the best, followed closely by the full estimator β̂F . The relative efficiency 120 Figure 6.4: Simulation results for logistic regression under case 1 and case 2 in the equal scenario: Heat map for absolute deviations (top 2) and squared deviations (bottom 2) of estimates of regression coefficient β. of each proposed estimator to the naive estimator with respect to mean square error is given in Table 6.6. We can also see that the coverage rates of confidence intervals produced by max2 and full estimators and their jackknife variances are very close to their desired nominal level, while those produced by naive estimator is lower than the desired nominal level. 6.3 The Unequal Scenario In this section, we attempt to compare performances of different estimators under the unequal scenario where the number of records in the same blocks are different in Fx and Fy. Our parameter of interest is the regression coefficient β in a simple linear model. Two sets of simulation conditions are considered. They are similar to those used for the equal scenario, but slightly different. The number of observations of y in block i of file Fy, ni, is different from the number of observations of x in the same block of file Fx, Ni, i = 1, . . . ,m. ni ranges from 10 to 20 under 121 Figure 6.5: Simulation results for logistic regression under case 1 in the unequal scenario: Box plot of deviations and relative deviations of different estimators from the true value of β over 100 simulation runs. Figure 6.6: Simulation results for logistic regression under case 1 in the equal scenario: Plot of Monte Carlo estimate of bias, relative bias, vari- ance, relative variance, mean square error (MSE), relative mean squared error(RMSE), length and coverage of nominal 95% confidence intervals of regression coefficient β by different methods over 100 simulation runs. Value of β is set to 0.5 for each simulation. 122 Table 6.5: Simulation results for logistic regression under case 1 in the equal scenario: Monte Carlo estimate of bias, relative bias (R.Bias), variance, relative variance (R.Var), mean square error (MSE), relative mean squared error(R.MSE), length and coverage rate (C.R.) of nominal 95% confidence intervals of regression coefficient β by different methods over 100 simulation runs. Value of β is set to 0.5 for each simulation. The corresponding estimated standard deviation is show in blue. Bias R.Bias Var R.Var MSE R.MSE Length C.R. Naive -0.0731 -0.1462 0.0044 0.0175 0.0097 0.0387 0.1869 63% 0.0066 0.0132 0.0011 0.0045 0.0022 0.0485 Full -0.0096 -0.0192 0.0046 0.0185 0.0047 0.0187 0.2429 93% 0.0068 0.0136 0.0007 0.0027 0.0048 0.0256 Max 0.0430 0.0859 0.0042 0.0167 0.0060 0.0239 0.2388 90% 0.0065 0.0129 0.0009 0.0035 0.0051 0.0302 Max2 -0.0007 -0.0014 0.0037 0.0148 0.0037 0.0147 0.2259 95% 0.0061 0.0122 0.0005 0.0019 0.0034 0.0219 Table 6.6: Simulation results for logistic regression under case 1 in the equal scenario: Relative efficiency (RE) of proposed estimators to naive estimator with respect to mean square error. Naive Full Max Max2 MSE 0.0097 0.0047 0.0060 0.0037 RE 0.4821 0.6179 0.3798 Case 1 and from 20 to 30 under Case 2. The ratio of ni and Ni is denoted by ri, which varies from 1.5 to 3, and Ni, is set to Ni = bniric. With this setup, linkage errors would be more likely to occur compared to the equal size scenario. For the details of the simulation conditions, see Table 6.7. 6.3.1 Linear regression with linked data Fig 6.7 displays scatter plots of naive, full, max and max2 estimates versus the true values of β in a simple linear model under Case 1 and Case 2 in the unequal scenario. Similar to the results for the equal scenario, the naive estimator underes- timate the regression coefficient, indicated by the obvious discrepancy between the 123 Table 6.7: Simulation conditions for Case 1 and Case 2 under the unequal scenario. Symbol Description Case 1 Case 2 lower upper lower upper limit limit limit limit G number of blocks 100 100 R number of simulations 100 100 x univariate covariate x ∼ N(0, 1) x ∼ N(0, 1) β regression coefficient 0 1 0 1 σ2 regression variance σ2e e = 1− β2 σ2e = 1− β2 ni size of block i in Fy 10 20 20 30 ri ratio of sizes of block i in Fy 1.5 3 1.5 3 and Fx Ni size of block i in Fx Ni = bniric Ni = bniric K number of matching fields 8 12 6 10 mk probability of agreement on 0.55 0.95 0.55 0.85 field k for a match uk probability of agreement on 0.10 0.50 0.20 0.50 field k for a mismatch red line and the blue line. Our proposed estimators, especially the full estimator, correct the bias. The fitted blue line for the full estimator almost coincide with the diagonal red line. Fig 6.8 shows the heat map of absolute deviations and squared deviations of each estimator over the 100 simulation runs. The simulation runs are sorted in a decreasing order of values of full estimator. The darker the color is, the smaller the absolute deviation is. Table 6.8 gives the average absolute devia- tions (AAD) and average squared deviations (ASD) of our proposed estimator and its relative percent improvement over the naive estimator. Results show that our proposed estimators, especially full estimator and max2 estimator, performs better than the naive estimator, especially in the case where linkage errors are more likely to occur. In order to further evaluate the performances of these four estimators, another 124 Figure 6.7: Simulation results for linear regression under case 1 and case 2 in the unequal scenario: Scatter plot of naive, full, max and max2 estimates versus true values of regression coefficient β in a simple linear model on 100 simulation runs. Diagonal lines with slop 1 are plotted in red. Fitted lines are plotted in blue. 125 Figure 6.8: Simulation results for linear regression under case 1 and case 2 in the unequal scenario: Heat map of absolute deviations and squared deviations for estimates of regression coefficient β in a simple linear model on 100 simulation runs. set of 100 simulation runs for linear regression under case 1 of unequal scenario is performed with β fixed at 0.5. Box plot of deviations and relative deviations of different estimators from the true value of β is shown in Fig 6.9. Table 6.9 gives the Monte Carlo estimate of bias, relative bias, mean square error (MSE), relative mean square error(RMSE), length and coverage of nominal 95% confidence intervals of β for each method. The standard errors of these estimates are shown in blue. The negative values of bias and relative bias of naive estimator implies it underestimate values of β, and the other three estimators correct this bias, with full estimator being the best. Our proposed estimators also efficiently decreased the variance by about 50% when compared to the naive estimator. We can also see that the coverage rate of confidence intervals produced by full estimators and their jackknife variances is 126 Table 6.8: Simulation results for linear regression under case 1 and case 2 in the unequal scenario: Average absolute deviations (AAD) and average squared devia- tions (ASD) of naive, full, max and max2 estimators of regression coefficient β in a simple linear model over 100 simulation runs. The percent relative improvement (PRI) of the proposed estimator over naive estimator is shown in blue. Estimator Case 1 Case 2 AAD ASD AAD ASD β̂N 0.0573 0.0061 0.2780 0.1056 β̂F 0.0192 0.0006 0.0366 0.0023 66.51% 89.88% 86.84% 97.80% β̂M 0.0456 0.0028 0.0665 0.0062 20.55% 53.06% 76.08% 94.09% β̂M2 0.0201 0.0007 0.0402 0.0026 64.96% 88.45% 85.53% 97.52% very close to its desired nominal level, with only 1 percent off, while those produced by other estimators are lower than the desired nominal level. In terms of mean square error and relative mean square error, the full estimator β̂F performs the best, followed by the max2 estimator β̂M2 and max estimator β̂M . The relative efficiency of each proposed estimator to the naive estimator with respect to mean square error is given in Table 6.10. 6.4 Comparison of the Standard and Simplified Jackknife Methods Inspired by Jiang, Lahiri and Wan (2005), we proposed to use Jackknife method to estimate variance of each estimator of β. A Jackknife replicate i is constructed by leaving out data from blocks i, i = 1, . . . , G in file Fy and Fx. The estimate of mixture model parameters ψ = {π,m1, . . . ,mK , u1, . . . , uK} are re- estimated at each replicate data, and then are used to estimate the probability of 127 Figure 6.9: Simulation results for linear regression under case 1 in the unequal scenario: Box plot of deviations and relative deviations of dif- ferent estimates from the true value of β over 100 simulation runs. Figure 6.10: Simulation results for linear regression under case 1 in the unequal scenario: Plot of Monte Carlo estimate of bias, relative bias, variance, relative variance, mean square error (MSE), relative mean squared error(RMSE), length and coverage of nominal 95% confidence intervals of regression coefficient β by different methods over 100 simu- lation runs. Value of β is set to 0.5 for each simulation. 128 Table 6.9: Simulation results for linear regression under case 1 in the unequal sce- nario: Monte Carlo estimate of bias, relative bias(R.Bias), variance, relative variance (R.Var), mean square error (MSE), relative mean squared error(RMSE), length and coverage rate (C.R.) of nominal 95% confidence intervals of regression coefficient β by different methods over 100 simulation runs. Value of β is set to 0.5 for each simulation. The corresponding estimated standard deviation is show in blue. Bias R.Bias Var R.Var MSE R.MSE Length C.R. Naive -0.0785 -0.1569 0.0028 0.0112 0.0089 0.0357 0.1149 35% 0.0053 0.0106 0.0012 0.0049 0.0025 0.0479 Full 0.0061 0.0122 0.0011 0.0043 0.0011 0.0044 0.1225 94% 0.0033 0.0066 0.0001 0.0006 0.0023 0.0239 Max 0.0501 0.1003 0.0016 0.0064 0.0041 0.0164 0.1294 69% 0.0040 0.0080 0.0005 0.0020 0.0031 0.0465 Max2 0.0117 0.0235 0.0014 0.0056 0.0015 0.0061 0.1240 88% 0.0037 0.0075 0.0002 0.0009 0.0027 0.0327 Table 6.10: Simulation results for linear regression under case 1 in the unequal scenario: Relative efficiency (RE) of proposed estimators to naive estimator with respect to mean square error. Naive Full Max Max2 MSE 0.0089 0.0011 0.0041 0.0015 RE 0.1244 0.4607 0.1713 linkage and designate the record pairs as links and non-links. Then β̂−i, the esti- mate of β for replicate i, can be obtained based on replicate data and ψ̂−i. And the jackknife variance of an estimator of β can be obtained by aggregating these G replicate estimates of β. This way, the estimated variance V̂ cannot only cap- ture the variability caused by linkage errors but also cover the variability caused by expectation maximization algorithm However, the above jackknife method of estimating variance is time consum- ing because of the complexity of process. The expectation maximization algorithm is operated G + 1 times estimate mixture model parameters during the entire pro- 129 cess, 1 time on the full data to obtain ψ̂, G times on all jackknife replicates to get {ψ̂−i, i = 1, . . . , G}. Also, the design matrix Q used in the estimating functions need to be re-constructed for each jackknife replicate since it depends on the mix- ture model parameters. We wonder whether the jackknife method can be simplified to decrease the computation time without losing much accuracy. Based on some of our findings that (1) the expectation maximization algorithm takes most part of the computation time during the entire process; (2) there’s no big difference between the estimate of mixture model parameters on the full data and on the jackknife replicates, we propose a simplified version of the above jackknife method. Instead of re-estimating mixture model parameters ψ for each jackknife replicate, we use the estimate obtained from the full data ψ̂ during the entire process. Let V̂0 denote the estimated variance of an estimate obtained from the simplified jackknife method. Next, we conduct a Monte Carlo simulation study to show that the difference be- tween these two estimated variances V̂ and V̂0 is quite small. Again, the simulation is performed under two different scenarios: the equal scenario and the unequal scenario. In the equal scenario, simulation is done for a simple logistic model under Case 1 with simulation conditions as shown in Table 6.2. In the unequal scenario, simulation is done for a simple linear model under Case 1 with simulation conditions as shown in Table 6.7. Under both scenarios, the true value of the regression coefficient β is set to 0.5. For each of the R simulation runs, two different jackknife methods are used to estimate variance of an estimate. 130 6.4.1 Equal Scenario Figure 6.11 shows the box plot of the relative difference of V̂0 and V̂ of each estimate of β. First, most of the relative differences among R simulation runs are above 0, showing that V̂ is usually greater than V̂0. This is expected since V̂0 ignores the variability in estimating mixture model parameters while V̂0 does not. Also, we can clearly see the relative difference of V̂ and V̂0 is large for the naive estimator but small for our propose estimators, especially for the full estimator β̂F , which is within 0.1. We wonder whether the absolute relative difference of V̂0 and V̂ is smaller than a positive constant L. We would like to test the hypotheses H0: d = C. H0: d < C where d >= 0 is the absolute relative difference of V̂0 and V̂ . A one sample t test may be considered. However, the histogram of the absolute relative difference, shown in Figure 6.12, looks strongly skewed, suggesting lack of normality. We would therefore to use the Wilcoxon signed rank test, a non- parametric statistical hypothesis test. The test statistic W+is the sum of ranks of the positive difference of d and C. That is, ∑R W+ = |d− C| × I{sign(d− C) = 1} i=1 Under the null hypothesis, W+ has mean R(R + 1) µW+ = 4 131 and standard deviation √ R(R + 1)(2R + 1) σW+ = 24 Under the null hypothesis, the distribution of the signed rank statistic W+ converges to a normal distribution when the number of replicate R becomes large. Then we can use the normal probability calculation (with continuity correct) to approximate P-value for W+. The P-value is equal to ( ) ( ) + + P (X < W+ W − µW+ W − µW+ W+ ) = P Z <= = Ψ σW+ σW+ The Wilcoxon signed rank statistics and their corresponding p values of the hypotheses test for different choices of C are shown in Table 6.11. Based on the result, there is strong evidence that the absolute relative difference between V̂ and V̂0 is less than 0.02 for full estimator, and less than 0.04 for all the proposed estimators. There’s no evidence that the absolute relative difference is within 0.05 for the naive estimator. Table 6.11: Simulation results for logistic regression under case 1 in the equal sce- nario: Table of test statistic and p values of one tailed Wilcoxon signed rank test. The alternative hypothesis is that the absolute relative difference between the two jackknife variances of the estimate of β is less than C. C 0.01 0.02 0.03 0.04 0.05 Estimator w p.value w p.value w p.value w p.value w p.value Naive 4954 1.00 4787 1.00 4618 1.00 4412 1.00 4104 1.00 Full 3440 1.00 1367 0.00 654 0.00 244 0.00 45 0.00 Max 4234 1.00 3452 1.00 2578 0.57 1936 0.02 1388 0.00 Max2 4327 1.00 3073 0.97 2196 0.13 1439 0.00 864 0.00 132 Figure 6.11: Simulation results for logistic regression under case 1 in the equal scenario: Box plot of relative difference between the two jackknife variances of each estimate of β. The true value of β is set to 0.5. Figure 6.12: Simulation results for logistic regression under case 1 in the equal scenario: Relative frequency histogram of absolute relative difference between the two jackknife variances of each estimate of β. The black line is the kernel density estimate. 133 Figure 6.13: Simulation results for linear regression under case 1 in the unequal scenario: Box plot of relative difference between the two jack- knife variances of each estimate of β. The true value of β is set to 0.5. 6.4.2 Unequal Scenario Similar results are obtained for the linear regression under case 1 in the unequal scenario. The box plot of relative differences of V̂ and V̂0 is displayed in Figure 6.13, the histogram of absolute relative differences is shown in Figure 6.14, and the results for the hypothesis test is given in Table 6.12. Based on the result, there is strong evidence that the absolute relative difference of V̂ and V̂0 is within 0.05 for all the proposed estimators, but not for the naive estimator. 6.4.3 Conclusions Based on the results of the one-sided hypothesis test, we concludes that the ab- solute relative difference between the variances obtained from the standard jackknife 134 Figure 6.14: Simulation results for linear regression under case 1 in the unequal scenario: Relative frequency histogram of absolute relative dif- ference between the two jackknife variances of each estimate of β. The black line is the kernel density estimate. Table 6.12: Simulation results for linear regression under case 1 in the unequal scenario: Table of test statistic and p values of one tailed Wilcoxon signed rank test. The alternative hypothesis is that the absolute relative difference between the two jackknife variances of the estimate of β is less than C. C 0.01 0.02 0.03 0.04 0.05 Estimator w p.value w p.value w p.value w p.value w p.value Naive 5034 1.00 4994 1.00 4895 1.00 4767 1.00 4519 1.00 Full 4918 1.00 4354 1.00 3534 1.00 2773 0.80 1911 0.02 Max 4713 1.00 3887 1.00 3066 0.97 2343 0.27 1860 0.01 Max2 4866 1.00 4225 1.00 3443 1.00 2631 0.64 2019 0.04 method and from the simplified jackknife method are small for our proposed estima- tors, but not for the naive estimator, under both the equal scenario and the unequal scenario. Hence, we would recommend to use the simplified jackknife method to estimate variance for any of the proposed estimators if one would like to pursue less computation time. 135 Chapter 7: Future Research Our research has initiated some new ideas and points to several directions for future research. A key assumption of the general methodology for small area estimation is that the records from multiple sources can be partitioned into small areas without error, and small areas coincide with blocks. In reality, however, this may not hold in different situations: (1) The variable specifying the membership of small areas is not present in all files to be linked. Thus, it is impossible to partition the records in each file into small areas before the record linkage process. (2) Even when the records in multiple files can be divided into small areas successfully, the number of record pairs within some small areas could be large, and thus blocking within small areas may be a reasonable option in order to reduce computational burden. In this case, our general methodology works when a small modification to the linkage error model is made. The matrix of matching status indicators Li = diag(Li1, . . . ,LiG ) turns block-diagonal, assuming that the recordsi in small area i are partitioned into Gi blocks without error. (3) It is also possible that small areas are nested within blocks. In this case, the 136 model is likely to introduce correlation across small areas. Our general methodology works for point estimation because global parameters like the regression coefficients and variance components will be estimated properly as long as we have a large number of blocks. As for variance estimation, a new method is required since the current jackknife methods used in the dissertation requires the measurements to independent across small areas. Our research is limited to performing statistical analysis on data from two different files. Specifically, we consider the case where the variable of interest and its predictors are observed separately for two samples of population units. When developing the current methodology, we see the potential of extending it to an even more general case, where observations on the variable of interest and some of its predictors are recorded in one file and observations on the rest of its predictors are stored in other multiple files. The basic idea is to use a system of linkage error models and a system of mixture models. The validity of the idea need to be investigated in the future. Our proposed methodology requires the measurements to be independent across blocks or small areas. This is mainly due to the assumption of the jackknife methods we used for estimating bias, variance, and mean squared errors. Recently, Jiang and Mahmoud proposed a Monte-Carlo-assisted approach to mean squared error estimation of a small area estimate, which allows correlation across small areas. It can be a potential tool to improve our research. The dissertation is focused on the classical method of small area estimation using data from multiple files. The classical unit-level models are used for describing 137 the relationship between the study variable and auxiliary variables, and the mixture model is used for the purpose of record linkage. In the literature of small area esti- mation, Hierarchical Bayesian approaches have been suggested due to the following advantages: (1) It is straightforward to take into account all sources of variation. (2) The MCMC techniques have made it computationally feasible and easy to estimate the model. (3) The Bayesian approaches allow the use of the one-to-one matching assump- tion, so that we do not need to be concerned about the one-to-many and many-to-one linkage problem that usually occurs when classical methods are used. In the future, we would like extend our research to use Bayesian methods for small area estimation using data from multiple files. In this dissertation, Monte Carlo simulations are used to provide preliminary evidence supporting the validity of our general methodology. In the future, we would like to apply the classical and Bayesian methods of statistical analysis using data from multiple files to address some real issues. Poverty mapping and nonresponse adjustment are two possible applications. We may have very limited information about the individuals in poverty or nonrespondents from the sampling frame, but more valuable information about them can be obtained if we can link the survey and administrative data. As the amount of information increases, more advanced models can be built to help us understand their behavior and further improve the accuracy of poverty estimates or efficiency of weight adjustment. For example, weighting class adjustment method is commonly used for nonresponse adjustment 138 when relatively few variables are available. If additional variables can be obtained from record linkage, response propensity models, using logistic regression, can be applied to predict the likelihood of response versus nonresponse, and then provide a weighting factor. When it comes to poverty mapping, the more advanced models can better predict the poverty status of an individual that was not sampled in the survey, and further provide a more reliable poverty estimate for each small area. 139 Bibliography [1] Alvey, M. and Jamerson, N. (1997). Record linkage techniques-1997. Proceed- ings of an International Workshop and Exposition (1997 March), pp. 20-21. [2] Armstrong, J.B. and Mayda, J.E. (1993) Model-based estimation of record linkage error rates, Survey Methodology, 19, 137-147. [3] Becker, M.P. and Yang, I. (1998) Latent class marginal models for cross- classifications of counts, Sociological Methodology, 28, 293-325. [4] Battese, G. E., Harter, R. M. and Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Jour- nal of the American Statistical Association, 83, 28-36. [5] Belin, T.R. Evaluation of sources of variation in record linkage through a fac- torial experiment, Survey Methodology, 19 (1993) 13-29. [6] Belin, T.R and Rubin, D.B (1995) A method for calibrating false-match rates in record linkage, Journal of the American Statistical Association, Vol. 90, No. 430, 694-707. [7] Binder, D.A. (1983) On the variances of asymptotically normal estimators from complex surveys, International Statistical Review, 51, 279-292. [8] R. Chambers, Regression analysis of probability-linked data, Statisphere (2009), 4. [9] Chambers, R., Chipperfield, J.O., Davis, W. and Kovacevic, M. (2009b). ”In- ference based on estimating equations and probability-linked data”. In: Centre for Statistical and Survey Methodology, Working Paper Series. Wollongong: University of Wollongong, p. 38. 140 [10] Chambers, R. and Kim, G. (2016). Secondary analysis of linked data in (K., Harron, H., Goldstein and C., Dibben, Eds) Methodological developments in data linkage, pp. 83-108,Chichester: Wiley. [11] Chipperfield, J.O., Bishop, G.R. and Campbell, P. (2011). Maximum likelihood estimation for contingency tables and logistic regression with incorrectly linked data, Survey Methodology, vol. 37, pp. 13-24. [12] Chipperfield, J.O., Chambers, R. (2015). ”Using the bootstrap to analyse bi- nary data obtained via probabilistic linkage”. Journal of Official Statistics, vol. 31, pp. 397-414. [13] Dasylva, A. (2014). ”Design-based Estimation with Record-Linked Admin- istrative Files”. In: Statistics Canada. Beyond traditional survey tak- ing: adapting to a changing world: Proceedings of the 2014 International Methodology Symposium, 29-31 October 2014, Ottawa, Canada. Ottawa: Statistics Canada; 2014.http://www.statcan.gc.ca/sites/default/files/ media/14265-eng.pdf [14] Dempster, A., Laird, N., and Rubin, D. (1977). ”Maximum Likelihood from In- complete Data via the EM Algorithm”. Journal of the Royal Statistical Society Series B, 39, pp. 1-38. [15] Fellegi, I.P. and Sunter, A.B. (1969) A theory for record linkage, Journal of the American Statistical Association, Vol. 64 1183-1210. [16] Fortini, M., Liseo, B., Nuccitelli, A., and Scanu, M. (2000), On Bayesian record linkage, Bayesian Methods with Applications to Science, Policy, and Official Statistics: Selected Papers from ISBA 2000: The Sixth World Meeting of the International Society for Bayesian Analysis, E. I. George ed., 155-164. [17] Fortini, M., Nuccitelli, A., Liseo, B., and Scanu, M. (2002), Modelling issues in record linkage: A Bayesian perspective, Proc. Amer. Statist. Assoc. Survey Research Meth. Sec., [18] Gill, R. (1986), On Estimating Transition Intensities of a Markov Process with Aggregate Data of a Certain Type: “Occurrences but No Exposures”, Scand. Jour. Statist., 13, 113-134. [19] Gill, L. E. (1997), OX-LINK: The Oxford Medical Record Linkage System Demonstration of the PC Version, in Record Linkage Techniques - 1997, Proc. International Workshop Exposition, Federal Committee on Statistical Method- ology, Office of Management of the Budget, page 491. 141 [20] Goldstein, H., Harron, K. and Wade, A. (2015). The analysis of record-linked data using multiple imputation with data value priors, Statistics in Medicine, vol. 21, pp. 1485-1496. [21] Thomas R. Belin, Evaluation of sources of variation in record linkage through a factorial experiment, Survey Methodology, 19 (1993) 13-29. [22] A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum likelihood from in- complete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society, Ser. B, 39 (1977) 1-38. [23] Gomatam, S., and Larsen, M.D. (2004), Record linkage and counterterrorism, Chance 17, 25-29. [24] Harron, K., Goldstein, H. and Dibben, C., eds. (2016). Methodological Devel- opments in Data Linkage. John Wiley & Sons, Hoboken, NJ. [25] Herzog, T.N., Scheuren, F.J. and Winkler, W.E. (2007) Data Quality and Record Linkage Techniques, Springer, New York, NY. [26] Herzog, T. N., Scheuren, F., and Winkler, W.E., (2010), Record Linkage, in (D. W. Scott, Y. Said, and E.Wegman, eds.) Wiley Interdisciplinary Reviews: Computational Statistics, New York, N. Y.: Wiley, 2 (5), September/October, 535-543 [27] Hof, M. H. P. and Zwinderman, A.H. (2012) Methods for analyzing data from probabilistic linkage strategies based on partially identifying variables, Statistics in Medicine,31, 42314242, DOI: 10.1002/sim.5498. [28] Howe, G.R. (1981) A generalized iterative record linkage computer system for use in medical follow-up studies, Computers and Biomedical Research, 14, pp. 327-340. [29] Jaro, M. A. (1989). Advances in record linkage methodology to matching the 1985 census of Tampa, Florida, Journal of the American Statistical Association, 84, pp. 414-420. [30] Jiang, J. (2007), Linear and Generalized Linear Mixed Models and Their Ap- plications. New York: Springer. [31] Jiang, J., Lahiri, P. and Wan, S-W. (2002) A unified jackknife theory for em- pirical best prediction with M-estimation, Annals of Statistics, 30, 1782-1810. 142 [32] Kandari, N. and Lahiri, P. (2016), Prediction of a function of misclassified binary data, Statistics in Transition new series, Vol. 17, No. 3, 429-447. [33] Kim, G. and Chambers, R. (2012a). Regression analysis under incomplete link- age. Computational Statistics and Data Analysis, 56, no. 9, pp. 27562770. [34] Kim, G. and Chambers, R. (2012b). Regression analysis under probabilistic multi-linkage. Statistica Neerlandica, 66, no. 1, pp. 6479. [35] Kim, G. and Chambers, R. (2013). Bias reduction for correlated linkage error. Working Papers Series. Wollongong: NIASRA, University of Wollongong. [36] Krewski, D., Dewanji, A., Wang, Y., Bartlett, S., Zielinkski, J.M. and Mallick, R. (2001). Regression Analysis with Linked Data. Survey Methodology, 31(1), 13-22. [37] Lahiri, P. and Larsen, M.D. (2005) Analysis with linked data, Journal of the American Statistical Association, 100, No. 469, 222-230. [38] Lane, J. (2010). Linking administrative and survey data.In P. V. Marsden & J. D. Wright (Eds.), Handbook of survey research (p. 659-680). Bingley: Emerald. [39] Larsen, K. (2005), Generalized Naive Bayes Classifiers, SIGKDD Explorations, 7 (1), 2005, 76-81,doi¿10.1145/1089815.1089826. [40] Larsen, M.D. (1999a), Multiple imputation analysis of records linkage using mixture models, Proc. Statist. Soc. Canada Survey Meth. Sec., 65-71. [41] Larsen, M.D. (1999b), Predicting the residency status for administrative records that do not match Census records,Administrative Records Research Memoran- dum Series, #20,Bureau of the Census, U.S. Department of Commerce. [42] Larsen, M.D. (2002), Comment on hierarchical Bayesian record linkage, Proc. Sec. Bayesian Statist. Sci., American Statistical Association meeting, New York City, NY, CDROM: 1995-2000. [43] Larsen, M. D. (2004), Record linkage using finite mixture models, Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives, A. Gelman and X-L. Meng, eds. 309-318. [44] Larsen, M.D. and Rubin, D.B. (2001) Iterative automated record linkage using mixture models, Journal of the American Statistical Association, Vol. 96, 32-34. 143 [45] Livingston, E. H., and Ko, C. Y. (2005), Effect of diabetes and hypertension on obesity-related mortality, SURGERY 137, 16-25. [46] Lohr, S. L., and Rao, J.N,K. (2009), Jackknife estimation of mean squared error of small area predictors in nonlinear mixed models, Boimetrika 96, 457- 468, 16-25. [47] McCutcheon, A. L. (1987), Latent class analysis, Sage Publications, Inc.: New- bury Park, CA; London. [48] McGlinchy, M. (2004), A Bayesian record linkage methodology for multiple imputation of missing links, Proc. Amer. Statist.Assoc. Survey Research Meth. Sec., Alexandria, VA, CDROM. [49] McLachlan, G. and Peel, D. (2000), Finite Mixture Models,1st ed., Wiley- Interscience. [50] Meng, X.L. and Rubin, D.B. (1993) Maximum likelihood estimation via the ECM algorithm: a general framework, Biometrika, Vol. 80 267-278. [51] Neter, J., Maynes, E. and Ramanathan, R. (1965) The Effect of Mismatching on the Measurement of Response Error, Journal of the American Statistical Association, 60. [52] Newcombe, H. B., Kennedy, J. M., Axford, S. J. and James, A. P.(1959), Automatic linkage of vital records, Sceince 130,954-959. [53] Newcombe, H.B., and Kennedy, J. M. (1962). Record Linkage:Making Maxi- mum Use of the Discriminating Power of Identifying Information Communica- tions of the Association for Computing Machinery, 5, 563-567. [54] D’Orazio, M., Zio, M.D., and Scanu, M. (2006) Statistical Matching: Theory and Practice, John Wiley & Sons. [55] Rao, J.N.K. (2002) Estimating equations for the analysis of survey data using poststratification information, Sankhyā, 64, 364-378. [56] Rässler, S. (2002) Statistical matching: a frequentist theory, practical ap- plications, and alternative Bayesian approaches, Lecture Notes in Statistics, Springer. 144 [57] Sadinle, M. and Fienberg, S. (2013). A generalized Fellegi-Sunter framework for multiple record linkage with application to homicide record-systems Journal of the American Statistical Association, 108, pp. 385-397. [58] Rao, J.N.K. and Molina, I. (2015), Small Area Estimation, 2nd ed., Wiley. [59] Särndal, C-E, Swensson, B., and Wretman, J. (1992) Model Assisted Survey Sampling, Springer-Verlag. [60] Samart, K. and Chambers R. (2014) Linear regression with nested errors using probability-linked data, Australian and New Zealand Journal of Statistics,56(1), 27-46. [61] Scheuren, F.J. and Winkler, W.E. (1993) Regression analysis of data files that are computer matched, Survey Methodology, 19, 39-58. [62] Scheuren, F.J. and Winkler W.E. (1997). Regression Analysis of Data that are Computer Matched - Part ii. Survey Methodology, 23(2):157-165. [63] Schnell, R. (2013), Linking Surveys and Administrative Data, German Record Linkage Center Working Paper Series, No. WP-GRLC-2013-03. [64] Steorts, R. C., Hall, R. and Fienberg, S. E. (2015). A Bayesian Approach to Graphical Record Linkage and De-Duplication. Journal of the American Statistical Association, in press. [65] Tancredi, A., and Liseo, B. (2011) A hierarchical Bayesian approach to record linkage and population size problems. Annals of Applied Statistics, 5, 1553-1585. [66] Tancredi, A. and Liseo, B.(2015). Regression analysis with linked data: prob- lems and solutions, Statistica. [67] Tancredi, A., Steorts, R.C., Liseo, B. (2017). A Bayesian approach for dedupli- cation, record linkage and inference with linked data. Working paper, MEMO- TEF, Sapienza Università di Roma. [68] Thibaudeau, Y. (1993) The discrimination power of dependency structures in record linkage, Survey Methodology, 19, 31-38. [69] Wang, J. and Donnan, P. (2002). Adjusting for missing record-linkage in out- come studies. Journal of Applied Statistics, Aug; 29(6):873-884. 145 [70] Winkler, W. E. (1988), Using the EM algorithm for weight computation in the Fellegi-Sunter model of record linkage, Proc. Amer.Statist. Assoc. Survey Research Meth. Sec., 667- 671. [71] Winkler, W. E. (1989), Near automatic weight computation in the Fellegi- Sunter model of record linkage, Proc. Bureau of the Census Annual Research Conference 5, 145-155. [72] Winkler, W.E. (1990) String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage, Proceedings of the Section on Survey Research Methods, American Statistical Association (1990) 354-359. [73] Winkler, W. E. (1993), Improved decision rules in the Fellegi-Sunter model of record linkage, Proc. Amer. Stastist. Assoc. Survey Research Meth. Sec., 274- 279. [74] Winkler, W. E. (1994), Advanced methods for record linkage, Amer. Statist. Assoc. Survey Research Meth. Sec., 467-472. [75] Winkler, W. E. (1995), Matching and record linkage, in Business Survey Meth- ods, Cox, B. G., Binder, D. A., Chinnappa, B. N., Christianson, A., Colledge, M. J., and Kott, P. S. eds., Wiley, New York, 355-384. [76] Winkler, W. E. (1995), Editing Discrete Data, Proceedings of the Section on Survey Research Methods, American Statistical Association, 108-113. [77] Winkler, W. E. (2007). Examples of easy-to-implement, widely used methods of masking for which analytic properties are not justified. Technical report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC. [78] Winkler, W. E. (2014). Matching and Record Linkage. Wiley Interdisciplinary Reviews. Computational Statistics, 6, 313-325. 146