ABSTRACT
Title of dissertation: Statistical Inference Using Data From Multiple
Files Combined Through Record Linkage
Ying Han, Doctor of Philosophy, 2018
Dissertation directed by: Professor Partha Lahiri
Joint Program of Survey Methodology &
Department of Mathematics
Record linkage methods help us combine multiple data sets from different
sources when a single data set with all necessary information is unavailable or when
data collection on additional variables is time consuming and extremely costly. Link-
age errors are inevitable in the linked data set because of the unavailability of an
error-free and unique identifier and because of possible errors in measuring or record-
ing. It has been realized that even a small amount of linkage errors can lead to
substantial bias and increase variability in estimating the parameters of a statistical
model. The importance of incorporating uncertainty of the record linkage process
into the statistical analysis step cannot be overemphasized.
The current research is mainly focused on the regression analysis of the linked
data. The record linkage and statistical analysis processes are treated as two sepa-
rate steps. Due to the limited information about the record linkage process, simpli-
fying assumptions on the linkage mechanism have to be made. In reality, however,
these assumptions may be violated. Also, most of the existing linkage error models
are built on the linked data set, which only contains records for the designated links.
Information about linkage errors carried by the designated non-links is missing.
In the dissertation, we provide general methodologies for both regression anal-
ysis and small area estimation using data from multiple files. A general integrated
model is proposed to combine the record linkage and statistical analysis processes.
The proposed linkage error models are built directly on the data values from the
original sources, and based on the actual record linkage method that is used. We
have adapted the jackknife methods to estimate bias, variance, and mean squared
error of our proposed estimators. To illustrate the general methodology, we give one
example of estimating the regression coefficients in the linear and logistic regression
models, and another example of estimating small area mean under the nested-error
linear regression model. In order to reduce the computational burden, simplified
version of the proposed estimators, jackknife methods, and numerical algorithms
are given. A Monte Carlo simulation study is devised to evaluate the performance
of the proposed estimators and to investigate the difference between the standard
and simplified jackknife methods.
Statistical Inference Using Multiple Data Files Combined Through
Record Linkage Techniques
by
Ying Han
Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2018
Advisory Committee:
Professor Partha Lahiri, Chair/Advisor
Professor Cinzia Cirillo
Professor Paul Smith
Professor Takumi Saegusa
Professor Yan Li
©c Copyright by
Ying Han
2018

Dedication
I dedicate this dissertation to my caring husband, Alexander Estes.
ii
Acknowledgments
First of all, I would like to thank my dissertation advisor, Partha Lahiri. This
dissertation would not have been possible completed without his consistent guidance
and support. It is him who brought me into the area of Survey Methodology, the
area I would devote myself into. It is him who gave me the opportunity to work
on multiple interesting projects for the past years. It is him who helped me in my
job hunting when graduation was approaching. It has been a great pleasure to work
with him during my graduate life.
I would also like to give my special thanks to the rest of my committee mem-
bers, Cinzia Cirillo, Paul Smith, Takumi Saegusa, and Yan Li for taking the time
to read the thesis and give me their insightful feedbacks.
Thank you to all the other faculties in the Department of Mathematics, Eric
Slud, Leonid Koralov, Benjamin Kedem, Abram Kagan, and Joan Ren, for helping
me building a solid background in mathematics and statistics.
Thank you to all my friends, especially, Yimei Fan, Xia Li, Xuan Yao, and
Chen Wang, for encouraging me when I got frustrated, and providing suggestions
whenever I need some.
Thank you to my parents, Zhongtang Han and Shuxiangfan, and my sister,
Shujuan Han, for supporting me over the years. Thank you to the rest of the family
for their support and collective wisdom. Thank you to my husband, Alex Estes, for
being with me the whole way.
Thank you to everyone else who has helped me.
iii
Table of Contents
Dedication ii
Acknowledgements iii
List of Tables vii
List of Figures viii
List of Abbreviations ix
List of Notations x
1 Introduction 1
1.1 Record Linkage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Fellegi and Sunter Model . . . . . . . . . . . . . . . . . . . . . 2
1.2 Statistical Analysis of Linked Data . . . . . . . . . . . . . . . . . . . 7
1.2.1 Linkage Mechanisms . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Linkage Error Model: Chambers (2009) . . . . . . . . . . . . . 11
1.2.3 Linkage Error Model: Scheuren and Winkler (1993) . . . . . 14
1.3 Discussion and Overview of the Dissertation . . . . . . . . . . . . . . 16
2 Regression Analysis of Data from Two Files 19
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Problem Description and Data Availability . . . . . . . . . . . . . . . 20
2.3 General Integrated Model for Regression Analysis . . . . . . . . . . . 21
2.3.1 Regression Model . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.2 Linkage Error Model . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.4 Designation of Links and non-Links . . . . . . . . . . . . . . . 26
2.4 Estimation of Regression Coefficients . . . . . . . . . . . . . . . . . . 27
2.5 Variance Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.7 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
iv
2.7.1 Proof of (2.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.7.2 Proof of (2.7) . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 Applications to Linear and Logistic Regression 39
3.1 Linear Regression using Data from Two Files . . . . . . . . . . . . . 39
3.2 Logistic Regression using Data from Two Files . . . . . . . . . . . . . 45
3.3 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.1 Proof of (3.5) . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.3.2 Proof of (3.4), (3.6), and (3.7) . . . . . . . . . . . . . . . . . . 50
3.3.3 Proof of (3.10) . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.3.4 Proof of (3.11) . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3.5 Proof of (3.13) . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4 Small Area Estimation with Data from Two Files 57
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.2 Small Area Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.3 General Integrated Model for Small Area Estimation . . . . . . . . . 59
4.3.1 Problem Description and Data Availability . . . . . . . . . . . 59
4.3.2 General Integrated Model for Small Are Estimation . . . . . . 61
4.4 Empirical Best Prediction Estimator . . . . . . . . . . . . . . . . . . 66
EBP
4.5 Estimation of the Mean Squared Error of θ̂i . . . . . . . . . . . . 68
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5 Application to the General Linear Mixed Model 71
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2 General Linear Mixed Model with Block Diagonal Covariance . . . . 72
5.3 Nested Error Linear Model . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3.1 Estimation of Small Area Mean Using Data From a Single File 75
5.3.2 Estimation of Small Area Mean Using Data From Two Files . 75
5.3.3 Estimation of φ: Maximum Likelihood Method . . . . . . . . 80
5.3.4 Numerical Algorithms . . . . . . . . . . . . . . . . . . . . . . 82
5.3.5 Estimation of φ: Pseudo Maximum Likelihood Method . . . . 86
5.4 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4.1 Proof of (5.2) . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.4.2 Proof of (5.11) . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4.3 Proof of (5.12) . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.4 Proof of (5.15) . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.5 Proof of (5.16) and (5.17) . . . . . . . . . . . . . . . . . . . . 96
5.4.6 Proof of (5.19) . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.4.7 Proof of (5.20) and (5.21) . . . . . . . . . . . . . . . . . . . . 104
5.4.8 Proof of (5.23) . . . . . . . . . . . . . . . . . . . . . . . . . . 107
v
6 Monte Carlo Simulation Study 109
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.2 The Equal Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
6.2.1 Linear regression with linked data . . . . . . . . . . . . . . . 115
6.2.2 Logistic regression with linked data . . . . . . . . . . . . . . . 118
6.3 The Unequal Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 121
6.3.1 Linear regression with linked data . . . . . . . . . . . . . . . 123
6.4 Comparison of the Standard and Simplified Jackknife Methods . . . . 127
6.4.1 Equal Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.4.2 Unequal Scenario . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.4.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
7 Future Research 136
Bibliography 140
vi
List of Tables
2.1 Data layout for observations on y and x in Fy and Fx. . . . . . . . . . 21
4.1 Data layout for joint observations on y and x for each small area . . . 59
4.2 Data layout for observations on y and x for each small area in Fy and
Fx . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
6.1 Estimating equations for β estimators in linear and logistic models. . 110
6.2 Simulation conditions for two cases under the equal scenario. . . . . . 112
6.3 AAD, ASD and PRI of β estimators for linear regression (Equal Sce-
nario). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.4 AAD, ASD and PRI of β estimators for logistic regression (Equal
Scenario). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
6.5 Monte Carlo estimates and standard deviations for logistic regression
(Equal, Case 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.6 RE of β estimators for logistic regression (Equal Scenario, Case 1). . 123
6.7 Simulation conditions for two cases under the unequal scenario. . . . 124
6.8 AAD, ASD and PRI of β estimators for linear regression (Unequal
Scenario). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.9 Monte Carlo estimates and standard deviations for linear regression
(Unequal Scenario, Case 1) . . . . . . . . . . . . . . . . . . . . . . . . 129
6.10 RE of β estimators for linear regression (Unequal Scenario, Case 1). . 129
6.11 Wilcoxon signed rank test for absolute relative difference in estimated
variance for logistic regression (Equal Scenario, Case 1). . . . . . . . 132
6.12 Wilcoxon signed rank test for absolute relative difference in estimated
variance for linear regression (Unequal Scenario, Case 1). . . . . . . 135
vii
List of Figures
6.1 Scatter plots of β estimates in a linear model (Equal Scenario). . . . . 116
6.2 Heat maps of absolute and squared deviations of β estimates in a
linear model (Equal Scenario). . . . . . . . . . . . . . . . . . . . . . . 117
6.3 Scatter plots of β estimates in a logistic model (Equal Scenario). . . . 120
6.4 Heat maps of absolute and squared deviations of β estimates in a
logistic model (Equal Scenario). . . . . . . . . . . . . . . . . . . . . 121
6.5 Box plots of deviations and relative deviations of β estimators for
logistic regression (Equal Scenario, Case 1). . . . . . . . . . . . . . . 122
6.6 Plots of Monte Carlo estimates and standard deviations for logistic
regression (Equal Scenario, Case 1). . . . . . . . . . . . . . . . . . . . 122
6.7 Scatter plots of β estimates in a linear model (Unequal Scenario). . . 125
6.8 Heat maps of absolute and squared deviations of β estimates in a
linear model (Unequal Scenario). . . . . . . . . . . . . . . . . . . . . 126
6.9 Box plots of deviations and relative deviations of β estimates for linear
regression (Unequal Scenario, Case 1). . . . . . . . . . . . . . . . . . 128
6.10 Plots of Monte Carlo estimates and standard deviations for linear
regression (Unequal Scenario, Case 1). . . . . . . . . . . . . . . . . . 128
6.11 Box plot of relative differences in estimated variance of β estimators
for logistic regression. (Equal Scenario, Case 1). . . . . . . . . . . . . 133
6.12 Histogram of absolute relative differences in estimated variance of β
estimators for logistic regression (Equal Scenario, Case 1). . . . . . . 133
6.13 Box plot of relative differences in estimated variance of β estimators
for linear regression. (Unequal Scenario, Case 1). . . . . . . . . . . . 134
6.14 Histogram of absolute relative differences in estimated variance of β
estimators for linear regression (Unequal Scenario, Case 1). . . . . . . 135
viii
List of Abbreviations
AAD Average Absolute Deviation
ASD Averate Squared Deviation
BLUE Best Linear Unbiased Estimator
BP Best Prediction
CMSE Conditional Mean Squared Error
EBP Empirical Best Prediction
EM Expectation Maximization
ECM Expectation Conditional Maximization
ML Maximum Likelihood
MLE Maximum Likelihood Estimate
MOM Method of Moments
OLS Ordinary Least Squares
WLS Weighted Least Squares
PML Pseudo Maximum Likelihood
PMLE Pseudo Maximum Likelihood Estimate
MSE Mean Squared Error
PRI Percent Relative Improvement
RE Relative Efficiency
REML Restricted Maximum Likelihood
SE Standard Error
ix
List of Selected Notations
U the population of interest.
y a scalar random variable of interest.
x a vector random variable of order p.
w the vector of K matching fields.
c the comparison vector variable.
Fy the file containing observations on y.
Fx the file containing observations on x.
Sy the set of population units covered Fy.
Sx the set of population units covered Fx.
M the set of matches.
M c the set of mismatches.
N the number of records in Fx.
n the number of records in Fy.
G the number of blocks.
m the number of small areas.
Ni the number of records in block or small area i from file Fx.
ni the number of records in block or small area i from file Fy.
K the number of matching fields.
mk the probability of a record pair agreeing on matching field k among matches M .
µk the probability of a record pair agreeing on matching field k among mismatches M
c.
ỹj the value of y for record j in Fy.
x̃j the value of x corresponding to ỹj.
xj′ the value of x for record j
′ in Fx.
yj′ the value of y corresponding to xj′ .
w̃j the value of w for record j in Fy.
wj′ the value of w for record j
′ in Fx.
cjj′ the value of c for the record pair (j, j
′).
l ′ the true matching status of the record pair (j, j′jj ).
ỹij the value of y for record j in block or small area i from Fy.
x̃ij the value of x corresponding to ỹij.
xij′ the value of x for record j
′ in block or small area i from Fx.
yij′ the value of y corresponding to xij′ .
w̃ij the value of w for record j in block or small area i from Fy.
wij′ the value of w for record j
′ in block or small area i from Fx.
cijj′ the value of c for the record pair (j, j
′) in block or small area i.
li ′jj′ the true matching status of the record pair (j, j ) in block or small area i.
ỹ i the ni × 1 vector of observed y values in block of small area i from Fy.
X̃ i the ni × p matrix of unobserved true x values corresponding to ỹ i.
X i the Ni × p matrix of observed x values in block or small area i from Fx.
x
y i the Ni × 1 vector of unobserved true y values corresponding to X i.
C i the Nini ×K matrix of the comparison vector c in block or small area i.
Li the ni ×Ni matrix of the unknown true matching status in block or small area i.
ỹ the n× 1 vector of observed y values in Fy.
X̃ the n× p matrix of unobserved true x values corresponding to ỹ.
X the N × p matrix of observed x values in Fx.
y the N × 1 vector of unobserved true y values corresponding to X .
C the Nn×K matrix of the comparison vector c.
L the n×N matrix of the unknown true matching status.
xi
Chapter 1: Introduction
1.1 Record Linkage
In record linkage, or exact file matching, one compares two or more files on
a single population in absence of a unique and error-free identifier for purposes of
unduplication or production of an enhanced, merged database (e.g., Newcombe et
al. 1959, Fellegi and Sunter 1969, Herzog et al. 2007). Record linkage differs from
statistical matching in terms of the types of units to be linked or matched. The
primary goal of record linkage is to link an entity (e.g., person, household, farm,
etc.) from one file to the same entity in other file(s). In contrast, the primary goal
of statistical matching is to link similar units (e.g., matching the same demographic
group from different files). In this dissertation, our focus is on the statistical esti-
mation related to record linkage and not statistical matching. Readers interested in
statistical matching are referred to Rässler(2002), D’Orazio (2006), and others.
A merged or linked database, created by record linkage, is of great interest
to analysts interested in certain specialized multivariate analysis, which would be
otherwise either impossible or difficult without advanced statistical expertise as
variables are stored in different files. Record linkage is used in many applications,
including population size estimation at the Census Bureau (Winkler 1994, 1995, and
1
Jaro 1989), epidemiological and medical studies (e.g., Gill 1997), sociological studies,
survey frame improvement, and, more recently, counter-terrorism (Gomatam and
Larsen 2004). For more information on its applications, see Alvey and Jamerson
(1997) and references therein. The National Death Index is matched to existing
insurance, medical, and other databases for studies (e.g., Livingston and Ko 2005).
Record linkage techniques can be broadly classified into deterministic and
probabilistic record linkages. They both use common matching fields available from
files to be linked that are indicative of a true match status of an entity. Examples
of matching fields include last name, date of birth, address, etc. In deterministic
record linkage, a record pair is deemed a link if the two records agree on all or some
available matching fields according to a pre-specified rule, and hence there is no
stochastic element in the deterministic record linkage process. On the other hand,
if such a link is only deemed a link with certain probability it is called probabilistic
record linkage. This dissertation concerns probabilistic record linkage.
1.1.1 Fellegi and Sunter Model
Fellegi and Sunter (1969) first developed a theoretical framework for record
linkage. Suppose we have two files FA and FB, which contain records for a sample
SA of size n and a sample SB of size N , respectively, from the same population
U . Let j represent the index of a record in F ′A, and let j represent the index
of a record in FB. The goal of record linkage is to partition all record pairs in
the set FA × FB = {(j, j′) : j ∈ F , j′A ∈ FB} into two disjoint sets: the set
2
of matches M = {(j, j′) : l ′jj′ = 1, j ∈ FA, j ∈ FB} and the set of mismatches
M c = {(j, j′) : ljj′ = 0, j ∈ FA, j′ ∈ FB}. Here, ljj′ represent the true matching
status of the record pair (j, j′); that is, ljj′ = 1 if record j from FA and record j
′
from FB actually correspond to the same population unit.
The goal of record linkage is achieved by making comparisons comparisons of
information on matching fields between records in FA and records in FB. The match-
ing fields usually do not include a unique and error-free identifier, such as Social
Security Number. Examples of matching fields include name, gender, race, date of
birth, address, etc. Some matching fields (e.g., last name) have more discriminatory
power than the others (e.g., gender) in distinguishing matches from mismatches.
Let w = (w )Kk k=1 denote a vector of K matching fields, and let w
A
j and w
B
j′ represent
the values of w for record j ∈ FA and j′ ∈ FB, respectively. The record linkage
model is built on a comparison vector c = (ck)
K
k=1, which is a vector-valued variable
that displays the pattern of agreement and disagreement on matching fields. The
simplest method of constructing comparison vectors is to use exact matching. The
comparison vector c Kjj′ = (cjj′k)k=1 for the record pair (j, j
′) is defined as:
 A B
cjj′k = 
1 if wjk = wj′k
.
0 if wAjk =6 wBj′k
For example, when there are K = 3 matching fields, the possible values of a com-
parison vector are (1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 0, 0), (0, 1, 0), (0, 0, 1) and
(0, 0, 0). Matches tend to have more ones in their comparison vectors than mis-
matches.
3
Fellegi and Sunter (1969) proposed an optimal decision rule to designate record
pairs into links and non-links. The decision rule is optimal in the sense that it
minimizes the number of records requiring clerical review at a fixed error level. The
decision rule is based on the following likelihood ratio score:
P (cjj′|(j, j′) ∈M) P (cjj′ |ljj′ = 1)
Rjj′ = = (1.1)
P (cjj′|(j, j′) ∈M c) P (cjj′ |ljj′ = 0)
Intuitively, the larger the likelihood ratio Rjj′ is, the more likely the record pair
(j, j′) is to be a true match. Therefore, based on the values of Rjj′ , one of three
decisions are to be made based on the following rule:
• If Rjj′ > RU , then designate the record pair (j, j′) as a link ;
• If RL < Rjj′ < RU , then send the record pair (j, j′) to clerical review;
• If Rjj′ < RL, then designate the record pair (j, j′) as a non-link ;
where RU and RL are the optimal upper and lower thresholds, respectively, which
are determined at a pre-specified error levels for false links and false non-links. Note
that any monotone increasing function (such as logarithm) of the likelihood ratio
Rjj′ can serve equally well as a test statistic for the purpose of record linkage.
The probabilities in (1.1) are unknown and need to be estimated. To simplify
the estimation of these probabilities, Fellegi and Sunter (1969) made a conditional
independence assumption: agreements on matching fields are independent within
4
matches and mismatches. That is,
∏K ∏K
P (c |l = 1) = P (c | c ′l = 1) = m jj k(1−m )1−c′ ′ ′ ′ jj′kjj jj jj k jj k k ,
k∏=1 k∏=1K K c
P (c jj
′k 1−c
jj′|ljj′ = 0) = P (c ′ jj′kjj k|ljj′ = 0) = uk (1− uk) , (1.2)
k=1 k=1
where mk = P (cjj′k = 1|ljj′ = 1) and uk = P (cjj′k = 1|ljj′ = 0) are the probabilities
of a record pair agreeing on the matching field k among matches M and mismatches
M c, respectively. Under this assumption, estimation of the unknown probabilities in
(1.1) is reduced to estimation of the matching parameters {mk, uk, k = 1, · · · , K}.
The optimality of Fellegi and Sunter’s method heavily depends on the accuracy of
the estimates of these matching parameters.
Fellegi and Sunter (1969) also suggested a method called blocking to reduce
the computational burden caused by the large amount of comparison vectors. The-
oretically, all the record pairs in set FA × FB should be considered for comparison.
However, comparison of two moderate files can lead to an extremely large amount of
comparison vectors. For example, 105 comparison vectors will be generated for two
files of sizes 102 and 103. Fellegi and Sunter (1969) suggested to partition records
into blocks based on whether they agree on one or a set of characteristics, such as
zip code or the first three digits of phone numbers. Only records within the same
block are compared. In this way, the number of comparison vectors can be greatly
reduced.
Remark 1: Elements of the comparison vector c can be binary or continu-
ous. Besides exact matching, a comparison vector with binary elements can also
be constructed in a more general way by assigning a distance function dk(·) and a
5
threshold τk to each matching field k, k = 1, · · · , K. For the record pair (j, j′), its
comparison vector can be defined as: cjj′k = 1 if dk(wjk, wj′k) ≤ τk and cjj′k = 0
if dk(wjk, wj′k) > τk, k = 1, · · · , K. For example, the string comparator (Winkler
1990) can be used as a distance function for a string-valued matching field.
Remark 2: The conditional independence assumption has been criticized
since it often fails in practice. However, estimation under the conditional indepen-
dence assumption can still provide accurate decision rules even though the assump-
tion is violated; see e.g., Thibaudeau (1993), Winkler (1989a, 1994). The conditional
independence assumption can also be relaxed by introducing interactions among
matching fields; see e.g., Armstrong and Mayda 1993, Thibaudeau 1993, Larsen and
Rubin 2001.
Remark 3: The optimality of the decision rule heavily depends on the accu-
racy of the estimates of the matching parameters and the choices of the upper and
lower thresholds; see e.g., Belin 1993, Belin and Rubin 1995. Also, it is possible that
a record in FA is linked to two or more records in FB, since all the record pairs with
their likelihood ratio scores above the upper threshold will be designated as links.
Some programming approaches have been developed to force one-to-one linkage; see
e.g. Jaro (1989) and Fortini et al. (2002). Though these approaches can help to
avoid the occurrence of the one-to-many or many-to-one linkage problems in record
linkage, it may also remove the true matches.
6
1.2 Statistical Analysis of Linked Data
Probabilistic record linkage procedures are subject to linkage errors. There
are two types of linkage errors. The linkage error is called false positive if a true
mismatch is deemed a link by the record linkage procedure. On the other hand,
the linkage error is called false negative if a true match is deemed a non-link by
the record linkage procedure. Neter et al. (1965) showed that a relatively small
amount of linkage errors could lead to substantial bias in estimating a regression
relationship. If one simply ignores the linkage errors, analysis of linked data could
yield misleading results in a scientific study. Therefore, the importance of accounting
for linkage errors in statistical analysis cannot be overemphasized.
1.2.1 Linkage Mechanisms
Suppose that a linked data set is generated by combining the records from
two files Fy and Fx through some record linkage techniques. Here, Fy represents
the file containing the observed values of a scalar variable y, and Fx represents the
file containing the observed values of a vector-valued variable x of order p. Let X
denote the matrix of the observed x values in file Fx, let y denote the vector of the
unobserved true y values corresponding to X , and let y? denote the vector of the
y values that are selected from the file Fy and linked to X . Thus, the linked data
set contains (y?,X ). Most of the existing linkage error models are directly built
on the linked data by exploiting the relationship between y? and y. Under certain
assumptions, the randomness of the record linkage process can be generally modeled
7
via the following identity:
y? = Ty (1.3)
Here, T = (tjj′)
n,N
j=1,j′=1 is an unknown random permutation matrix, with tjj′ repre-
senting the true matching status between y?j and yj′ , where y
?
j is the y value that
is linked to xj and yj′ is the true y value corresponding to xj′ ; that is, tjj′ = 1 if
y?j and yj represent the y value for the same population unit, tjj′ = 0 otherwise. So
t ?jj = 1 indicates that (xj, yj ) is correctly linked.
The distribution of linkage errors depends on the characteristic of the proba-
bilistic record linkage method that is actually used. Here, based on the conditional
distribution of the matching status matrix T given the observed data y?, X , C , we
classify the linkage mechanisms into three categories:
• LCAR: The linkage is called linkage completely at random (LCAR) if the
conditional distribution of T given the data y?, X and C , say f(T |y?,X ,C ;γ)
does not depend on y?, X and C ; that is, f(T |y?,X ,C ;γ) = f(T ;γ) for all y?,
X , C , γ , where γ denotes a vector of unknown parameters. Note that it does
not mean that the linkage process is random, but the linkage process does not
depends on values of y?, X and C .
• LAR: The linkage is called linkage at random (LAR) if the conditional distri-
bution of T given the data y?, X and C depends only X or/and C , but not
on y?. That is, f(T |y?,X ,C ;γ) = f(T |X,C ;γ) for all y?, X , C , γ .
• LNAR: The linkage is called linkage not at random (LNAR) if the conditional
8
distribution of T given the data y?, X and C depends on y?.
However, the detailed information about record linkage may not be available
to the people who perform statistical analysis. To adjust the linkage bias in the
statistical analysis of the linked data, assumptions on the linkage mechanism have
to be made based on the available information one can obtain about linkage errors.
In secondary data analysis, researchers can only get access to the linked data
generated from the record linkage process. The information about linkage errors can
be obtained from a training sample of the linked data. The true matching status for
each linked record pair in the training sample can be determined through clerical
review. If the linked data and the training sample are available to researchers, there
is a scope for correcting the linkage bias in the statistical analysis. Neter et al.
(1965) discussed this secondary analysis using an audit sample. In the context of
understanding the effects of low-level radiation data on cancer death rate, Lahiri
(1995) and Krewski et al. (2001) suggested analysis of the Cox proportional hazard
model using information contained in a sample to correct for linkage error biases.
More recently, following Neter et al. (1965), Chambers (2009) put forward a variety
of methods for different secondary data analyses that use a sample to correct for
linkage error biases. Following the work of Chambers (2009), researchers advanced
the secondary data analysis of the linked data in several different directions; see e.g.,
Chambers et al. (2009), Chipperfield et al. (2011), Kim and Chambers (2012a,b,
2013), Samart and Chambers (2014), Dasylva (2014), Chipperfield and Chambers
(2015), and Chambers and Kim (2016). Kandari and Lahiri (2016), following up
9
on Lahiri (1995), suggested a theory of predicting a function misclassified binary
variables using information from a sample. However, due to the limited information
about the linkage process in secondary data analysis, researchers typically assume
that the linkage is LAR or LCAR (limited to dependence on X only).
In primary data analysis, researchers can get access to not only the linked
data but also some summary information generated during the record linkage pro-
cess, such as values of matching fields, values of comparison vectors, the matching
weights (such as the likelihood ratio score), the estimated linkage probabilities,
and so on. This detailed information can assist researchers to learn more about
the linkage mechanism, and can be potentially used to correct the linkage bias in
the statistical procedures. Scheuren and Winkler (1993, 1997) showed how to use
record linkage process information in correcting the linkage bias of the ordinary least
squares (OLS) estimator of the regression coefficient in a standard multiple linear
regression model. Their approach involves first estimating an analytical expression
of the bias of the OLS using the record linkage process information and then ap-
plying the estimated bias correction to the OLS. Lahiri and Larsen (2005) obtained
an exact unbiased estimator of regression coefficients by deriving the expected value
of the linked response variable when linkage errors are uncorrelated with the true
response given the comparison vectors. Hof and Zwinderman (2012) followed up
on the Lahiri-Larsen approach and showed how to extend it to link multiple files
or when one-to-one matching is not desired. In primary data analysis, with the
additional information about the linkage process, researches are able to build more
sophisticated linkage error models by assuming that the linkage depends on C , X ,
10
or both (LAR).
1.2.2 Linkage Error Model: Chambers (2009)
In this part, we introduce the linkage error model proposed by Chambers
(2009) as an example of using the LCAR linkage mechanism and a training sample
in secondary data analysis.
Chambers (2009) developed a linkage error model under the following assump-
tions:
(1) The linked data is obtained by combining two files Fy and Fx. Fy and Fx
contain the observed values of y and x, respectively, for all the units of the same
population of size N , without duplicate. Hence, Fy and Fx are of the same size N .
(2) The records in files Fy and Fx are partitioned into G blocks, with Ni records
in block i, without error. So linkage errors only occur within the same block.
(3) The resulting linkage is complete (i.e., all records are linked) and one-to-one
between Fy and Fx.
Let xij denote the observed value of x for record j in block i from Fx, let
yij denote the true value of y corresponding to xij, and let y
?
ij denote the value
of y that is recorded in block i of Fy and linked to x
Ni
ij. Let y i = (yij)j=1 and
y?i = (y
?
ij)
Ni
j=1 denote the vector of true y values and the vector of linked y values in
block i corresponding to X = (xT )Nii ij j=1, respectively. Under the above assumptions,
Chambers (2009) modeled the randomness of the outcome of the linkage process via
11
the identity
y?i = T iy i, i = 1, · · · , G.
where T = (ti )Ni,Nii jj′ j=1,j′=1 is an unknown random permutation matrix of dimension
N T Ti×Ni with 1N T i = 1N and T i1N = 1N , and the T i are independently distributedi i i i
across blocks.
Following Neter et al. (1965), Chambers (2009) further assumed that:
(4) The probability of a designated link being a true match is the same within
each block.
(5) The probability of a designated non-link being a true match is the same
within each block.
Under these assumptions, Chambers proposed the following exchangeable link-
age error model:
i 1− λiP (tjj = 1|data) = λi, P (tijj′ = 1|data) = ,Ni − 1
for i = 1, · · · , G, j = 1, · · · , N ′ ′i, j = 1, · · · , Ni, and j 6= j , where λi is an unknown
block-specific parameter.
Under the above framework for statistical analysis of the linked data, Cham-
bers (2009) took account of linkage errors into the linear regression analysis of the
linked data. Inspired by Scheuren and Winkler (1993), Chambers (2009) developed
a bias-corrected ordinary least squares (OLS) estimator of the regression coefficient
by adjusting the bias of the naive OLS estimator under the exchangeable linkage
error model. Inspired by Lahiri and Larsen (2005), Chambers (2009) developed
an unbiased OLS estimator and a best linear unbiased estimator (BLUE) of the
12
regression coefficient by exploiting the regression relationship between y?i and X i
under the exchangeable linkage error model. Chambers (2009) also extended these
ideas, developed a general estimating-equations-based theory for the regression anal-
ysis using linked data by correcting the bias of the estimating functions under the
proposed exchangeable linkage error model, and applied the theory to the linear
and logistic regressions. Subsequently, Kim and Chambers (2012a, 2012b, 2013)
extended the methodology to accommodate the situation where the linked data is
produced by linking more than two files and the linkage is incomplete. Smart and
Chambers (2014) proposed a method for estimating the regression coefficient in a
nested-error linear regression model when linked data is used. Other related articles
include Chambers et al. (2009), Chipperfield and Chambers (2015), Chambers and
Kim (2016).
Remark 1: The exchangeable linkage error model is based on the assumption
that the linkage mechanism is LCAR. Chambers (2009) realized that it was probably
the simplest way to characterize the behavior of a probability-based record linkage
process, and that more sophisticated models could be formulated with additional
information. Evidently, the optimal estimators derived from the proposed estimating
equations under the exchangeable linkage error model will not be optimal under
a more complex linkage error model, such as the one proposed by Scheuren and
Winkler (1993), which allows the probability of correct linkage to vary both within
and across blocks.
Remark 2: The estimators of the regression coefficient mentioned above are
unbiased in the sense that they are unbiased when the block-specific parameters λi,
13
i = 1, · · · , G (and variance component parameters if they are involved) are known.
In practice, however, these block-specific parameters are unknown and need to be
estimated by using a clerically-reviewed training sample of the linked data for each
block. The block-specific parameters λi can be simply estimated by the sample pro-
portions of the correctly-linked record pairs in block i, i = 1, · · · , G. Therefore, the
unbiasedness and efficiency of the proposed estimators of the regression coefficient
depends on the accuracy of the estimated block-specific parameters. The estimate
of λi can be unreliable if there are not enough samples in block i, which occurs often
in the literature of small area estimation.
1.2.3 Linkage Error Model: Scheuren and Winkler (1993)
Here, we introduce the linkage error model proposed by Scheuren and Winkler
(1993) as an example of using the LAR linkage mechanism and summary information
from the record linkage process in primary data analysis.
Using assumptions (1) and (2) in Section 1.2.2, the linkage error models pro-
posed by Scheuren and Winkler (1993) can be generally rewritten as:
y?i = T iy i, i = 1, . . . , G.
where T i = (t
i Ni,Ni
jj′)j=1,j′=1 with T i1N = 1N , and T i are independent across blocks.i i
The linkage error model proposed by Scheuren and Winkler (1993) allows
the probability of being a true match to vary across record pairs. They generally
assumed that:
P (ti ijj′ = 1|data) = qjj′ ,
14
∑
where Nij′=1 q
i
jj′ = 1, i = 1, . . . , G, j = 1, . . . , Ni, and j
′ = 1, . . . , Ni.
Based on the available information about the linkage process, more specific
assumptions can be made on the linkage mechanism to simplify the estimation of
the probabilities qijj′ . As illustrated in Scheuren and Winkler (1993), one can assume
that the probability of a record pair (j, j′) in block i being a true match only depends
on its corresponding matching weight rijj′ , which can be derived from the comparison
vector cijj′ . That is, P (t
i
jj′|data) = P (ti ijj′|rjj′). In this case, the method proposed
by Belin and Rubin (1991) can be used to estimate probabilities qijj′ by fitting
a two-class Gaussian mixture model to the transformed matching weights. Note
that a clerically-reviewed training sample is also required to estimate the unknown
parameters involved in the transformation. Under the linkage error model, Scheuren
and Winkler obtained an unbiased estimator of the regression coefficient in a multiple
linear regression model by adjusting the linkage bias of the naive OLS estimator.
Following Scheuren and Winkler (1993), Lahiri and Larsen (2005) assumed
that the probability of a record being a true match depends only on its comparison
vectors cijj′ . That is, P (t
i i i
jj′ |data) = P (tjj′ |cjj′). Assuming that the comparison
vectors follow a two-class mixture model, estimation of probabilities qijj′ reduces
to the estimation of unknown parameters in the mixture model. The maximum
likelihood estimates of the mixture model parameters can be approximated by using
the Expectation-Maximization algorithm. By exploiting the relationship between
the linked y values and the true x values, Lahiri and Larsen developed an exact
unbiased estimator of the regression coefficient in a general linear model.
Remark 1: The exchangeable linkage error model proposed by Chambers
15
(2009) can be treated as a special case of the one proposed by Scheuren and Winkler
(1993), where qijj = λi and q
i
jj′ = (1 − λi)/(Ni − 1), i = 1, . . . ,m, j = 1, . . . , Ni,
j′ = 1, . . . , Ni, j
′ 6= j, which are estimated using a sample drawn from the linked
data file. In contrast, in Chapter 2 we propose a model where the probability of
correct linkage could vary both within and across blocks. Moreover, the number of
parameters is reduced by exploiting data on matching weights rijj′ or comparison
vectors cijj′ through the mixture model and hence the parameters are estimated
efficiently using data from many blocks. The difference between the two approaches
can be attributed to the fact that Chambers (2009) focused on the secondary analysis
of linked data while we focus on a method that can be directly applied to two
separated data sets to be linked.
1.3 Discussion and Overview of the Dissertation
In Chapter 1, we have presented a brief overview of the record linkage tech-
niques for data integration. We have introduced the first statistical framework for
record linkage, and the optimal decision rules used for designating record pairs into
links and non-links, as a special example. In addition, we have discussed the effects
of linkage errors on statistical analysis and emphasized the importance of taking ac-
count of linkage errors into statistical analysis. The existing methods for correcting
the linkage bias are discussed, and several examples are provided.
In Chapter 2 and Chapter 3, we provide a methodological framework for the
regression analysis using data from two different files. Especially, we are interested
16
in estimating the regression parameters related to the conditional distribution of
the response variable y given the predictors x. Rather than separating regression
analysis from record linkage as most existing methods do, we propose a general
integrated model to combine these two processes, based on the assumption that the
sample units in one file is a subset of those in the other file. We also provide a general
class of estimating equations that can produce different estimators corrected for the
linkage bias. A jackknife method is then adapted to estimate the bias, variance
and mean squared error of our estimators. Our methodology can be widely applied
to the general linear regression models, the generalized linear regression models,
and the general linear mixed models, as long as observations on y are independent
across blocks given x. To illustrate our methodology, we implement our general
methodology to two special situations where the linear and logistic regression models
are of the focus of research.
In Chapter 4 and Chapter 5, we focus our research on small area estimation
using data from two files. Specifically, we are interested in predicting an area-specific
parameter, which can be expressed as a function of fixed and mixed effects. A new
linkage error model is developed to combine the small area model with the record
linkage model. Its difference from the previously proposed linkage error model is
discussed. Under the modified general integrated model, we provide the general
methodology for obtaining the Empirical Best Prediction (EBP) estimator of the
parameter of interest and for estimating its mean squared error. To illustrate our
methodology for small area estimation, we consider the situation where the general
linear mixed model with block-diagonal covariance structure is used as the unit-level
17
small area model. The nested-error linear model is discussed as a special example.
In Chapter 6, we devise a Monte Carlo simulation study to compare differ-
ent estimators, and we investigate the performance of the standard and simplified
jackknife methods.
In Chapter 7, we offer some scope for future research.
18
Chapter 2: Regression Analysis of Data from Two Files
2.1 Introduction
In Chapter 2, we provide a methodological framework for statistical analysis
using data from two different files. Specifically, we are interested in estimating
the regression parameters related to the conditional distribution of the response
variable y given the predictors x. We propose a general integrated model that takes
account of linkage errors in the analysis of a wide range of variables—discrete and
continuous. We also provide a general class of systems of estimating equations
that can produce various estimators corrected for the linkage bias. A jackknife
method is then adapted to estimate the bias, variance and mean squared error of
our estimators. Moreover, we also introduce some simplified versions of the proposed
estimators and the standard jackknife method in order to reduce the computational
burden. Application of our methodology only requires observations of the response
variable y to be independent across blocks given predictor x. So it is not limited to
the observations related to mutually independent population units, but can be used
for observations corresponding to units that are independent across blocks, such as
residents in a county, patients in a clinic, or students in a school.
19
2.2 Problem Description and Data Availability
Let y represent a scalar random variable of interest, and let x represent a
vector-valued variable of order p. Our goal is to model the relationship between y
and x in a population U . In particular, we are interested in estimating the regression
parameters associated with the conditional distribution of y given x. However,
the joint observations on (y,x) are not available. Instead, observations on y and
observations on x are separately recorded in two files Fy and Fx, but the matching
status between any record from Fy and any record from Fx is unknown.
To be specific, Fy contains the observed values of y for a sample Sy of n units
from U , Fx contains the observed values of x for a sample Sx of N units from U ,
and there is no duplicate in either file. In this dissertation, we assume that Sy ⊂ Sx.
The data layout for files Fy and Fx is shown in Table 2.1. Here, ỹj denotes the value
of y for record j in Fy, xj′ denotes the value of x for record j
′ in Fx, and yj′ denote
value of y corresponding to x ′ , where j = 1, . . . , n, j′j = 1, . . . , N . Since yj′s exist
but are not observed in Fx. their corresponding column in Fx is shaded in gray.
The records in Fy are not aligned to those in Fx, so ỹj and yj′ may not represent
the y-values for the same population unit even if j′ = j.
Assume that there also exists a vector of K matching fields, denoted by w,
whose observations are available in both files. Let w̃j and wj′ represent the values of
w for record j in Fy and record j
′ in Fx, respectively. It is also sufficient to assume
that only the values of comparison vector c, cjj′ , are available for each record pair
(j, j′), j ∈ S ′y, j ∈ Sx.
20
Let ỹ = (ỹj)
n
j=1 denote the n×1 vector of observed y values in Fy,X = (xTj′)Nj′=1
denote the N×p matrix of observed x values in F Nx,(y =)(yj′)j′=1 denote the unknownn
N × 1 vector of y values (asso)ciated with X , W̃ = w̃
T
j denote the n×K matrixj=1
N
of w values in Fy, W = w
T
j′ ′ denote the N ×K matrix of w values in Fj =1 x, and
C denote the Nn×K matrix of comparison vectors derived from comparing W̃ and
W . In summary, our observed data are {ỹ ,X ,W̃ ,W }, or equivalently, {ỹ ,X ,C}.
Table 2.1: Data layout for observations on y and x in Fy and Fx: Fy contains
observed values of variables w and y for a sample Sy of size n, Fx contains observed
values of variables w and x for a sample Sx of size N , and Sy ⊂ Sx. Note that the
true y values corresponding to x (shaded in gray) exist but are not observed in Fx.
Label wT y Label wT xT y
1 w̃T ỹ T T1 1 1 w1 x1 y1
· · · · · · · · · · · · · · · · · · · · ·
Fy Fj w̃T xj ỹj j
′ wT Tj′ xj′ yj′
· · · · · · · · · · · · · · · · · · · · ·
n w̃T ỹ N wTn n N x
T
N yN
2.3 General Integrated Model for Regression Analysis
In this section, we propose a general integrated model to propagate the un-
certainty of the linkage process in the later estimation step under the assumption
of data availability described in Section 2.2. The general integrated model involves
three important components: a regression model, a linkage error model and a mix-
ture model. The regression model is used to characterize the relationship between
the response variable y and the predictor x, the linkage error model is used to
characterize the randomness of the linkage process, and the mixture model on com-
21
parison vectors is used to estimate the probability of a record pair being a match
given the observed data and designate all record pairs into links and non-links. In
the following part, we introduce each component one by one.
2.3.1 Regression Model
Assume that values of (y,x) for units in the population U follow a general
regression model and the model holds for all sampled units in Sx. To illustrate the
methodology, we assume that
E(y|X ) = µ(X ;β), V ar(y|X ) = V (X ;β,τ ). (2.1)
Here, β is a p× 1 vector of unknown coefficient parameters, τ is an h× 1 vector of
other unknown variance components, µ(X ;β) = (µj′(X ;β))
N
j′=1 is an N × 1 vector,
and V (X ;β,τ ) = (v (X ;β,τ ))N,Nj′t′ j′=1,t′=1 is an N ×N matrix, where µj(·) and vj′t′(·)
are known functions. Three simple examples are given below:
Example 1: For the linear regression model y|X ∼ N(Xβ, σ2eIN) where IN is
an identity matrix of dimension N×N , we have µ(X ;β) = Xβ , V (X ;β,τ ) = σ2eIN ,
and τ = σ2e .
Example 2: For the logistic regression model where yj(′s are) independ(ent an)d
identically distributed with P (yj′ = 1|xj′) = g(xT Tj′β) = exp xj′β /[1 + exp xTj′β ],
we have the mean com[ponent µj′(]X ;β) = g(x
T
j′β) and the covariance component
vj′t′(X ;β,τ ) = g(x
T T ′ ′
j′β) 1− g(xj′β) for j = t and vj′t′(X ;β,τ ) = 0 for j′ 6= t′.
Example 3: For a nested-error linear model y i = X iβ + vi1N + ei wherei
iid 2 indv 2i ∼ N(0, σv), ei ∼ N(0N , σeIN ), vi is independent of ei, 1N is an N × 1 vectori i i i
22
of ones, 0N is a matrix of zeros of dimension Ni × Ni, IN is an identity matrix ofi i
d∑imension Ni × Ni, and Ni is the number of units in group i, i = 1, . . . , G, withG
i=1Ni = N , we have µ(X ;β) and V (X ;β,τ ) are block-diagonal with the ith block
values µi(X ;β) = X iβ , V i(X ;β,τ ) = σ
21 T 2 2 2 Tv N 1N + σi i eIN and τ = (σi v , σe) .
2.3.2 Linkage Error Model
As discussed in Chapter 1, most of the existing linkage error models are built
directly on the linked data, based on the assumptions that (1) the linked data
is obtained by linking two files of the same size that contain observations on all
population units of U , and (2) the resulting linkage is complete for Fx, i.e., each
record in Fx has a designated link selected from Fy. In reality, however, the two
files used for analysis usually come from different sources, such as survey samples
and administrative records, and thus their coverage of the population is different.
Typically, the units covered by Fy are a subset of those covered by Fx, as described
in our dissertation. Also, only the designated links obtained from the linkage process
are contained in the linked data file. In practice, the decision rules for most record
linkage techniques rely on the specified threshold values. Seldom, the choice of the
threshold values can lead to a complete linkage for Fx. In this way, information
about the linkage error carried by the designated non-links is ignored.
Here, we develop a new linkage error model that allows different file sizes as
long as Sy ⊂ Sx. The model is built directly on data from the original files by
exploiting the relationship between the observed y values in Fy and the unobserved
23
y values corresponding to the observed x in Fx. The linkage error model we proposed
here is the key to the general integrated model, serving as a connection between the
regression model introduced in Section 2.3.1 and the record linkage model, which
will be described in Section 2.3.3.
Under the assumption that there is no duplicate in each file and that Sy ⊂ Sx,
a specific observed y value in file Fy, say ỹj, and one of these unobserved y values
from the set {yj′ : j′ = 1, . . . , N} must be related to the same population unit. Let
ljj′ be the unknown binary matching status indicator for record pair (j, j
′), such
that ljj′ = 1 if record j in Fy and record j
′ in Fx represent the same population
unit, and l ′jj′ = 0 otherwise, j ∈ Sy, j ∈ Sx. Then the relationship between ỹj and
{yj′ : j′ = 1, . . . , N} can be modeled via the following identity:
∑N
ỹj = ljj′yj′ , j = 1, . . . , n. (2.2)
j′=1
Let L = (l )n,Njj′ j=1,j′=1. Then the above model (2.2) can also be written in the
following matrix form:
ỹ = Ly (2.3)
In other words, ỹ, the observed y values for all sampled units in Sy, is a n-permutation
of y, the unobserved y values for all sampled units in Sx.
In this dissertation, we assume that the linkage mechanism is at random
(LAR). That is, the conditional probability of L given y, X , and C could depend
on X and C but not on y, i.e., P (L|y,X,C ) = P (L|X,C ). Here, specifically, we
assume that the probability of a record pair being a true match only depends on its
24
comparison vector. That is,
P (L|y,X,C ) = P (L|C ).
2.3.3 Mixture Model
Following Larsen and Rubin (2001), we assume that the comparison vectors fol-
low a two-class mixture model. Jaro (1989), Winkler (1993, 1994, 1995), Thibaudeau
(1993), and Armstrong and Mayda (1993) used mixture models in record linkage
problems. The two-class mixture model on comparison vectors is motivated by the
idea that patterns of agreement and disagreement on matching fields would have
different distributions among matches M = {(j, j′) : ljj′ = 1, j ∈ Sy, j′ ∈ Sx} and
mismatches M c = {(j, j′) : l ′jj′ = 0, j ∈ Sy, j ∈ Sx}. The comparison vectors cjj′ are
assumed to be independent and identically distributed with the following probability
mass function:
P (cjj′) = πP (cjj′ |ljj′ = 1) + (1− π)P (cjj′|ljj′ = 0),
where π = P (ljj′ = 1) represents the probability of a record pair being a match,
P (cjj′|ljj′ = 1) and P (cjj′|ljj′ = 0) are the probabilities of observing cjj′ among
matches M and among mismatches M c, respectively.
Under the conditional independence assumption as shown in (1.2), the above
mixture model is simplified into:
∏K c ∏K′
P (c ) = π m jj k(1−m )(1−c c ′′ jj′k)jj k k + (1− π) u
jj k
k (1− u
(1−cjj′k)
k) . (2.4)
k=1 k=1
where cjj′k is the kth element of cjj′ , mk = P (cjj′k = 1|ljj′ = 1) and uk = P (cjj′k =
1|ljj′ = 0) are the probabilities of a record pair agreeing on matching fields k among
25
matches and mismatches, respectively. Let ψ = (π,m1, . . . ,mK , u1, . . . , uK)
T denote
the vector of unknown parameters in the mixture model.
By Bayes’ Rule, the conditional probability of a record pair being a match
given the observed comparison vector i∏s given by:
π K
c
m jj
′k(1−m )(1−cjj′k)
P (l k=1
k
jj′ = 1|cjj′ ;ψ) = ∏ kK c ′ ∏π m jj k(1−m )(1−cjj′k) c ′k=1 k k + (1− π) K u jj k(1− u )(1−cjj′k)k=1 k k
:= qjj′ .
Note that qjj′ = qjj′(cjj′ ;ψ) is a function of cjj′ and ψ. Let Q(ψ) ≡ Q(C ;ψ) =
(q )n,Njj′ j=1,j′=1. Then we have
E(L|y,X,C ) = E(L|C ) = Q(ψ).
The maximum likelihood estimator of ψ can be obtained using the expecta-
tion maximization (EM) (Dempster, Laird, and Rubin 1977) and the expectation
conditional maximization (ECM) (Meng and Rubin 1993) algorithms.
2.3.4 Designation of Links and non-Links
As mentioned before, it is not necessary to produce a linked file in the middle
of the record linkage process for the purpose of parameter estimation in our case.
If the primary goal is to generate a linked data set for secondary data users, the
estimated probabilities can be used to partition the record pairs into designated links
and non-links and to estimate error rates. The decision rule is similar to Fellegi and
Sunter’s method. A record pair (j, j′) is declared as a link if the probability qjj′ is
above a pre-specified upper threshold. Other than qjj′ , one can also consider use
one of the following as a matching weight:
26
∏ c| K jj′ (1−cP (c ′ l k jj′k)
(1) likelihood ratio: R (ψ) = jj jj
′=1)
′ = k=1
mk (1−mk)
jj P (c cjj′ |ljj′=0) ∏K ′u jj k (1−c(1−u ) jj′k)k=1 k k
P (l ′=1|c ′ ) q ′
(2) posterior likelihood ratio: r (ψ) = jj jj jjjj′ = .P (ljj′=0|cjj′ ) 1−qjj′
In our dissertation, based on the assumption that Sy ⊂ Sx, it is reasonable
to assume that the record linkage is complete for Fy; that is, each record in Fy
has a linked record from Fx. Thus, we designate record pairs as links and non-
links based on the following decision rule: for any record j in Fy, a record j
′ in Fx
is selected to be its link if its corresponding probability qjj′ is the largest among
{qjt : t = 1, . . . , N}, j = 1, . . . , n. By using this decision rule, we can generate a
linked dataset, which contains data (ỹ , X̃ ?). Here, X̃ ? = (x̃?T nj )j=1 denote the n× p
matrix of x values which are selected from Fx and linked to ỹ, and x̃
?
j denotes the
selected x value from Fx that is linked to ỹj in Fy, j = 1, . . . , n. Therefore, X̃
? in
the linked data set is an n-permutation of X in Fx. That is,
X̃ ? = AX
where A = (ajj′)
n,N
j=1,j′=1 is an n × N permutation matrix with A1N = 1n. Here,
ajj′ = 1 if qjj′ is the largest among {qjt : t = 1, . . . , N}, ajj′ = 0 otherwise. Note
that A is derived from probabilities qjj′ , which are functions of comparison vectors
cjj′ and mixture model parameters ψ. Hence, when ψ is known and C is observed,
A is fixed. Therefore, X̃ ? is fixed given X , C and ψ.
2.4 Estimation of Regression Coefficients
Assuming that y is conditionally independent of C given X , the conditional
mean and variance of ỹ given X and C can be derived under the general integrated
27
model. That is,
E(ỹ|X,C ) = Q(C ;ψ)µ(X ;β), V ar(ỹ|X,C ) = Σ(X,C ;ψ,β, τ ). (2.5)
where Σ ≡ Σ(X,C ;ψ,β, τ ) = (σjt)n,nj=1,t=1 with diagonal entries σjj and off-diagonal
entries σjt (j 6= t) equal to
∑N ∑N ∑N ( )
σjj = σjj(X,C ;ψ,β, τ ) = q
2
jj′qjt′vj′t′ + qjj′(1− qjj′) vj′j′ + µj′ ,
j∑′=1∑t′=1 j′=1N N
σjt = σjt(X,C ;ψ,β, τ ) = qjj′qtt′vj′t′ , (t 6= j).
j′=1 t′=1
The detailed proof of (2.5) is shown in Section 2.7.1.
When ψ is known, merely based on (2.5), we can estimate β by solving the
following class of system of p unbiased estimating equations:
β̂ : f(β,τ ,ψ) = H (β,τ ,ψ) [ỹ −Q(ψ)µ(β)] = 0p, (2.6)
where H (β,τ ,ψ) ≡H (C,X ;β,τ ,ψ) is a given p× n matrix which does not depend
on ỹ, and 0p is a p × 1 vector of zeros. The possible choices for matrix H (β,τ ,ψ)
includes but are not limited to X̃T ,XTQT , andXTQTΣ−1. The choices ofH (β,τ ,ψ)
for the linear and logistic regressions will be discussed in the next chapter.
When ∂f(β,τ ,ψ) exits, an application of Taylor series expansion yields
∂β
∂f(β,τ ,ψ) ( )
f(β̂ , τ ,ψ) ≈ f(β,τ ,ψ) + β̂ − β .
∂β
Based on the fact that f(β̂ , τ ,ψ) = 0p, E [f(β,τ ,ψ)] = 0p, and
V ar (f(β,τ ,ψ)|X,C ) = H (β,τ ,ψ)Σ (ψ,β, τ )HT (β,τ ,ψ),
28
we can obtain that
[ ]
| ≈ ∂f(β,τ ,ψ)
−1 [ ]
E(β̂ X,C) [ ] E f(β̂ , τ ,ψ)− f(β,τ ,ψ) + β =(β[, ] )(2.7)∂β T
| ≈ ∂f(β,τ ,ψ)
−1 ∂f(β,τ ,ψ) −1
V ar(β̂ X,C) H (β,τ ,ψ)Σ (ψ,β,τ )HT (β,τ ,ψ) ,
∂β ∂β
when the matrix ∂f(β,τ ,ψ)∂β is invertible at the true value of β . The detailed proof of (2.7)
is given in Section 2.7.2. It implies that the resulting estimator β̂ from solving (2.6) is
(approximately) unbiased for β when other parameters are unknown.
When the estimating equations in (2.6) are used, the resulting estimator β̂ may
depend on the unknown variance component τ if the selected matrix H (β,τ ,ψ) depends
on τ . In that case, methods for estimating τ need to be considered. When additional
assumptions about the regression model are made, such as normality, other unbiased
estimating functions f(β,τ ,ψ) can be derived to estimate β and τ simultaneously when ψ
is known. For example, the maximum likelihood estimator of β and τ can be obtained by
using the first partial derivatives of the log-likelihood function as the estimating functions.
An example is given for the linear regression case in the next chapter.
In order to simplify the methodology, one may replaceQ in the estimating equations
f(β,τ ,ψ) by QM (or QM2), which is a simplified version of Q with all entries in each row
set to zeros except the largest one (or two). For known ψ, let β̂F (ψ) denote an estimator
of β obtained as a solution to (2.6) for a given choice of H (β,τ ,ψ). The corresponding
estimator of β when Q is replaced by QM (or QM2) in (2.6) is denoted by β̂M (ψ) (or
β̂M2(ψ)). When ψ is unknown, one can use β̂F (ψ̂) to estimate β by replacing ψ with
one of its consistent estimators ψ̂. The corresponding estimator of β when Q is replaced
by QM (or QM2) in (2.6) is denoted by β̂M (ψ̂) (or β̂M2(ψ̂)). In this dissertation, the
maximum likelihood estimator of ψ is used as ψ̂, and it can also be treated as a solution of
29
a system of estimating equations. An estimator of the variance of β̂(ψ̂) can be obtained by
plugging in the estimates β̂ , τ̂ and ψ̂ in the variance formula from (2.7). But we do realize
that this plug-in variance estimator would underestimate V ar(β̂(ψ̂)|X,C) since it does
not take account of the variability of τ̂ and ψ̂. In the next section, a resampling method of
estimating V ar(β̂(ψ̂)|X,C) is given for the case where the measurements are uncorrelated
across blocks, and we leave the variance estimation for the correlated-across-blocks case
for future research.
Now, we consider the situation where blocking is used during the record linkage
process. The records in Fy and Fx can be partitioned into G blocks based on some basic
characteristics, such as zip code, first letter of last name, or first three digits of phone
numbers. Let ni and Ni be ∑the number of sa∑mple units in block i within Sy and Sx,
respectively, i = 1, . . . , G. So G Gi=1 ni = n and i=1Ni = N . Let ỹij denote the value of
y for record j in block i from Fy, xij′ denote the value of x for record j
′ in block i within Fx,
and yij′ denote its corresponding y value, i = 1, . . . , G, j = 1, . . . , ni, j
′ = 1, . . . , Ni. We
denote the vector of values ỹij′ and yij′ within block i by ỹ i = (ỹ
ni
ij)j=1 and y i = (yij′)
Ni
j′=1,
respectively. Similarly, we denote the matrix of x values in block i by X = (xT )Nii ij′ j′=1.
Then the regression model shown in (2.1) can be rewritten as:
E(y i|X i) = µi(X i;β), V ar(y i|X i) = V i(X i;β,τ ), (2.8)
for i = 1, . . . , G. Here, we assume that y i are independent across blocks given X .
We assume that there is zero probability that one record from Fy and another record
from Fx represent the same population unit if they are from different blocks. Therefore,
only records within the same blocks need to be compared, and linkage errors can only
occur within blocks. Let lijj′ denote the matching status of record pair (j, j
′) in block i,
and Li = (l
i )ni,Nijj′ j=1,j′=1 denote its corresponding matrix. The linkage error model in (2.3)
30
can be simplified into:
ỹ i = Liy i, i = 1, . . . , G. (2.9)
In other words, L = diag(L1, . . . ,LG).
Let cijj′ be value of the comparison vector c derived from values of matching fields
w̃ ′ij and wij′ for record pair (j, j ) in block i. The two-class mixture model in (2.4) can be
re-written as
∏K Kci ′ (1−ci ∏ i) c ′ (1−ci )
P (ci ) = π m jj kjj′ k (1−m ) jj
′k
k + (1− π) u jj kk (1− uk) jj
′k . (2.10)
k=1 k=1
Let qijj′ = P (l
i i i ni,Ni
jj′ = 1|cjj′) and Qi = (qjj′)j=1,j′=1, then E(Li|C i) = Qi.
Therefore, in case of blocking, when y i’s are independent across blocks, the estimat-
ing equations for β (and τ ) given known ψ can be generally written are
∑G
fi(β,τ ,ψ) = 0t, (2.11)
i=1
where t is equal to p for estimating β or (p + h) for estimating β and τ . In particular,
fi(β,τ ,ψ) = H i(β,τ ,ψ) [ỹ i −Qi(ψ)µi(β)] for (2.6).
2.5 Variance Estimation
As mentioned above, when the mixture model parameter ψ is known, estimate of β
(and τ ) can be obtained by solving the following system of estimating equations. That is,
∑G
fi(β,τ ,ψ) = 0t, (2.12)
i=1
In order to estimate the bias, variance, and mean squared error of an estimate β̂(ψ), the
unified jackknife theory proposed by Jiang, Lahiri and Wan (2002), henceforth referred to
as JLW, can be used. Jackknife replicate i is obtained by deleting data from block i in
31
both files Fx and Fy, (i = 1, . . . , G). The delete-i estimates of β , β̂−i(ψ), and the delete-i
estimate of τ , τ−i(ψ), are the solutions of
∑G
β̂−i(ψ), τ−i(ψ) : fi′(β,τ ,ψ) = 0t, (2.13)
i′=6 i
for i = 1, . . . , G. The jackknife estimate of bias, variance and mean squared error of β̂ ,
when ψ is known, are then given by
( ) ( )
¯̂
biasJ (β̂(ψ)) = (G− 1) β(ψ)− β̂(ψ) ,− ∑GG 1 ( )( )− ¯̂ ¯̂ TvarJ β̂(ψ) = β̂−i(ψ) β(ψ) β̂−i(ψ)− β(ψ) ,
( ) G ∑i=1− GG 1 ( )( )T
mseJ β̂(ψ) = β̂−i(ψ)− β̂(ψ) β̂−i(ψ)− β̂(ψ) .
G
∑ i=1¯̂
where β(ψ) = 1 GG i=1 β̂−i(ψ) is the average of the replicate estimates of β . The bias,
variance and mean squared error of τ̂ (ψ) can also be estimated similarly.
In practice, however, the mixture model parameter ψ is unknown. The maximum
likelihood estimate (MLE) of ψ, say ψ̂, can be obtained using the EM algorithm. The
MLE ψ̂ can also be treated as the solution of a system of estimating equations derived from
the log-likelihood function based on the distribution of comparison vectors cijj′ . In order
to account for uncertainty of ψ̂, ψ should be replaced by ψ̂ and ψ̂−i in (2.12) and (2.13),
respectively, where ψ̂−i is the delete-i estimate of ψ by removing values of comparison
vectors in block i, i = 1, . . . , G. Then the bias, variance, and mean squared error of β̂(ψ̂)
can be estimated. The properties of β̂(ψ̂) are expected to be similar to those of β̂(ψ) if ψ̂
is assumed to be independent of the response variable y; that is, the distribution of the
matching variables (e.g., last name, phone number) is assumed to be independent of the
response variable y (e.g., income) and hence of ỹ. This is true in many applications. The
bias, variance and mean squared error of any smooth function of β can be proposed in a
32
straightforward way. For large G, under regularity conditions, asymptotic properties of
β̂(ψ̂) and the jackknife estimators proposed in this section can be obtained from the unified
theory on jackknife given in Jiang et al. (2002). To reduce the computational burden,
a simplified jackknife method can be used by replacing the delete-i estimate ψ̂−i by its
full sample estimate ψ̂. Our simulation results show that the accuracy of the variance
estimate would not be jeopardized much even though the uncertainty of ψ̂ is ignored.
2.6 Summary
In chapter 2, we introduce a general methodology for regression analysis when data
values are from two different files. Rather than separating regression analysis from record
linkage as most existing methods do, we connect the regression model and the record
linkage model through our proposed new linkage error model. The general integrated
model can be implemented when the sample units in one file are a subset of those in
the other file. The y values observed in Fy are related to the x values observed in Fx
through the integrated model, and standard statistical analysis methods can be applied
for parameter estimation. For the purpose of parameter estimation, there is no need to
generate a linked file in the middle of the process. Information about linkage errors carried
by all record pairs (links and non-links) can all be passed into the estimation process and
used to correct for linkage bias. This is where our model is different from the secondary
data analysis where only the designated links are considered.
Essentially,, parameter estimation starts with deriving the conditional distribution
of the observed y values in file Fy given the observed x values in Fx and comparison
vectors. Based on their relationship, estimators can be obtained by solving a system of
estimating equations. In case of blocking, if data values are independent across blocks, the
33
jackknife resampling method proposed by Jiang, Lahiri, and Wan (2005) can then be used
to estimate the bias, variance, and mean squared errors of the estimators, taking account
of both estimation errors and linkage errors. In the following chapter, we will give two
specific examples to illustrate our methodology.
2.7 Proofs
2.7.1 Proof of (2.5)
Based on the assumption that P (L|ỹ ,X,C) = P (L|C), it can be proved from the
mixture model that ljj′ are conditionally independent given C with
P (ljj′ = 1|y,X,C) = P (ljj′ = 1|C) = P (ljj′ = 1|cjj′) = qjj′ .
Therefore, for j = 1, . . . , n, j′ = 1, . . . , N , t = 1,. . . , n, t
′ = 1, . . . , N , we have
[ ] q ′ ′jj′ if j = t and j = t
E(ljj′ |y,X,C) = qjj′ , E ljj′ ltt′ |y,X,C =  . (2.14)qjj′qtt′ otherwise
Let Q = (qjj′)
n,N
j=1,j′=1, then
E(L|y,X,C) = Q (2.15)
Now, we consider the first-order and second-order conditional expectation of ỹ given
y, X and C . Combined result (2.15) with the linkage error model ỹ = Ly, we can get
E [ỹ|y,X,C ] = E [Ly|y,X,C ] = E [L|y,X,C ]y = Qy.
∑
Since ỹ = Nj j′=1 ljj′yj′ under the linkage error model, the (j, t)th entry of the matrix ỹỹ
T
is equal to ∑ (∑ )N N
(ỹỹT )jt =  ljj′y ′j ltt′yt′ , j = 1, . . . , n, t = 1, . . . , n.
j′=1 t′=1
34
By applying the results in (2.14), we can calculate the (j, j) diagonal entries and (j, t)
off-diagonal entries of E[ỹỹT |y,X,C ].T(hat is, for j)= 1, . . . ,n and t = 1, . . . , n,∑N  ∑NE[ỹỹT |y,X,C ]jj = E ljj′yj′ ljt′yt′ |y,X,C
∑∑j′=1 [ t
′=1
N N ]
= yj′yt′E ljj′ ljt′ |y,X,C
j∑′=1 t′=1N ∑ [ ] ∑N [ ]
= yj′yt′E ljj′ ljt′ |y,X,C + yj′yj′E ljj′ ljj′ |y,X,C
j∑′=1 t∑′=6 j′ ∑ j′=1N N
= yj′yt′q
2
jj′qjt′ + yj′qjj′
j∑′=1 t∑′ 6=j′N N ∑j′=1N
= y ′y ′q 2j t jj′qjt′ + yj′qjj′(1− qjj′), (2.16)
j′=1 t′=1 j′=1
∑ (∑ ) N NE[ỹỹT |y,X,C ]jt = E ljj′yj′ l tt′yt′ |y,X,C
∑N ∑j′=1 ′N [ t =1 ]
= yj′yt′E ljj′ ltt′ |y,X,C
j∑′=1 tN ∑′=1N
= yj′yt′qjj′qtt′ , (j =6 t).
j′=1 t′=1
Assuming that the response variable y is conditionally independent of comparison
vector c given x, we can derive the first-order and second-order expectation of y given x
and c from the regression model (2.1). That is,
E(yj′ |X,C) = E(yj′ |X ) = µ ′ ,[ ] [ ] j
E yj′yt′ |X,C = E yj′yt′ |X = vj′t′ + µj′µt′ (2.17)
for j′ = 1, . . . , N , t′ = 1, . . . , N .
Based on results (2.16) and (2.17), the first-order and second-order of ỹ given X
35
and C can be derived by applying law of total expectations. That is,
E[ỹ|X,C ] = E (E[ỹ|y,X,C ]|X,C) = E (Qy|X,C) = QE (y|X,C) = Qu,
( )
E[ỹỹT |X,C ]jj = EE[ỹỹT |y,X,C ]j,j |X,C∑N ∑N ∑ N= E yj′yt′qjj′q 2jt′ + yj′qjj′(1− qjj′)|X,C
∑ j′=1 t′=1 j′=1N ∑N ( ) ∑N ( )
= qjj′qjt′E yj′yt′ |X,C + qjj′(1− qjj′)E y2j′ |X,C
j∑′=1 t′=1N ∑N [ ] ∑j′=1N [ ]
= qjj′qjt′ vj′t′ + µj′µt′ + qjj′(1− qjj′) vj′j′ + µ2j′ ,
j′=(1 t′=1 ) j′=1
E[ỹỹT |X,C ]jt = EE[ỹỹT |y,X,C ]j,t|X,C∑N ∑ N= E yj′yt′q jj′qtt′ |X,C
∑∑j′=1 t′=1N N ( )
= qjj′qtt′E yj′yt′ |X,C
j∑′=1 t∑′=1N N ( )
= qjj′qtt′ vj′t′ + µj′µt′ , (t 6= j).
j′=1 t′=1
for j = 1, . . . , n, t = 1, . . . , n.
By applying the identity V ar(ỹ|X,C) = E[ỹỹT |X,C ] − E[ỹ|X,C ]E[ỹ|X,C ]T , the
diagonal and off-diagonal entries of V ar(ỹ|X,C) are given by:
( )
V ar(ỹ|X,C) T Tjj = E[ỹỹ |X,C ] − E[ỹ|X,C ]E[ỹ|X,C ]∑N ∑ jj [ ] ∑ jjN N [ ]
= qjj′q
2
jt′ vj′t′ + µj′µt′ + qjj′(1− qjj′) vj′j′ + µj′
∑j′=1N ∑t′=1 j′=1N
− qjj′qjt′µj′µt′
j∑′=1 t∑′=1N N ∑N [ ]
= qjj′qjt′vj′t′ + qjj′(1− qjj′) vj′j′ + µ2j′ := σjj ,
j′=1 t′=1 j′=1
36
( )
V ar(ỹ|X,C) Tjt = E[ỹỹ |X,C ]jt − E[ỹ|X,C ]E[ỹ|X,C ]T∑N ∑N ( ) ∑∑ jtN N
= qjj′qtt′ vj′t′ + µj′µt′ − qjj′qtt′µj′µt′
j∑′=1 tN ∑′=1 j′=1 t′=1N
= qjj′qtt′vj′t′ := σjt, (t =6 j),
j′=1 t′=1
for j = 1, . . . , n and j = 1, . . . , n.
Let Σ ≡ Σ(X,C ;ψ,β,τ ) = (σ n,njt)j=1,t=1, then the conditional mean and variance of
ỹ given X and C can be written in the following matrix form:
E(ỹ|X,C) = Q(C ;ψ)u(X ;β), V ar(ỹ|X,C) = Σ(X,C ;ψ,β,τ ).
2.7.2 Proof of (2.7)
By using Talor expansion, the estimating function can be approximated by
≈ ∂f(β,τ ,ψ)
( )
f(β̂ , τ ,ψ) f(β,τ ,ψ) + β̂ − β .
∂β
Thus, when the matrix ∂f(β,τ ,ψ)∂β is invertible at the true value of β , we can have[ ]
− ≈ ∂f(β,τ ,ψ)
−1 [ ]
β̂ β f(β̂ , τ ,ψ)− f(β,τ ,ψ) .
∂β
Based on the fact that E [ỹ|X,C ] = Q(ψ)µ(β) and V ar (ỹ|X,C) = Σ (ψ,β,τ ), we
37
can get
E [f(β,τ ,ψ)|X,C ] = E [H (β,τ ,ψ) (ỹ −Q(ψ)µ(β)) |X,C ]
= H (β,τ ,ψ) (E [ỹ|X,C ]−Q(ψ)µ(β))
= 0p,
V ar (f(β,τ ,ψ)|X,C) = V ar (H (β,τ ,ψ) [ỹ −Q(ψ)µ(β)] |X,C)
= H (β,τ ,ψ)V ar (ỹ|X,C)HT (β,τ ,ψ)
= H (β,τ ,ψ)Σ (ψ,β,τ )HT (β,τ ,ψ).
Combing the(above results with the fact f(β̂ , τ ,ψ) = 0p, we can get[ ]
∂f(β,τ ,ψ) −1 [ ] )
E(β̂ |X,C) ≈ E f(β̂ , τ ,ψ)− f(β,τ ,ψ) + β |X,C
[ ∂β ]
∂f(β,τ ,ψ) −1
= [0p − E (f(β,τ ,ψ|X,C))] + β
∂β
= β,
V ar(β̂ |X,C) = V[ ar(β̂ − β |X],C) ( )
∂f(β,τ ,ψ) −1 ( ) [ ] T≈ − ∂f(β,τ ,ψ) −1V ar f(β̂ , τ ,ψ) f(β(, τ ,ψ)|X,C[ ∂β ] ) ∂β
∂f(β,τ ,ψ) −1
[ ] T
∂f(β,τ ,ψ) −1
= V ar (f(β,τ ,ψ)|X,C)
[ ∂β ] ∂β ([ ] )T
∂f(β,τ ,ψ) −1 ∂f(β,τ ,ψ) −1
= H (β,τ ,ψ)Σ (ψ,β,τ )HT (β,τ ,ψ) .
∂β ∂β
38
Chapter 3: Applications to Linear and Logistic Regression
To illustrate our general methodology for regression analysis using data from two files
as described in Section 2.2, we consider two special situations where regression parameters
in linear and logistic models are of interest. Here, we use the same notation as Chapter 2.
3.1 Linear Regression using Data from Two Files
Assume that the values of y and x for all sampled units in block i of Sx satisfy the
following model:
E(y i|X i) = X iβ, V ar(y i|X i) = σ2eIni , i = 1, . . . , G, (3.1)
where σ2e is an unknown constant parameter. Note that values yij in block i are uncorre-
lated and have the same variance σ2e (homoscedasticity).
When data (y i,X i) is available for each block, the Ordinary Least Squares (OLS)
estimator is the best linear unbia(sed estimat)or (B(LUE), and∑ −1 ∑ )
it is given by:
G G
β̂ = XT Ti X i X i y i (3.2)
i=1 i=1
However, the above estimator cannot be used to estimate β in our case since the y i’s are
not observed.
If a linked data file is generated based on the decision rule described in 2.3.4 during
the record linkage process and data (ỹ i, X̃
?
i ) is available, one may simply assume the linkage
39
is perfect, replace X ?i and y i in (3.2) by X̃ i and ỹ i, and obtain a naive OLS estimator
β̂N (ψ). That is, (∑ ) ( )G −1 ∑G
β̂ (ψ) = X̃ ?TX̃ ? X̃ ?TN i i i ỹ i , (3.3)
i=1 i=1
which can be treated as the solution of the following system of estimating equations:
∑G ( )
β̂N (ψ) : X̃
?T ỹ − X̃ ?i i iβ = 0p.
i=1
In Section 3.3.2, we prove that the mean and variance of β̂N (ψ) under the general inte-
grated model are
( ) (∑ )G −1(∑ )G
E β̂N (ψ)|X,C,ψ = X̃ ?Ti X̃ ?i X̃ ?Ti QiX i β, (3.4)
( ) [∑i=1 ] i=1G −1 [∑ ][ ]G ∑G −1
V ar β̂N (ψ)|X,C,ψ = X̃ ?TX̃ ? ?T ? ?Ti i X̃ i ΣiX̃ i X̃ i X̃ ?i .
i=1 i=1 i=1
Based on the result, we can see that β̂N (ψ) is an biased estimator of β .
In order to correct the linkage bias and obtain a more robust estimator of β , we
exploit the relationship between ỹ i and X i, C i under the general integrated model. By
using the linear model (3.1) as the first component of the general integrated model, the
conditional mean and variance of ỹ i given X i and C i can be derived under the assumption
that y i is independent of C i given X i . That is, for i = 1, . . . , G,
E(ỹ i|X i,C i) = QiX iβ, V ar(ỹi |X i,C i) = Σi. (3.5)
Here, Σ = (σi )ni,nii jt j=1,t=1 is the ni × ni variance-covariance matrix with diagonal element
σijj and off-diagonal element σ
i
jt (j 6= t) equal to
∑Ni ∑Ni ∑Ni
σi = qi σ2 + qi (1− qi )(xT β)2, σi = qi qi 2jj jj′ e jj′ jj′ ij′ jt jj′ tj′σe .
j′=1 j′=1 j′=1
The detailed proof for (3.5) is given in Section 3.3.1. It is consistent with the general
result shown in (2.5). Note that Σi depends on β , σ
2
e and ψ. When compared to the
40
distribution of y i given X i in (3.1), ỹ i also follows a linear regression model of β , but the
design matrix changes fromX i toQiX i and the variance matrix changes from σ
2
eIni to Σi.
The unequal diagonal entries and the non-zero off-diagonal entries of Σi imply that values
ỹij in Fy have different variances and are correlated within blocks (heteroscedasticity).
When ψ is known, several different estimators of β can be developed based on the
relationship between ỹ and X , C . These estimators include the bias-corrected estimator
β̂C(ψ), the ordinary least squares estimator β̂OLS(ψ), the weighted least squares estimator
β̂ 2WLS(ψ, σe), and the maximum likelihood estimator β̂MLE(ψ).
The development of the bias-corrected estimator β̂C(ψ) starts with investigating the
bias of the naive OLS estimator β̂N (ψ) conditional on values X and C . By using the fact
that E(ỹ i|X i,C i) = QiX iβ , we prove(in Section 3.3.2 that( ) ∑ )−1(∑ )G G
E β̂N (ψ)|X,C = X̃ ?Ti X̃ ? X̃ ?Ti i QiX i β
∑ i=1 i=1
If the matrix G X̃ ?Ti=1 i QiX i is invertible, an unbiased linear estimator β̂C(ψ) of β can
be obtained by adjusting the bia(s of β̂N (ψ). Th∑ )
at is(, )
G −1 ∑G
β̂ (ψ) = X̃ ?TQ X X̃ ?TC i i i i ỹ i ,
i=1 i=1
which can be treated as the solution of the following system of estimating equations:
∑G
β̂C(ψ) : X̃
?T
i (ỹ i −QiX iβ) = 0p.
i=1
In Section 3.3.2, we prove that the mean and variance of β̂C(ψ) are equal to:
E(β̂C(ψ)|X,C) = β[,∑ ]−1 [∑ ][∑ ]
(3.6)
G G G −1
V ar(β̂C(ψ)|X,C) = X̃ ?Ti QiX X̃ ?Ti i ΣiX̃ ?i XTi QTi X̃ ?i .
i=1 i=1 i=1
Moreover, by directly utilizing this linear regression relationship between ỹ i and X i,
C i as shown in (3.5), we can also use the ordinary least squares method to obtain an linear
41
unbiased estimator β̂OLS(ψ) of β . That is,
∑G
β̂OLS(ψ) = arg min (ỹ i −QiX iβ)T (ỹ −Q X β)( i i iβ∑ i=1 )G −1(∑ )G
= XTi Q
TQ X XTQTi i i i i ỹ i ,
i=1 i=1
which can be treated as the solution of the following system of estimating equations:
∑G
β̂ (ψ) : XTQTOLS i i (ỹ i −QiX iβ) = 0p.
i=1
The mean and variance of β̂OLS(ψ) are equal to
E(β̂OLS(ψ)|X,C) = β[,∑ ]
(3.7)
G −1 [∑ ][ ]G ∑G −1
V ar(β̂OLS(ψ)|X,C) = XT T T T T Ti Qi QiX i X i Qi ΣiQiX i X i Qi QiX i .
i=1 i=1 i=1
We do realize that the OLS estimator β̂OLS(ψ) is not the best linear unbiased
estimator of β since ỹij are correlated within blocks under the general integrated model.
One may consider to use the weighted least squares method to obtain the best linear
unbiased estimator of β by using the inverse of the variance-covariance matrix Σi as the
weight matrix. That is,
∑G
β̂ 2WLS(ψ, σe) = arg min (ỹ i −QiX iβ)TΣ−1i (ỹ i −QiX iβ) (3.8)( β∑ i=1 ) ( )G −1 ∑G
=6 XTQTi i Σ−1i QiX i X
T T −1
i Qi Σi ỹ i (3.9)
i=1 i=1
To obtain β̂WLS(ψ, σ
2
e), we take partial derivatives of the sum of weighted squares
with respect to βk (k = 1, . . . , p), and set the partial derivatives to zeros. Then the
weighted least squares (WLS) estimator β̂ 2WLS(ψ, σe) of β is obtained by solving the
following set of estim{ating equations:∑G }
β̂WLS(ψ, σ
2
e) : 2δ
T T T T −1 −1
kX i Qi + (ỹ i −QiX iβ) Σi Di,k Σi (ỹ i −QiX iβ) = 0, (3.10)
i=1
42
for k = 1, . . . , p. The detailed proof is given in Section 3.3.3. Here, δ = ∂βk ∂β is the kthk
column of the identity matrix I of dimension p× p, and D = ∂Σip i,k ∂β is an ni × ni matrixk
∂σi
with diagonal entries jj∂β andk
∂σi ∑Nijj ∂σijt
= 2 qi ijj′(1− qjj′)(xTij′β)xij′kδk, = 0,∂βk
j′
∂βk
=1
We can see that the WLS estimator β̂WLS(ψ, σ
2
e) obtained from optimizing (3.8) is
not the best linear unbiased estimator that we expect as shown in (3.9). This is mainly
because the variance-covariance matrix Σi is not free of β . For the same reason, the
resulting β̂ 2WLS(ψ, σe) may not possess the nice properties of the weighted least squares
estimator of β obtained in the case where the variance-covariance matrix of the linear
regression model is free of β . For example, (1) there is no close-form expressions for
β̂WLS(ψ, σ
2
e), and β̂WLS(ψ, σ
2 2
e) is not a linear estimator; (2) β̂WLS(ψ, σe) is not identical
to the MLE estimator, which can be seen by comparing their estimating equations. In
addition, β̂ (ψ, σ2) depends on parameter σ2WLS e e , which is usually unknown and need to
be estimated. Otherwise, β̂WLS(ψ, σ
2
e) cannot be evaluated even if ψ is known.
Under the assumption of normality, ỹ i|X i,C i ∼ N(QiX iβ,Σi), we can also derive
the maximum likelihood estimator (MLE) of β and σ2e simultaneously when ψ is known.
The log-likelihood function of β and σ2e based on data {ỹ i,X i,C i, i = 1, . . . , G} is given
by:
∑m1 { − }l(β , σ2e) = − ni ln(2π) + ln |Σi |+ (ỹ i −QiX iβ)TΣ 1i (ỹ i −QiX iβ) .2
i=1
The MLE estimators β̂MLE(ψ) and σ̂
2
e(ψ) can be treated as solutions of the following
43
system {of estimating equations:∑G ( − ) [ ] }tr Σ 1i Di,k − 2δTXTk i QTi + (ỹ i −QiX iβ)TΣ−1 −1i Di,k Σi (ỹ i −QiX iβ) = 0
∑i=1G { (
tr Σ−
) }
1
i Di,σ − (ỹ i −QiX iβ)
TΣ−1i Di,σΣ
−1
i (ỹ i −QiX iβ) = 0 (3.11)
i=1
(for k )= 1, . . . , p. The detailed proof of (3.11) is in Section 3.3.4. Here, D = ∂Σii,σ =∂σ2e
i ni,ni∂σjt
2 denotes the partial derivative of Σi with respect to σ
2
∂σ e
with
e
j=1,t=1
∂σi ∑Ni ∂σi ∑Nijj jt
= qijj′ , = q
i qi
2 2 jj′ tj′
, (t =6 j).
∂σe ′ ∂σj =1 e j′=1
Remark 1: Here, based on the linear relationship between ỹ and X , C under the
general integrated model, we derive four different estimators for the regression coefficient
β in a multivariate linear regression model: β̂C(ψ), β̂OLS(ψ), β̂
2
WLS(ψ, σe), and β̂MLE(ψ).
Note that all these four estimators depend on ψ. When ψ is unknown, one can estimate β
by substituting ψ with its maximum likelihood estimate ψ̂, which can be obtained by the
expectation-maximization algorithm. Besides ψ, β̂ (ψ, σ2) also depends on σ2WLS e e . One
may consider to use the linked data (ỹ , X̃ ) to obtain an estimate of σ2e . An estimator of
σ2e under the exchangeable linkage error model is given by Chambers (2009).
Remark 2: In order to estimate the variances of estimators β̂N (ψ̂), β̂C(ψ̂), β̂OLS(ψ̂),
β̂ 2WLS(ψ̂, σ̂e), and β̂MLE(ψ̂), we can simply replacing the unknown parameters β , σ
2
e , and
ψ with β̂ , σ̂2e , and ψ̂ in the formula of their corresponding theoretical variances of β̂N (ψ),
β̂C(ψ), β̂OLS(ψ), β̂
2
WLS(ψ, σe), and β̂MLE(ψ). The expression for the theoretical variances
of β̂N (ψ), β̂C(ψ) and β̂OLS(ψ) are given in (3.4), (3.6), and (3.7), respectively. However,
the variability of σ̂2e , and ψ̂ would be ignored in this way. Since these estimators of β
and the maximum likelihood estimator of ψ can all be treated as solutions to a system
of estimating equations, the jackknife method proposed by Jiang, Lahiri, and Wan (2005)
can then be used to estimate their bias, variance, and mean squared error.
44
Remark 3: As shown above, there is no closed-from expressions for the WLS
estimator β̂WLS and the MLE estimator β̂MLE . So numerical algorithms, such as Newton-
Raphson method and Fisher scoring algorithm, are needed to find the solutions to the
estimating equations. Initial values are required for these numerical algorithms. The
naive OLS estimate β̂N (ψ̂), bias-corrected estimate β̂C(ψ̂), and OLS estimate β̂OLS(ψ̂)
can be chosen as the initial values.
3.2 Logistic Regression using Data from Two Files
Assuming there is no sampling bias, the logistic regression model also holds for all
samples units in Sx. That is, yij′ are indepen(dent )and identically distributed with
exp xT ′β
P (y ′ = 1|xT
ij
ij ij′ ;β) = g(x
T ′
ij′β) = ( ) , i = 1, . . . , G, j = 1, . . . , Ni.
( ) 1 + exp x
T
ij′β
Ni
Here, let g(X iβ) = g(x
T
ij′β) denote the Ni × 1 vector of means.
j′=1
When the joint observations (yij′ ,xij′) are available, we can estimate β by using the
maximum likelihood method. The MLE estimate β̂ of β is the solution of the following
estimating equations:
∑G
β̂ : XTi (y i − g(X iβ)) = 0p.
i=1
When data values are from two different files and a linked data set with values
(ỹ i, X̃
?
i ) (i = 1, . . . , G) is produced by any record linkage process, we can simply ignore
the linkage errors in the linked file and obtain a naive MLE estimator β̂N (ψ) of β :
∑G ( )
β̂N (ψ) : X̃
?T ỹ − g(X̃ ?i i iβ) = 0p. (3.12)
i=1
However, the existence of linkage errors can lead to significant bias of the estimators,
this is probably because they weaken the relationship between y and x. Similarly to the
45
linear regression case, we can correct the linkage bias by utilizing the relationship between
ỹ and X , C under the general integrated model.
Applying the fact that
E(yij′ |xT Tij′) = g(xij′β),[ ]
V ar(yij′ |xTij′) = g(xTij′β) 1− g(xTij′β) ,
Cov(yij′ , yit′ |X i) = 0,
for i = 1, . . . , N , j′ = 1, . . . , N , t′ = 1, . . . , N , t′i i i 6= j′, we follow the steps in Section
3.3.1 and obtain the conditional mean and variance of ỹ i given X i, C i under the general
integrated model:
E(ỹ i|X i,C i) = Qig (X iβ) , V ar(ỹi |X i,C i) = Σi. (3.13)
Here, Σ = (σi ni,nii jt)j=1,t=1 is the ni×ni variance-covariance matrix depending on parameters
β and ψ with diagonal element σi and off-diagonal element σijj jt (j 6= t) equal to:
∑Ni [ ] ∑N
σi i T i T i i i T Tjj = qjj′g(xij′β) 1− qjj′g(xij′β) , σjt = qjj′qtj′g(xij′β)[1− g(xij′β)].
j′=1 j′=1
Note that under the general integrated model, ỹ i does not follow a generalized linear model
anymore given X i and C i, and the ỹij are correlated within each blocks.
Based on the fact that E(ỹ i|X i,C i) = Qig (X iβ), we can obtain a set of unbiased
estimating equations by adjusting the bias of the estimating function used for the naive
MLE estima[tor, as shown in (3.12). Noting]that∑G ( ) ∑G ( )
E X̃ ?T ỹ − g(X̃ ?i i iβ) |X i,C ?T ?i = X̃ i Qig(X iβ)− g(X̃ iβ) ,
i=1 i=1
the unbiased estimating equations for the bias-corrected estimator β̂C(β) are given by:
∑G
β̂C(β) : X̃
?T
i (ỹ i −Qig(X iβ)) = 0p. (3.14)
i=1
46
More generally, when ψ is known, we can focus on the following system of p unbiased
estimating equations: ∑G
H i [ỹ i −Qig(X iβ)] = 0p, (3.15)
i=1
where H i ≡H i(C i,X i;β,ψ) is a given p× n matrix. The possible choices for H i includes
X̃ ?Ti , X̃
?T T
i Qi , and X̃
?TQTΣ−1i i i .
3.3 Proofs
3.3.1 Proof of (3.5)
∑
Under the linkage error model ỹ i = Liy i, we have ỹ =
Ni
ij j′=1 l
i
jj′yij′ . Assum-
ing the linkage is at random (LAR), it is true that P (li ijj′ = 1|y i,X i,C i) = P (ljj′ =
1|C i i i i ii) = P (ljj′ = 1|cjj′) = qjj′ under the mixture model. Thus, E(ljj′ |y i,X i,C i) = qjj′ ,
Cov(li , li |y ,X ,C ) = qi (1− qi ) if j = t and j′ ′ i ijj′ tt′ i i i jj′ jj′ = t , and Cov(ljj′ , ltt′ |y i,X i,C i) = 0
otherwise. Here, i = 1, . . . , G, j = 1, . . . , n , j′i = 1, . . . , Ni, t = 1, . . . , ni, t
′ = 1, . . . , Ni.
By using these facts, we can get: ∑ Ni
E(ỹij |y i,X i,C i) = E lijj′yij′ |y i,X i,C i
∑ j′=1Ni
= E(lijj′ |y i,X i,C i)yij′
j∑′=1Ni
= qijj′yij′ ,
j∑′=1Ni ∑Ni ( )
Cov (ỹij , ỹit|y i,X i,C i ii) = yij′yit′Cov ljj′ , ltt′ |y i,X i,C i
j′=1 t′=1
= 0,
47
V ar(ỹij |y i,X i,C i) = Cov (ỹij , ỹij |y i,X i,C i)∑ Ni ∑Ni= Cov lijj′yij′ , lijt′yit′ |y i,X i,C i
∑ j′=1 t′=1Ni ∑Ni ( )
= yij′y
i i
it′Cov ljj′ , ljt′ |y i,X i,C i
j∑′=1 t′=1Ni ( )
= yij′yij′Cov l
i i
jj′ , ljj′ |y i,X i,C i
∑j′=1Ni ∑ ( )
+ yij′yit′Cov l
i , lijj′ jt′ |y i,X i,C i
j∑′=1 t′=6 j′Ni
= y2 i iij′qjj′(1− qjj′),
j′=1
Based on the linear regression model (3.1) and the assumption that the response
variable y is conditionally independent of comparison vector c given x, we have
E(y Tij′ |X i,C i) = E(yij′ |X i) = xij′β,
Cov(y 2ij′ , yij′ |X i,C i) = Cov(yij′ , yij′ |X i) = σe ,
Cov(yij′ , yit′ |X i,C i) = Cov(yij′ , yij′ |X i) = 0, (j′ 6= t′)
for i = 1, . . . , G, j′ = 1, . . . , Ni, t
′ = 1, . . . , Ni. By applying the law of total expectation,
the law of total variance, and the law of total covariance, we can get
E(ỹij |X i,C i) = E[E(ỹij |y i,X∑ i
,C i)|X i,C i]
N
= qijj′E[yij′ |X i]
j∑′=1N
= qi xTjj′ ij′β,
j′=1
48
V ar(ỹij |X i,C i) = E [V ar(ỹij |y i,X i,C i)|X i,C i] +V ar (E[ỹij |y i,X i,C i]|X i,C i)∑Ni ∑Ni ∑ Ni= E y2 qiij′ jj′(1− qi ′)|X i,C ijj + Cov qi i jj′yij′ , qjt′yit′ |X i,C i
∑ j′=1 j′=1 t′=1Ni [ ] ∑Ni ∑Ni ( )
= qijj′(1− qi 2jj′)E yij′ |X i ii,C i + qjj′qjt′Cov yij′ , yit′ |X i,C i
j∑′=1 ( j
′=1 t′=1
Ni )
= qi i 2jj′(1− qjj′) V ar(yij′ |X i) + (E[yij′ |X i])
∑j′=1Ni ( ) ∑Ni ∑ ( )
+ qi qijj′ jj′Cov yij′ , yij′ |X i ii + qjj′qjt′Cov yij′ , yit′ |X i
j∑′=1 j′=1 t′ 6=j′Ni ∑Ni ∑Ni
= qijj′(1− qi 2 i i T 2 i 2 2jj′)σe + qjj′(1− qjj′)(xij′β) + (qjj′) σe + 0
j∑′=1 ∑ j′=1 j′=1Ni Ni
= qi σ2 + qijj′ e jj′(1− qi T 2 ijj′)(xij′β) := σjj ,
j′=1 j′=1
Cov (ỹij , ỹit|X i,C i) = E [Cov (ỹij , ỹit|y i,X i,C i) |X i,C i]
+ Cov (E[ỹij |y i,X i,C i], E[ỹit|y i,X i,C i]|X i,C i)∑ Ni ∑Ni= 0 + Cov qijj′yij′ , qi tt′yit′ |X i,C i
∑∑ j′=1 t′=1Ni Ni ( )
= qi ijj′qtt′Cov yij′ , yit′ |X i,C i
j∑′=1 t′=1Ni ( ) ∑Ni ∑ ( )
= qi qijj′ tj′Cov y
i i
ij′ , yij′ |X i + qjj′qtt′Cov yij′ , yit′ |X i
j∑′=1 j′=1 t′=6 j′N
= qi i 2 ijj′qtj′σe := σjt, (t =6 j).
j′=1
Let Σi = (σ
i
jt)
ni,ni
j=1,t=1, then the above result can be written in the following matrix form:
E(ỹ i|X i,C i) = QiX iβ, V ar(ỹ i|X i,C i) = Σi, i = 1, . . . , G.
49
3.3.2 Proof of (3.4), (3.6), and (3.7)
In case of blocking, the x values in the linked data set are related to the x values in
Fx through the following identity:
X̃ ?i = AiX i, i = 1, . . . , G.
where X̃ ?i is the ni × p matrix of x values linked to ỹ i in Fy, X i is the Ni × p matrix of
x values in Fx, and Ai = (a
i ni,Ni
jj′)j=1,j′=1 is the ni ×Ni matrix of linkage status indicators,
where ai ijj′ = 1 if qjj′ is the largest among probabilities {qijt′ : t′ = 1, . . . , Ni}, and aijj′ = 0
otherwise.
Note that Ai is derived from Qi, which is a function of C i and ψ. Thus, Ai is fixed
when C i and ψ are known, and X̃
?
i is fixed whenX i, C i and ψ are known. Also recall that
E[ỹ i|X i,C i] = QiX iβ , V ar(ỹ i|X i,C i) = Σi under the general integrated model. Based
on these facts, when ψ is known, the(mean and va)rianc(e of β̂N (ψ) are given by:( ) ∑ ) G −1 ∑G
E β̂N (ψ)|X,C = E  X̃ ?Ti X̃ ?i X̃ ?T i ỹ i |X,C
(∑ i=1 ) (∑ i=1−1 )G G
= X̃ ?TX̃ ? X̃ ?T
( i i i
E[ỹ i|X,C ]
∑i=1 )−1(G ∑i=1 )G
= X̃ ?Ti X̃
?
i X̃
?T
i E[ỹ |X( i i
,C i]
∑i=1 ) i=1G −1(∑ )G
= X̃ ?T ? ?T
( i
X̃ i X̃ Q X β
∑i=1 ) (
i i i
∑i=1−1 )G G
= X̃ ?Ti X̃
?
i X̃
?T
i QiX i β,
i=1 i=1
50
( ) [∑ ]−1 [G ∑ ] G
V ar β̂ (ψ)|X,C = V ar X̃ ?T ? ?TN i X̃ i X̃ i ỹ i |X,C
[∑ i=1 ] (∑i=1 )[ ]G −1 G ∑G −1
= X̃ ?TX̃ ?i i V ar X̃
?T
i ỹ i|X,C X̃ ?T ?[ ] [ ][ i
X̃ i
∑i=1 ∑ i=1 i=1 ]G −1 G ∑G −1
= X̃ ?TX̃ ? X̃ ?TV ar (ỹ |X,C)X̃ ? X̃ ?T ?
[ i i] i i i i
X̃ i
∑i=1 i=1 i=1G −1 [∑ ][ ]G ∑G −1
= X̃ ?TX̃ ?i i X̃
?T
i ΣiX̃
? X̃ ?T ?i i X̃ i .
i=1 i=1 i=1
Similarly, the mean and variance of the bias-corrected estimator β̂C(ψ) are given
by: [∑ ]G −1 [∑ ] GE(β̂ (ψ)|X,C) = E X̃ ?TQ ?TC i iX i X̃ ỹ i |X,Ci
[∑ i=1 ]G −1∑ i=1G
= X̃ ?TQ X X̃ ?Ti i i i E (ỹ i|X,C)[∑i=1 ]G −1∑i=1G
= X̃ ?TQ ?Ti iX i X̃ i QiXβ
i=1 i=1
= β, [∑ ]−1 [∑ ] G G
V ar(β̂ (ψ)|X,C) = V ar X̃ ?TQ X X̃ ?TC i i i i ỹ i |X,C
[∑ i=1 ] i=1−1 (∑ )[∑ ]G G G −1
= X̃ ?Ti QiX i V ar X̃
?T
i ỹ i|X,C XTQT ?[ ] [ i i
X̃ i
∑i=1 ∑ i=1 i=1−1 ][ ]G G ∑G −1
= X̃ ?TQ X ?Ti i i X̃ i V ar (ỹ i|X,C)X̃ ? XTQTX̃ ?[ ] i i i i∑i=1 i=1 i=1G −1 [∑ ][ ]G ∑G −1
= X̃ ?Ti Q
?T
iX i X̃ i ΣiX̃
?
i X
T
i Q
T
i X̃
?
i .
i=1 i=1 i=1
51
Similarly, the mean and variance of the OLS estimator β̂OLS(ψ) are given by:
E(β̂OLS
[
(ψ)|X,C)
∑ ]−1 [∑ ] G G
= E XTQT T T i i QiX i X i Qi ỹ i |X,C
[∑ i=1 ] ∑ i=1G −1 G
= XTQTQ T Ti i iX i X i Qi E (ỹ i|X,C)[∑i=1 ] i=1G −1∑G
= XTQTQ X XTQTi i i i i i QiX iβ
i=1 i=1
= β,
V ar(β̂OLS[(ψ)|X,C) ∑ ]− [∑ ] G 1 G= V ar XTQTi i Q T TiX i X Q i i ỹ i |X,C
[∑ i=1 ] ( i=1 )[ ]G −1 ∑G ∑G −1
= XTQTi i QiX i V ar X
TQT T T
[ ] [ i i
ỹ i|X,C X][i
Qi QiX i
∑i=1 i=1 i=1−1 ∑ ∑ ]G G G −1
= XT T T Ti Qi QiX i X i Qi V ar (ỹ i|X,C)Q X XTQTi i i i QiX[ ] [ ][ ] i∑i=1 ∑i=1 ∑ i=1G −1 G G −1
= XT Ti Qi QiX X
T
i i Q
T
i Σ Q
T T
i iX i X i Qi QiX i .
i=1 i=1 i=1
3.3.3 Proof of (3.10)
Recall that the diagonal entries σijj and off-diagonal entries σ
i
jt of the variance-
covariance matrix Σi are equal to:
∑Ni ∑Ni ∑Ni
σi = qi σ2 + qi (1− qi )(xT 2 i i i 2jj jj′ e jj′ jj′ ij′β) , σjt = qjj′qtj′σe , (t =6 j).
j′=1 j′=1 j′=1
By taking the first partial derivative with respect to βk, we can get:
∂σi ∑Ni ∑Ni ijj ∂σ
= 2 qi (1− qi ∂β jt′ ′)(xT i i T′β)xij′k = 2 q ′(1− q ′)(x ′β)xij′kδk, = 0,
∂β jj jj ij ∂β jj jj ijk ′ k ∂βkj =1 j′=1
52
for k = 1, . . . , p, where δk is the kth column of the identity matrix I p of dimension p× p.
∂σi
Let D = ∂Σi = ( jt )ni,nii,k ∂β ∂β j=1,t=1 denote the first partial derivative of Σi with respect to βk.k k
When σ2e andψ are known, the weighted least squares (WLS) estimator β̂
2
WLS(σe ,ψ)
is defined to be the value of β that minimizes the weighted sum of squares (WSS) with Σi
as its weight matrix. That is,
∑G
β̂WLS = arg min (ỹ i −QiX iβ)TΣ−1 2i (ỹ i −QiX iβ) := arg min fwss(β, σe ,ψ).
β β
i=1
Equivalently, β̂WLS can be obtained by solving the following estimating equations:( )p
∂fwss(β, σ
2
e ,ψ) ∂fwss(β, σ
2
e ,ψ)= = 0p,
∂β ∂βk k=1
where
∂f 2wss(β, σe ,ψ)
∑∂{βkG
− ∂β
T
T T ∂Σi= X i Qi Σ
−1
i (ỹ i −Q
T −1 −1
iX iβ)− (ỹ i −QiX iβ) Σi Σ (ỹ i −QiX iβ)∂β ∂β i
i=1 k } k
− ∂β
T
(ỹ i −Q{iX β)
TΣ−1i i QiX i
∑ ∂βkG }
= − 2δTXT T −1k i Qi Σi (ỹ i −QiX iβ) + (ỹ i −QiX iβ)
TΣ−1i Di,kΣ
−1
i (ỹ i −QiX iβ)
∑i=1G { }
= − 2δTkXTi QTi + (ỹ i −QiX iβ)TΣ−1i Di,k Σ
−1
i (ỹ i −QiX iβ).
i=1
3.3.4 Proof of (3.11)
Under the assumption of normality, ỹ i|X i,C i ∼ N(QiX iβ,Σi). When ψ is known,
the log-likelihood function of β and σ2e based on data {ỹ i,X i,C i, i = 1, . . . , G} is given
by:
1 ∑G { }
l(β , σ2e) = − ni ln(2π) + ln |Σ T −1i|+ (ỹ i −QiX iβ) Σi (ỹ i −QiX iβ) .2
i=1
53
( )
i ni,n
( )
i ni,ni
∂Σ ∂σ
i
Let D i jt ∂Σ
∂σ
i
i,k = ∂β = ∂β and Di,σ = 2 =
jt
2 denote the partial
k k ∂σe ∂σe
j=1,t=1 j=1,t=1
derivatives of Σi with respect to βk and σ
2
e , respectively, where
∂σi ∑Nijj ∂σi
= 2 qi i T
jt
∂β jj
′(1− qjj′)(xij′β)xij′kδk, = 0, (j 6= t)
k ∂βk
j′=1
∂σi ∑Nijj ∂σi ∑Nii jt= q i ijj′ , = qjj′qtj′ , (j 6= t)∂σ2e ∂σ2j′=1 e j′=1
for i = 1, . . . , G, j = 1, . . . , ni, t = 1, . . . , ni, and k = 1, . . . , p.
The first derivatives of the log-likelihood function l(β, σ2e) with respect to βk and σ
2
e
are:
G {
∂l(β, σ2) 1 ∑ Te − ∂ ln |Σi|= − ∂β XTQTΣ−1i i i (ỹ i −QiX iβ)∂βk 2 ∂βi=1 k ∂βk }
T
− − T −1∂Σi(ỹ Q −1 ∂βi i{X iβ() Σi )Σi (ỹ i −QiX iβ)− (ỹ −Q X β)
TΣ−1i i i i QiX i
∑ ∂βk ∂βkG1 ∂Σi
= − tr Σ−1i − 2δ
TXTQTΣ−1(ỹ i −QiX iβ)
2 ∂β k i i i
i=1 k }
− (ỹ −Q X β)TΣ−1i i{ i i D
−1
i,kΣi (ỹ i −QiX iβ)
∑G1 ( )
= − tr Σ−1D T T T −1i,k − 2δ X Q Σ (ỹ i −QiX iβ)
2 i k i i i
i=1 }
− (ỹ i −QiX Tiβ) Σ−1D −1i i,kΣi (ỹ i −QiX iβ) ,
{ }
∂l(β , σ2 ∑Ge) −1 ∂ ln |Σ −1i| ∂Σ= + (ỹ i −Q T i2 2 iX iβ) (ỹ i −QiX iβ)∂σe 2 { (∂σe ) ∂σ2∑i=1 eG }1 ∂Σi ∂Σ−1
= − tr Σ−1i − (ỹ i −Q
T −1 i −1
2 i
X iβ) Σ
2 ∂σ i
Σ
∂σ2 i
(ỹ i −QiX iβ)
1 ∑i=1G { (
e) e }
= − tr Σ−1i Di,σ − (ỹ i −QiX iβ)
TΣ−1D −1i i,σΣi (ỹ i −QiX iβ) .2
i=1
54
3.3.5 Proof of (3.13)
∑
Under the linkage error model ỹ i = Liy i, we have ỹ =
Ni i
ij j′=1 ljj′yij′ . Assum-
ing the linkage is at random (LAR), it is true that P (lijj′ = 1|y i,X ii,C i) = P (ljj′ =
1|C ) = P (lii jj′ = 1|ci i ijj′) = qjj′ under the mixture model. Thus, E(ljj′ |y i,X i,C i) = qijj′ ,
V ar(lijj′ |y i,X i,C i) = qijj′(1 − qijj′), and Cov(li i ′ ′jj′ , ltt′ |y i,X i,C i) = 0 if j 6= t or j 6= t .
Here, i = 1, . . . , G, j = 1, . . . , ni, j
′ = 1, . . . , Ni, t = 1, . . . , ni, t
′ = 1, . . . , Ni. By using
these facts, we can get:
∑Ni
E(ỹij |y i,X ii,C i) = qjj′yij′ ,
j∑′=1Ni
V ar(ỹij |y i,X 2 i ii,C i) = yij′qjj′(1− qjj′),
j′=1
Cov (ỹij , ỹit|y i,X i,C i) = 0.
Based on the logistic regression model and the assumption that the response variable
y is conditionally independent of comparison vector c given x, we have E(yij′ |X i,C i) =
g(xTij′β), V ar(yij′ |X i,C i) = g(xTij′β)[1 − g(xTij′β)], and Cov(yij′ , yit′ |X i,C i) = 0 for i =
1, . . . , G, j′ = 1, . . . , Ni, t
′ = 1, . . . , Ni, and t
′ 6= j′. By applying law of total expectation,
law of total variance, and law of total covariance, we can get the following result for
i = 1, . . . , G, j = 1, . . . , ni, t = 1, . . . , ni and t 6= j:
∑N
E(ỹij |X i,C i) = qi g(xTjj′ ij′β),
j∑′=1Ni ∑Ni
V ar(ỹij |X i,C i) = qijj′g(xTij′β)[1− g(xTij′β)] + qijj′(1− qi 2 Tjj′)g (xij′β),
j∑′=1 j′=1Ni
= qi T i Tjj′g(xij′β)[1− qjj′g(xij′β)],
j∑′=1N
Cov (ỹij , ỹ
i
it|X i,C i) = qjj′qitj′g(xTij′β)[1− g(xTij′β)], (t =6 j).
j′=1
55
Let Σ = (σ2i)ni,nii jt j=1,t=1, then the above result can be written in the following matrix form:
E(ỹ|X i,C i) = QiX iβ, V ar(ỹ|X i,C i) = Σi, i = 1, . . . , G.
56
Chapter 4: Small Area Estimation with Data from Two Files
4.1 Introduction
In this chapter, we focus our research on a specific area: small area estimation.
Specifically, we are interested in predicting an area-specific parameter, which can be ex-
pressed as a function of fixed effects and random effects related to the conditional dis-
tribution of the variable of interest given the auxiliary variables. We provide a general
methodology for small area estimation using data from two files. Similar to the regres-
sion analysis using data from two files, we propose a general integrated model for small
area estimation, where a new linkage error model is developed to connect the unit-level
small area model and the record linkage model. The empirical best prediction estimation
of the area-specific parameter is considered under the general integrated model. To es-
timate its mean squared error, two jackknife methods are provided. Application of the
general methodology is not limited to the mutual independence of measurements. It can
be applied to measurements that are correlated within small areas but independent across
small areas. Unit-level models such as the general linear model with correlated sampling
errors within small areas, the general linear mixed model with nested errors can all be
considered.
57
4.2 Small Area Estimation
Small area estimation refers to the methodology and techniques used to improve
estimation precision for sub-populations (small areas, domains), where the sample sizes
for some sub-populations are not large enough to provide a reliable direct estimate (such
as a sample mean) with adequate precision.
Small area estimation is widely used to provide small area estimates to support
policy making, regional planning, and fund allocation at government agencies. For exam-
ple, the Small Area Income and Poverty Estimates (SAIPE) program established by the
U.S Census Bureau provides estimates of median household income and poverty rate of
school-aged children at state and county levels. The government utilizes these small area
estimates to allocate federal funds to states and domains within each state.
The basic idea of small area estimation is to increase effective sample size by bor-
rowing strength from values of the variable of interest from other related areas. Models
are used to link related small areas through the use of auxiliary information related to the
variable of interest, such as census and administrative records. The small area models can
be generally classified into two types: (1) area-level models that relate direct estimates to
area-specific and/or area-specific auxiliary variables, and (2) unit-level models that relate
the unit values of the study variable to unit-specific auxiliary variables. In the dissertation,
unit-level models are of the focus of the research.
Most of the existing unit-level-model-based small area estimation methods rely on
the joint observations on the variable of interest y and the auxiliary variable x. The
corresponding data layout for each small area is shown in Table 4.1. This type of data
can be obtained in a couple of ways: (1) A sample of y is obtained by sampling from a
58
population frame with known x; (2) Two separate files (one containing the observations
of y and the other containing the observations of x) are linked without any errors. For
example, a survey data set can be perfectly linked to an administrative data set if there
exists a unique and error-free identifier. Under this data layout, a huge literature on small
area estimation is available. We refer reader to Rao and Molina (2015).
However, the auxiliary information may not be recorded in the same file as the
variable of interest, but is available in an administrative data set. Data integration could
be a potential approach to cut down costs in data collection by preventing the need to
collect new survey data with all necessary information. In this chapter, we provide a
general methodology for small area estimation in the case where observations on y and x
are recorded in two different files, and the matching status between records is unknown.
Table 4.1: Data layout for the joint observations on (y,x): Values of y are available
for a sample of ni units in small area i, values of x are available for a finite population
of size Ni in small area i, and the values are aligned, i = 1, . . . ,m.
Label y xT
1 yi1 x
T
i1
· · · · · · · · ·
ni y
T
in xi ini
· · · · · ·
N Ti xiNi
4.3 General Integrated Model for Small Area Estimation
4.3.1 Problem Description and Data Availability
Suppose that the population of interest U can be partitioned into m subpop-
ulations (small areas) Ui, i = 1, · · · ,m. Let y and x denote a scalar variable of
59
interest and a vector random variable of dimension p, respectively. Our goal is to
estimate an area-specific parameter θi, which can be expressed as a function of fixed
effects β and random effects v related to the conditional distribution of y given x,
say θi = h(β,v), where h(·) is known. However, the joint observations on (y,x)
are not available. Instead, observations on y and observations on x are separately
recorded in two files Fy and Fx, respectively, and the matching status between any
record from Fy and any record from Fx is unknown.
To be specific, Fy (Fx) contains the observed values of y (x) for a sample Sy
(Sx) of size n (N) selected from U . There is no duplicate in either file and Sy ⊂ Sx.
We assume that the records in both files can be partitioned into small areas without
error. Therefore, there is zero probability that two records (one from Fy and the
other from Fy) represent the same unit if they are in different small areas. The
data layout for files Fy and Fx is shown in Table 4.2. Let ỹij denote the observed
value of y for record j in small area i from Fy, let x̃ij denote the unobserved value
of x corresponding to ỹij, and let xij′ represent the observed value of x for record
j′ in small area i from Fx. Since x̃ij exists but is not observed in Fy, hence its
corresponding column in Fy is shaded in gray. Note that the records in Fy are not
aligned to those in Fx, so ỹij and yij′ may not represent the y values for the same
population unit even if j′ = j. Other than y and x, there also exists a vector of K
matching fields, denoted by w, whose observations are available in both files. Let
w̃ij and wij′ represent values of matching fields w for record j and record j
′ in small
area i from Fx and Fy, respectively. It is also sufficient to assume that only the
value of comparison vector cijj′ is available for each record pair (j, j
′) in small area
60
i. Here, i = 1, · · · ,m, j = 1, · · · , ni, j′ = 1, · · · , Ni.
Let ỹ i denote the ni × 1 vector of observed y values in small area i from Fy,
let X̃ i denote the ni × p matrix of unobserved x values corresponding to ỹ i, let X i
denote the Ni× p matrix of observed x values in small area i from Fx, let y i denote
the unobserved Ni × 1 vector of y values associated with X i, let W̃ i denote the
ni×K matrix of w values in small area i from Fy, let W i denote the Ni×K matrix
of w values in small area i from Fx, and C i denote Nini ×K matrix of comparison
vectors derived from comparing W̃ i and W i. In summary, our observed data are
{ỹ i,X i, W̃ i,W i : i = 1, · · · ,m}, or equivalently, {ỹ i,X i,C i : i = 1, · · · ,m}.
Table 4.2: Data layout for observations on y and x for each small area in Fy and
Fx: Fy contains observed values of variables w and y for a sample Sy of ni units in
small area i, Fx contains observed values of variables w and x for a sample Sx of Ni
units in small area i, and Sy ⊂ Sx. Note that the true x values corresponding to y
exists but are not observed in Fy (shaded in gray).
Label wT y xT Label wT xT
1 w̃T ỹ T T Ti1 i1 x̃i1 1 wi1 xi1
· · · · · · · · · · · · · · · · · · · · ·
Fy Fj w̃T ỹ x̃T xij ij ij j
′ wTij′ x
T
ij′
· · · · · · · · · · · · · · · · · · · · ·
n T T T Ti w̃in ỹin x̃i i in Ni i wiN xi iNi
4.3.2 General Integrated Model for Small Are Estimation
The general integrated model for small area estimation contains three compo-
nents, similar to that for the regression analysis. The record linkage model (mixture
model) stays the same, but the regression model has been replaced by a unit-level
small area model and a new linkage error model is used.
61
Unit-Level Small Area Model: In our dissertation, we focus our research
on those unit-level small area models for which the values of y are assumed to
be independent across small areas. Suppose that values of (y,x) for units in the
population U follow a unit-level small area model, which can be generally written
in the following form:
y ij = g(xij, v i, eij;φ), i = 1, · · · ,m, j ∈ Ui, (4.1)
where g(·) is a known function, v i is a vector of area-specific random effects, eij
is a sampling error, and φ is a vector of unknown parameters. Note that yij =
g(xij, eij;φ) is a special case of the above general small area model without random
effects. Two examples of the unit-level small area models are given below:
Example 1: Linear Regression Model with Common Regression Coefficients:
y = x′ij ijβ + eij, i = 1, · · · ,m, j ∈ Ui,
iid
where β is a vector of unknown regression coefficients, and eij ∼ N(0, σ2e). In this
case, φ = (β, σ2e)
T . W∑hen the population size N
p
i of small area i is large, the small
area mean ȳ = 1i p j∈U yij can be trea∑ted as a linear function of β under theNi i
model; that is, ȳ ≈ x̄Ti i β , where x̄ 1i = Np j∈U xi ij is the population mean of x ini
small area i.
Example 2: Nested Error Linear Regression Model:
yij = x
T
ijβ + v i + eij, i = 1, · · · ,m, j ∈ Ui, (4.2)
where v i is a vector of area-specific random effects and eij is the error term, which
takes account of any unexplained variation not taken care of by the other terms of
62
the above mixed model. It is often assumed that vi’s and eij’s are independently
iid iid
distributed with v 2 2i ∼ N(0, σv) and eij ∼ N(0, σe). Here, φ = (βT , σ2, σ2v e). When
Npi is large, ȳi can be treated as a linear function of fixed effects β and random effect
vi under the model; that is, ȳi ≈ x̄Ti β + vi. Battese et al. (1998) implemented the
model to estimate areas covered by corn (or soybeans) for m = 12 counties in north
central Iowa using the farm-interview data as y and the LANDSAT satellite data
(the number of pixels classified as corn and soybeans) as x. The standard method
of estimating small area mean ȳi is described in Section 5.3.1 for the situation where
joint observations (yij,xij) is available for all sampled units and the population mean
for each small area x̄i is known.
Here, we further assume that the model holds for the sampled units in Sy. The
assumption is satisfied if the sample design is non-informative, that is, the sample
selection probabilities do not depend on values of y but may depends on values of
x. Then we have
ỹ ij = g(x̃ij, v i, eij;φ), i = 1, · · · ,m, j = 1, · · · , ni.
Linkage Error Model:
Recall that the linkage error model used in Chapter 2 is developed by exploiting
the relationship between the observed y values in Fy and the unobserved y values
corresponding to x values in Fx. Here, we develop another linkage error model
based on the same idea but with a focus on the relationship between the observed
and unobserved x values.
Under the assumption that (1) there are no duplications in both files, (2) Sy ⊂
63
Sx, and (3) there is no error in partitioning records into small areas, the unobserved
x value corresponding to ỹij in Fy, x̃ij, must be one of those observed x values
{xi1, · · · ,xiN } in the same small area i from F ix. Let ljj′ be the unknown binaryi
matching status indicator for record pair (j, j′) in small area i; that is, lijj′ = 1 if
record j from Fy and record j
′ from Fx in small area i represent the same population
unit, and lijj′ = 0 otherwise. Then the relationship between x̃ij and {xi1, · · · ,xiN }i
can be modeled via the following identity:
∑
x̃ij = l
i
jj′xij′ , i = 1, · · · ,m, j = 1, · · · , ni.
j′=1
Let L = (li )niNii jj′ j=1,j′=1 be the ni×Ni matrix of matching status indicators, then
the above linkage error model can be written in the following matrix form:
X̃ i = LiX i, i = 1, · · · ,m. (4.3)
In other words, the unobserved x values for sampled units of Sy in small area i, is a
permutation of ni observed x values for sampled units of Sx in the same small area.
More generally, the following linkage error model can be used:
X̃ = LX. (4.4)
where X̃ is the n× p matrix of unobserved x values for units in Sy, X is the N × p
matrix of observed x values for units in Sx, and L = (l
nN
jj′)j=1,j′=1 is the n × N
matrix of matching status indicators, where ljj′ represents the matching status for
record pair (j, j′), with j ∈ S ′y, j ∈ Sx. This general model (4.4) can be used
when the records are partitioned into small areas with errors, and all the possible
record pairs (Nn in total) need to be considered for the purpose of record linkage.
64
The linkage error model (4.3) is a special case of the general model (4.4) with
L = diag(L1, · · · ,Lm).
In most situations, both the linkage error model built on y values and the one
built on x values can be used for the purpose of regression analysis or small area
estimation. However, in some situations, it is easier to implement our proposed
general methodology when the former model is used than when the latter model is
used. The opposite may be true in other situations. Below are two examples. For
the purpose of comparison, we follow the notation in this chapter.
Example 1: Consider the situation where a logistic regression model is used as
the first component of the general integrated model. It is easier to calculate the con-
ditional expectation of the observed y values in Fy given x values in Fx and compari-
son vector c when the linkage error model built on y values is used. When the linkage
error model built on y values is used, calculation of the condition(al expec)tation isn
quite simple. (That)is, E(ỹ i|X( i,C)i) = Qig(X iβ), where g(X iβ) = g(x
Tβ) iij andj=1
g(xTijβ) = exp x
T
ijβ /[1+exp x
T
ijβ ]. However, when the linkage error model built on
x values is used, E(ỹ i|X i,C i) = E [g(L(iX∑iβ)|X i,C i], w)he[re g(LiX(i∑β) is a ni×1 v)ec]-
tor with the jth element equal to exp Ni li T Ni i Tj′=1 jj′xij′β / 1 + exp j′=1 ljj′xij′β ,
whose conditional expectation is difficult to calculate.
Example 2: Consider the situation where a linear model with common re-
gression coefficient and equal variance is used as the first component of the general
integrated model. The conditional covariance of observed y values in Fy given x
values in Fx and comparison vectors c has a simpler format when the linkage model
built on x values is used. When the linkage error model built on y values is used,
65
under certain assumptions, the (j, j)th diagonal and the (j, t)th off-diagonal entries
of V ar(ỹ i|X i,C i) are given by
∑Ni ∑Ni ∑Ni
σi i 2 i i T 2 i i i 2jj = qjj′σe + qjj′(1− qjj′)(xij′β) , and σjt = qjj′qtj′σe .
j′=1 j′=1 j′=1
In contrast, when the linkage error model built on x values is used, under certain
assumptions,
∑Ni
σi = qijj jj′(1− qi Tjj′)(xij′β)2 + σ2e , and σijt = 0.
j′=1
When it comes to small area estimation, general linear models and general
linear mixed models are widely used as unit-level small area models. So we choose
to use the linkage error model built on X̃ , that is, X̃ i = LiX i, in order to obtain a
simple format of the conditional covariance matrix, since X̃ i = LiX i can only affect
the fixed effects term but not the random effects and mixed error terms.
4.4 Empirical Best Prediction Estimator
Under the squared error loss, the best prediction (BP) estimator θ̂BPi of θi is
equal to the conditional expectation of θi given the data under the general integrated
model when both φ in the small area model and ψ in the mixture model are known;
that is θ̂BPi = E(θi|ỹ ,X ,C ). The resulting BP estimator θ̂BPi can be expressed as a
function of ỹ ,X ,C,φ and ψ, say
θ̂BPi = π(ỹ ,X ,C ;φ,ψ) ≡ π(φ,ψ).
To obtain the BP estimator of θi, it is important to derive the conditional
distribution of ỹ (and v) given X and C under the general integrated model if the
66
small area model is a fixed (mixed) effect model. The empirical best prediction
(EBP) estimator is obtained from BP by substituting suitable estimators φ̂, ψ̂ of
model parameters φ, ψ:
θ̂EBPi = π(φ̂, ψ̂).
Therefore, once the expression of the BP estimator is obtained, the estimation of
θi reduces to the estimation of unknown parameters φ and ψ. Then the general
methodology given in Chapter 2 can be used.
Recall that when ψ is known, we can estimate φ by solving a system of es-
timating equations derived from the conditional distribution of ỹ given X and C .
The estimating equations for φ can be generally written as
∑m ′ ′
φ̂(ψ) : fi (ỹ i,X i,C i;φ,ψ) + a (φ,ψ) = 0. (4.5)
i=1
′ ′
where fi (ỹ i,X i,C i;φ,ψ) are vector-valued functions such thatE[fi (ỹ i,X i,C i;φ,ψ)] =
0 with respect to the general integrated model for true values of φ and ψ, a′(φ,ψ) is a
vector-valued functions which may depend on the joint distribution of {ỹ1, · · · , ỹm},
′
and 0 is a vector of zeros with the same order as φ. Note that fi (ỹ i,X i,C i;φ,ψ) has
the same order as φ. When a(φ,ψ) 6= 0, it plays the role of a modifier or penalizer.
In practice, however, value of ψ is unknown. The MLE estimate ψ̂ of ψ can
also be treated as the solution of a system of unbiased estimating equations.
∑m ′′
ψ̂ : fi (C i;ψ) = 0 (4.6)
i=1
′′ ′′
where where fi (C i;ψ) are vector-valued functions such that E[fi (C i;ψ)] = 0 for
true values of ψ, and 0 is a vector of zeros of the same order as ψ.
67
Based on (4.5) and (4.6), φ̂ ≡ φ̂(ψ̂) and ψ̂ are solutions to the following
estimating equations:
∑m
φ̂, ψ̂ : F (φ,ψ) = fi(ỹ i,X i,C i;φ,ψ) + a(φ,ψ) = 0 (4.7)
i=1
where
( )T ( )′ ′′ ′ T
fi(ỹ i,X i,C i;φ,ψ) = fi (ỹ i,X i,C i;φ,ψ)
T , f Ti (C i;ψ) , a(φ,ψ) = a (φ,ψ)
T ,0T ,
for i = 1, · · · ,m.
EBP
4.5 Estimation of the Mean Squared Error of θ̂i
In order to estimate the mean squared error (MSE) of the empirical best
prediction estimator θ̂EBP , the unified jackknife method proposed by Jiang, Lahiri
and Wan (2002, JLW) and its alternative proposed by Lohr and Rao (2009, LR)
can be used.
The jackknife replicate j is constructed by deleting (ỹj,X j,C j) from the data.
The delete-j estimates φ̂−j, φ̂−j of φ, ψ are the solutions of
∑
φ̂−j, ψ̂−j : fi(ỹ i,X i,C i;φ,ψ) + a−j(φ,ψ) = 0,
i 6=j
and the delete-j EBP estimate of θi is given by θ̂
EBP
i(−j) = π(φ̂−j, ψ̂−j), for j =
1, · · · ,m.
The JLW jackknife estimate of bias, variance, matrix MSE and scale MSE of
68
φ̂ are defined as
¯̂
biasJ(φ̂) = (m− 1∑)(φ − φ̂),− mm 1 − ¯̂ ¯̂varJ(φ̂) = (φ̂−j φ)(φ̂ − φ)T−j ,
m
m− 1 ∑j=1m
mseJ(φ̂) = (φ̂
T
(−j) − φ̂)(φ̂(−j) − φ̂) ,
m
j=1
¯̂ ∑
where φ = 1 m
m j=1
φ̂−j is the average of the replicate estimate of φ. The jackknife
estimates for ψ̂ are defined similarly.
In terms of estimating the mean squared error of θ̂EBPi , both the JLW and LR
methods are based on the following decomposition of MSE(θ̂EBPi ):
MSE(θ̂EBP ) = MSAE(θ̂EBPi i ) +MSE(θ̂
BP
i ), (4.8)
[ ] [ ]
where MSAE(θ̂EBP ) = E (θ̂EBPi i − θ̂BP )2 and MSE(θ̂BPi i ) = E (θ̂BPi − θ )2i .
The jackknife estimate of the first term on the right-hand side of (4.8) is the same
for both methods. That is,
m
m− 1 ∑
msae(θ̂EBPi ) = (θ̂
EBP − θ̂EBP )2i(−j) i . (4.9)m
i=1
The jackknife estimate for the second term is different for the two methods. The
method proposed by Jiang, Lahiri and Wan(2005) requires a closed-form expres-
sion for the mean squared error of θ̂BPi , while the method proposed by Lohr and
Rao (2009) requires a closed-form expression for the conditional mean squared
err[or (CMSE) of θ̂BPi ]. Suppose MSE(θ̂BPi ) = bi(φ,ψ) and CMSE(θ̂BPi ) =
E (θ̂BPi − θ )2i |ỹ ,X ,C = V ar(θi|ỹ ,X ,C ) = b′i(φ,ψ). Then the jackknife estimates
69
of the second term are given by
∑m [ ]
mse (θ̂BP
m− 1
J i ) = bi(φ̂, ψ̂)− bi(φ̂−j, ψ̂−j)− bi(φ̂, ψ̂) , (4.10)
∑m[ j=1 ]
mseL(θ̂
BP ) = b′i(φ̂, ψ̂)− b′i(φ̂ ′−j, ψ̂−j)− bi(φ̂, ψ̂) . (4.11)
j 6=i
Both jackknife estimators are nearly unbiased for MSE(θ̂EBPi ) in the sense of
having unconditional bias of order o(1/m) under regularity conditions. Compared
with the method proposed by Jiang, Lahiri, and Wan (2005), Lohr and Rao’s ap-
proach is less computationally intensive, because it does not require the calculation
of bi(φ,ψ), which usually involves integration with respect to ỹ i, X i and C i.
4.6 Summary
In this Chapter, a new linkage error model is developed based on the relation-
ship between observed and unobserved x values. It provides a connection between
the unit-level small area model and the mixture model in the general integrated
model for small area estimation. Under the general integrated model, we provide a
general methodology for obtaining an empirical best prediction (EBP) estimator of a
area-specific mixed parameter θ. A jackknife resampling method for estimating the
mean squared error of the EBP estimator is also provided. Application of the general
methodology is not limited to the mutual independence of measurements. It can
be applied to measurements that are correlated within small areas but independent
across small areas. Unit-level models such as general linear model with correlated
sampling errors within small areas, general linear mixed model with nested errors
can all be considered.
70
Chapter 5: Application to the General Linear Mixed Model
5.1 Introduction
To illustrate our general methodology for small area estimation, we consider
the situation where the general linear mixed model with block diagonal covariance
structure is used as the unit-level small area model. The Empirical Best Predic-
tion (EBP) estimator for a mixed parameter is derived under the general integrated
model. The closed-form expression for the conditional mean squared error of its cor-
responding Best Prediction (BP) Estimator is also provided for estimating its mean
squared error using the jackknife method provided by Lohr and Rao (2009). As a
special example, we consider the estimation of small area means when a nested error
linear model is used. We provide two methods for estimating the unknown parame-
ters: the Maximum Likelihood (ML) method and the Pseudo Maximum Likelihood
(PML) method. We also discuss the use of numerical algorithms in approximat-
ing the maximum likelihood estimates (MLE), including Newton-Raphson method
and Fish scoring algorithm, and further propose a quasi-scoring algorithm in order
to reduce the computational burden. In this chapter, we follow the notation from
Chapter 4.
71
5.2 General Linear Mixed Model with Block Diagonal Covariance
In this section, we consider a specific case where the first component of the
general integrated model is a general linear mixed model with block diagonal co-
variance structure, which may be expressed as:
ỹ i = X̃ iβ + Ũ iv i + ei, i = 1, . . . ,m. (5.1)
Here, ỹ = (ỹ )nii ij j=1 represents the ni × 1 vector of observed y values in small area i
from file Fy, X̃ i = (x̃
T ni
ij)j=1 is the ni×p matrix of unobserved x values corresponding
to ỹ i, β is an p× 1 vector of fixed effects, Ũ i is a known matrix of dimension ni×h,
v i is an h× 1 vector of area-specific random effects, ei is an ni× 1 vector of random
errors, and the v is and eis are independent with v i ∼ N(0,Gi) and ei ∼ N(0,Ri),
where Gi and Ri depend on variance parameters δ .
We are interested in estimating a linear combination of fixed effects β and
mixed effects v i, say θ
T T
i = ai β + bi v i, for specified known vectors ai of order p and
bi of order h, i = 1, . . . ,m.
Under the assumption that vi and ei are independent of X i and C i given X̃ i,
the conditional distribution of ỹ i andv i given X i and C i is     ỹ i QiX βind i   Σi Ũ iGi|X ,C ∼ N   , i i  . (5.2)
v 0 G Ti iŨ i Gi
Here, Q i ni Nii = (qjj′)j=1,j′=1 is the matrix of matching probabilities, and Σi = K i +
Ũ iG
T
iŨ i + Ri, where K i = (k
i
jt)
ni,ni
j=1,t=1 is a ni × ni diagonal matrix with diagonal
72
entry kijj equal to∑Ni
kijj = q
i i T T
jj′(1− qjj′)xij′ββ xij′ , i = 1, . . . ,m, j = 1, . . . , ni.
j′=1
The detailed proof of (5.2) is given in Appendix 5.4.1.
Therefore, the conditional distribution of v
( i
given ỹ i , X i and C)i
is
v i|
ind
ỹ i,X i,C i ∼ N G T −1 T −1iŨ i Σi (ỹ i −QiX iβ),Gi −GiŨ i Σi Ũ iGi . (5.3)
Based on results from (5.3), the BP estimator of v i and θi are given by
v̂BP = v̂BPi i (β,δ,ψ) = E(v i|ỹ i,X i,C i) = GiŨ T −1i Σi (ỹ i −QiX iβ), (5.4)
θ̂BP = θ̂BPi i (β,δ,ψ) = E(θi|ỹ i,X ,C ) = aTi i i β + bTi v̂
BP
i ,
and the conditional mean squared error of θ̂BPi is given by( )
CMSE(θ̂BP ) = bTV ar(v |ỹ ,X ,C )b = bT G −G Ũ TΣ−1i i i i i i i i i i i i Ũ iGi bi.
Note that the estimator θ̂BPi depends on the model parameters β , δ from the small
area model and ψ from the mixture model.
When ψ is known, the estimates β̂(ψ) and δ̂(ψ) of β and δ can be obtained
by using the maximum likelihood method based the conditional distribution of ỹ i
given X i and C i:
ỹ i|
ind
X i,C i ∼ N(QiX iβ,Σi).
When ψ is unknown, the MLE estimate ψ̂ of ψ can be obtained by using the
Expectation-Maximization algorithm, and β̂ = β̂(ψ̂), δ̂ = δ̂(ψ̂). By substituting β̂
for β , δ̂ for δ , and ψ̂ for ψ in (5.4), we can obtain the EBP estimator of θ:
θ̂EBPi = θ̂
BP
i (β̂ , δ̂ , ψ̂) = a
T
i β̂ + b
T BP
i v̂ i (β̂ , δ̂ , ψ̂) (5.5)
73
The jackknife methods are applicable to estimate MSE(θ̂BPi ), because the
data {ỹ i,X i,C i} are independent across small areas and the unknown parameters
{β,δ,ψ} are estimated using estimating equations. We first calculate the delete-j
estimators β̂−j, δ̂−j, ψ̂−j for each jackknife replicate j constructed by deleting data
{yj,X j,C j} in small area j from the full data, j = 1, . . . ,m, and then obtain the
delete-j estimators θ̂EBPi(−j) by replacing β , δ and ψ by β̂−j, δ̂−j and ψ̂−j, respectively.
The jackknife MSE estimator proposed by Lohr and Rao (2009) is recommended to
use since the closed-form expression of CMSE(θ̂BPi ) is available.
The above results can be easily adjusted for the case where a general linear
model is used as the first component of the general integrated model by letting Ũ i
be a matrix of zeros.
5.3 Nested Error Linear Model
In this section, we consider a specific example of the general linear mixed
model with block diagonal covariance structure. Here, a nested error linear model
is used as a unit-level small area model to characterize the relationshi∑p between y
and x. We are interested in estimating population means θi = ȳ
1
i = Np j∈U y fori i ij
each small area. The small area mean ȳi can be treated as a linear combination of
the fixed effect β and the mixed effect vi under the nested error linear model if the
population sizes Npi are sufficiently large. To be exact,
θi = ȳi ≈ x̄Ti β + vi, i = 1, . . . ,m.
∑
where x̄ = 1i p j∈U xij is the population mean of x for small area i.Ni i
74
5.3.1 Estimation of Small Area Mean Using Data From a Single File
First, we consider the ideal case where data for small area estimation are from
a single file as shown in Table 4.1. Or equivalently, the joint observations (yij,xij)
are available for each unit in a sample of ni units in small area i, and the population
means x̄i are known, the best prediction (BP) estimator of θi is given by
θ̂BP = x̄Ti i β + (1−Bi)(ȳi,s − x̄Ti,sβ) (5.6)
∑ ∑
where ȳi,s =
1 ni 1 ni
ni j=1
yij and x̄i,s = ni j=1 xij are sample means of y and x in
small area i, respectively, and Bi = σ
2/(σ2e e + n
2
iσv). An empirical best prediction
(EBP) estimator of small area mean ȳi can be obtained by substituting the unknown
parameters φ = (βT , σ2 2v , σe) with their suitable estimates φ̂ = (β̂
T , σ̂2v , σ̂
2
e) in the
above formula for BP estimator. Typically, the weighted least squares methods is
used to estimate β , and methods of moments (MOM), maximum likelihood method
(ML), or restricted maximum likelihood (REML) method is used to estimate the
variance components σ2 and σ2v e . Under regularity conditions, the mean squared
error (MLE) of the EBP estimator of small area mean θ = ȳi can be estimated by
M̂SE(θ̂EBPi ) ≈ g1i(σ̂2, σ̂2 2 2 2 2v e) + g2i(σ̂v , σ̂e) + g3i(σ̂v , σ̂e)
where the functions g1i(·), g2i(·), g3i(·) are given in Rao (2003), Chapter 7.
5.3.2 Estimation of Small Area Mean Using Data From Two Files
Now, we consider the case where observations on y and x are separately
recorded in two different files (Fy and Fx) and the matching status between records
75
from Fy and Fx is unknown. The data layout is shown in Table 4.2.
Assuming that the nested error linear model also holds for all sampled units
in file Fy, the general integrated model then becomes:
iid iid
(a) ỹij = x̃
T
ijβ + vi + eij, vi ∼ N(0, σ2), e ∼ N(0, σ2);∑ v ij eNi
(b) x̃ iij = ljj′xij′ ; (5.7)
j′=1
i
(c) lijj′ ∼
id ind ind
Binom(1, π), ci ijj′k|ljj′ = 1 ∼ Binom(1,mk), cijj′k|lijj′ = 0 ∼ Binom(1, uk);
where vi are independent of eij, i = 1, . . . ,m, j = 1, . . . , ni.
As discussed above, the small area mean can be expressed as ȳ Ti ≈ x̄i β + vi
under the nested error linear model, where x̄i is known for each small area i, i =
1, . . . ,m. Then, derivation of the best prediction estimator of ȳi reduces to the
derivation of the best prediction estimator of random effect vi. This requires us
to obtain the conditional distribution of vi given observed data under the general
integrated model.
Under the nested error linear model as described in (a), it is not difficult to
obtain that
E(ỹij|x̃ Tij) = x̃ijβ, Cov(ỹij, ỹit|x̃ij, x̃it) = σ2v ,
V ar(ỹ 2 2ij|x̃ij) = σv + σe , Cov(ỹ , v |x̃ ) = σ2ij i ij v . (5.8)
for integers i ≤ i ≤ m, 1 ≤ j, t ≤ ni, 1 ≤ j′ ≤ Ni.
Based on the linkage error model as described in (b), for any value of β , we
can get ∑Ni
x̃Tijβ = l
i
jj′x
T
ij′β. (5.9)
j′=1
76
Under the mixture model as described in (c), by using Bayes’ rule and the LAR
assumption, we can obtain that P (lijj′ = 1|cijj′) = qijj′ . Thus, for any 1 ≤ j, j′ ≤ ni,
1 ≤ t, t′ ≤ Ni, and j 6= t, we have
E(li i i ijj′ |C i) = qjj′ , V ar(ljj′ |C i) = qjj′(1− qijj′), Cov(li ijj′ , ltt′ |C i) = 0. (5.10)
By combining results from (5.9) and (5.10), we can get
∑Ni
E(x̃T i Tijβ |X i,C i) = qjj′xij′β,
j∑′=1Ni
V ar(x̃Tβ |X ,C ) = qi (1− qi )(xT 2ij i i jj′ jj′ ij′β) , (5.11)
j′=1
Cov(x̃T Tijβ, x̃itβ |X i,C i) = 0,
for integers 1 ≤ i ≤ m, 1 ≤ j, t ≤ ni, (j 6= t).
Under the assumption that vi and ei are independent of X i and C i given X̃ i,
by combining results from (5.8) and (5.11), we can prove that
∑Ni
E(ỹ |X ,C i Tij i i) = qjj′xij′β,
j∑′=1Ni
V ar(ỹ i i T 2 2 2ij|X i,C i) = qjj′(1− qjj′)(xij′β) + σv + σe , (5.12)
j′=1
Cov(ỹij, ỹ
2
it|X i,C i) = σv ,
Cov(ỹij, vi|X i,C i) = σ2v ,
for integers 1 ≤ i ≤ m, 1 ≤ j, t ≤ ni, (j =6 t). The detailed proof of (5.11) and
(5.12) is given in Section 5.4.2 and Section 5.4.3, respectively.
77
The above results can also be written in the following matrix form:
E(ỹ i|X i,C i) = QiX iβ,
V ar(ỹ |X ,C ) = K + σ21 1T + σ2i i i i v n n eIn := Σi i i i
Cov(ỹ i, vi|X i,C i) = σ2v1n .i
where 1n is the ni i × 1 vector of ones, In is the ni × ni i identity matrix, and K i =
(ki )nini is an n × n diagonal matrix with diagonal entry kijj′ j=1j′=1 i i jj equal to
∑Ni
kijj = q
i
jj′(1− qijj′)(xT 2ij′β) .
j′=1
Here, K i can also be written in a matrix form, that is,
{[ ] }
K i = diag Qi ◦ (1n 1TN −Qi) [(X iβ) ◦ (X iβ)] .i i
where M 1 ◦M 2 is the Hadamard product of any two matrix M 1 and M 2 of the
same dimension, and diag{v} is a diagonal matrix with diagonal entries same as
entries of a vector v. Note that both Σi and K i depends on unknown parameter
φ = (βT , σ2v , σ
2
e) from the nested error linear model and unknown parameter ψ =
(π,m1, . . . ,mK , u1, . . . , uK) from the mixture model.
Based on the above analysis, the conditional distribution of (ỹ i, vi) given X i
and C i is then given by
     ỹ i |X i,C i ∼ N Q 2iX iβ  Σi 1n σi v,  .
vi 0 1
T 2 2
n σ σi v v
78
Therefore, for known φ and ψ, the BP estimate of vi and θi are given by
v̂BP 2 T −1i = E(vi|ỹ i,X i,C i) = σv1n Σi (ỹ i −QiX iβ),i
θ̂BP = x̄Ti i β + σ
2
v1
T
n Σ
−1
i i
(ỹ i −QiX iβ) (5.13)
The EBP estimate θ̂EBPi is obtained by replacing φ and ψ by their estimates φ̂ and
ψ̂ in formula (5.13). That is,
θ̂EBP
−1
i = x̄
T
i β̂ + σ̂
2 T
v1n Σ̂i i (ỹ i − Q̂iX iβ̂), (5.14)
where Σ̂i = Σi(φ̂, ψ̂) is an estimate of Σi. The methods for estimating unknown
parameters φ and ψ will be introduced in the subsequent part.
The jackknife methods described in this dissertation can be used to the mean
squared error of θ̂EBPi . The choice of the jackknife methods depends on whether
a closed-form expression for MSE(θ̂BPi ) or CMSE(θ̂
BP
i ) is available. We prove in
Section 5.4.4 that
CMSE(θ̂BPi ) = σ
2 − σ21Tv v n Σ−1 2i i 1n σi v ,
MSE(θ̂BP 2 2 Ti ) = σv − σv1n E(Σ−1i )1n σ2v . (5.15)i i
Note that there exists a closed-form expression for CMSE(θ̂BPi ), but not for
MSE(θ̂BPi ) due to the difficulty in calculating the component E(Σ
−1
i ). Therefore,
the estimate of MSE(θ̂EBPi ) can be obtained by using the jackknife method pro-
posed by Lohr and Rao (2009) rather than the one proposed by Jiang, Lahiri and
Wan (2002).
79
5.3.3 Estimation of φ: Maximum Likelihood Method
When ψ is known, we can use the maximum likelihood method to estimate
φ. Based on the facts that f(ỹ i,X i,C i;φ,ψ) = f(ỹ i|X i,C i;φ,ψ)f(X i,C i;ψ) and
f(X i,C i;ψ) does not depends on φ, the log-likelihood function of φ based on data
{y i,X i,C i, i = 1, . . . ,m} can be expressed as
∑m
l(φ; ỹ ,X ,C ) ∝ ln (f(ỹ i|X i,C i;φ,ψ))
i=1 ∑m { }
= −1 ni ln(2π) + ln |Σi |+ (ỹ T −1i −QiX iβ) Σi (ỹ i −QiX iβ) .2
i=1
The first derivatives of l(φ; ỹ ,X ,C ) are given by:
m {
∂l(φ; ỹ ,X ,C ) 1 ∑ ( )
= − tr Σ−1i Di,k − 2δTkXT T −1i Qi Σi (ỹ i −QiX iβ)∂βk 2 i=1 }
− (ỹ i −Qi{X iβ)
TΣ−1i D
−1
i,kΣ
∑ i
(ỹ i −QiX iβ) , (5.16)
m
∂l(φ; ỹ ,X ,C ) −1
(
= tr Σ−
)
1
i 1n 1
T
∂σ2 2 i niv i=1 }
− (ỹ −Q X β)TΣ−11 1Ti i{ i i n n Σ
−1
i (ỹ i −QiX iβ) ,i i
m }
∂l(φ; ỹ ,X ,C ) 1 ∑ ( )
= − tr Σ−1 − (ỹ −Q X β)TΣ−1Σ−1(ỹ −Q X β) ,
∂σ2 i
i i i i i i i i
e 2 i=1
where tr(M ) is the trace of any square matrixM , δk is the kth column of the identity
matrix I p∑, andDi,k = (d
i nini
jj′,k)j=1j′=1 is a ni×ni diagonal matrix with diagonal entries
di = 2 Ni qi (1− qijj,k j′=1 jj′ jj′)(xTij′β)xij′k for i = 1, . . . ,m, j = 1, . . . , ni, k = 1, . . . , p.
80
The second derivatives of l(φ; ỹ ,X ,C ) are given by:
m {
∂2l(φ; ỹ ,X ,C ) 1 ∑ ( )
= − tr −Σ−1i D −1 −1i,tΣ∂β i
Di,k + Σi Di,kt
t∂βk 2 i=1
+ 2δTXT T −1 −1k i Qi Σi Di,tΣi (ỹ i −QiX iβ)
+ 2δTXT T −1 T T T −1 −1k i Qi Σi QiX iδ t + 2δ t X i Qi Σi Di,kΣi (ỹ i −QiX iβ)
+ 2(ỹ i −QiX iβ)TΣ−1i D Σ−1i,t i D −1i,kΣi (ỹ i −}QiX iβ)
− (ỹ −Q X β)TΣ−1i i{ i i D Σ
−1
i,kt i (ỹ i −QiX iβ)
∂2
m
l(φ; ỹ ,X ,C ) 1 ∑ ( )
= − − tr Σ−1i D Σ−11 T2 i,k i n 1∂β ∂σ 2 i nik v i=1
+ 2δT T T −1 T −1kX i Qi Σi 1n 1 Σi ni i (ỹ i −QiX iβ) }
+ 2(ỹ −Q X β)TΣ−1D −1i {i i i i,kΣi 1 1
T Σ−1n n i (ỹ i −QiX iβ) ,i i
∂2
m
l(φ; ỹ ,X ,C ) 1 ∑ ( )
= − − tr Σ−1i D −1i,kΣi + 2δTkXTi QTΣ−1Σ−1i i i (ỹ i −Q2 } i
X iβ)
∂βk∂σe 2 i=1
+ 2(ỹ i −Q X β)TΣ−1 −1 −1{i i i Di,kΣi Σi (ỹ i −QiX iβ) , (5.17)m
∂2l(φ; ỹ ,X ,C ) ∑ ( )
= −1 tr −Σ−11 1T −1 T
∂σ2 2
n
∂σ 2 i i n
Σi 1n 1i i ni
v v i=1 }
+ 2(ỹ i −QiX iβ)TΣ−1i 1n 1T Σ−11 T −1i ni i n 1n Σi (ỹ i −QiX iβ) ,
2 ∑ i im {∂ l(φ; ỹ ,X ,C ) −1 (= − tr Σ− )11 1T Σ−1n
∂σ2v∂σ
2
e 2
i i ni i
i=1 }
+ 2(ỹ −Q X T −1 T −1 −1i {i iβ) Σi 1n 1n Σi Σi (ỹ i −QiX iβ) ,i i
∂2
m
l(φ; ỹ ,X ,C ) 1 ∑ ( )
= − − tr Σ−1Σ−1
∂σ2∂σ2 2 i ie e i=1 }
+ 2(ỹ i −QiX iβ)TΣ−1 −1 −1i Σi Σi (ỹ i −QiX iβ) ,
where D i nini ii,kt = (djj′,kt)j=1j′=1 is an ni × ni matrix with diagonal entries djj,kt =
81
∑
2 m i ii=1 qjj′(1−qjj′)xij′txij′k, i = 1, . . . ,m, j = 1, . . . , ni, k = 1, . . . , p and t = 1, . . . , p.
Recall that the maximum likelihood estimation (MLE) of φ is obtained by
setting the score function S(φ) = ∂l(φ;ỹ,X,C) to zero. Here, S(φ) is our estimating
∂φ
functions. The components of S(φ) are given by equations in (5.16). Based on the
above analysis, we can see that there is no closed-form expression for the solutions
of S(φ) = 0. In this case, the equations S(φ) = 0 must simultaneously be solved
numerically by using some iterative algorithms. In the following part, we discuss
the advantages and drawbacks of two existing algorithms for optimizing l(φ; ỹ ,X ,C )
(Newton-Raphson method and Fisher scoring algorithm), and propose a new algo-
rithm which is expected to be computationally friendly and robust to poor starting
values.
5.3.4 Numerical Algorithms
The Newton-Raphson method starts with an initial guess φ(0) for the root of
S(φ), and then takes the following iteration step until a convergence criterion is
reached:
φ(t+1) = φ(t) + J−1(φ(t))S(φ(t)) (5.18)
where φ(t) is the approximation of φ at iteration t, and J (φ) is the negative value
2
of the gradient of S(φ), that is, J (φ) = −∂S(φ) = −∂ l(φ;ỹ,X,C)T . The components∂φ ∂φ∂φ
of J (φ) can be obtained by taking the negative values of the second derivatives of
l(φ; ỹ ,X ,C ) given by equations in (5.17). We can see that J (φ) is a very complicated
matrix depending on data {ỹ i,C i,X i, i = 1, . . . ,m} and parameters φ and ψ.
82
The Newton-Raphson method is easy to understand and can be implemented
here since the matrix J (φ) is available. It is expected that the Newton- Raphson’s
algorithm would converge rapidly because of its quadratic converge rate. However,
it can still take long time to converge due to the difficulty in calculating the com-
plicated matrix J (φ(t)) in each iteration. Also, the Newton-Raphson method may
not be effective if the iterations begin with a poor starting values φ(0).
One alternative to Newton-Raphson method is the Fisher scoring algorithm,
which is commonly used for optimizing log-likelihood functions. The Fisher scoring
algorithm is obtained by replacing the matrix J (φ) in the iteration equation (5.22)
by its expectation, i.e., the Fisher information matrix I (φ). The iteration equation
then becomes
φ(t+1) = φ(t) + I−1(φ(t))S(φ(t)).
The Fisher scoring algorithm can reduce the computations in Newton-Raphson
method by simplifying the matrix J (φ). Jennrich and Sampson (1976) demonstrated
its robustness to poor staring values. However, the Fisher scoring algorithm requires
a closed form expression of the Fisher information matrix I (φ), which is not available
in our case. By the law of total expectations,
[ ] [ ( )]
∂2− l(φ; ỹ ,X ,C ) ∂
2l(φ; ỹ ,X ,C )
I (φ) = E = E E − |X,C .
∂φ∂φT ∂φ∂φT
Therefore, in order to derive I (φ), we need to take expectation of both sides of
the equations in (5.19) with respect to X and C . The calculation involves taking
expectations of complicated terms QTΣ−1Q −1 T −1 −1 −1i i i, Di,tDi,k, Σi 1n 1n Σi i i , Σi Σi and
83
Σ −1D −1i i,kΣi , which is not practical. Therefore, the Fisher scoring algorithm cannot
be implemented here despite of its advantages over the Newton-Raphson method.
Based on the above analysis, both the Newton-Raphson method and the Fisher
scoring algorithm are not ideal for approximating the MLE estimate of φ. Here,
inspired by the derivation of the Fisher scoring algorithm, we propose a new iterative
algorithm: the quasi-scoring algorithm. The matrix J (φ) in the Newton-Raphson
method is replaced by its conditional expectation given the observed comparison
vectors X and C , I c(φ) = E[J (φ)|X,C ], rather than its expected value I (φ). Then
the interaction equation used for the quasi-scoring algorithm is given by:
φ(t+1) = φ(t) + I−1c (φ
(t))S(φ(t)).
( The components of I)c(φ) are given by:
2 ∑m { ( )}
−∂ l(φ; ỹ ,X ,C ) 1E |X,C = 2δTXTQTΣ−1Q −1 −1
( ) k i i i i
X iδ t + tr Σi Di,tΣi Di,k ,∂βt∂βk 2 i=1
2 m
−∂ l(φ; ỹ ,C ) 1
∑
E |X,C = 1T Σ−1D Σ−1n i i,k i 1 ,( ∂β ∂σ2k v ) n2 i ii=1
2 m
−∂ l(φ; ỹ ,X ,C ) | 1
∑ ( )
E X,C = tr Σ−1D −1i,kΣ , (5.19)( ∂βk∂σ2 i ie ) 2 i=1
2 m
−∂ l(φ; ỹ ,X ,C ) | 1
∑
E X,C = 1T Σ−11 1T Σ−11 ,
( ∂σ2∂σ2 ) 2 ni i ni ni i niv v i=1
∂2
m
− l(φ; ỹ ,X ,C ) | 1
∑
E X,C = 1T Σ−1Σ−11
( n
,
∂σ2∂σ2 ni i i iv e ) 2 ∑i=12 m ( )
E −∂ l(φ; ỹ ,X ,C ) | 1X,C = tr Σ−1 −1
∂σ2∂σ2 2 i
Σi .
e e i=1
ind
Th[e derivation of (5.19) is based on] the fact that ỹ i|X i,C i ∼ N(QiX iβ,Σi) and
E (ỹ i −QiX β)Ti M (ỹ i −QiX iβ)|C = tr (MΣi) for any ni × ni matrix M , which
may depends on X and C but not on ỹ.
84
We can see that the format of I c(φ) is much simpler than that of J (φ) by
comparing their components shown in (5.19) and (5.17). So theoretically the quasi-
scoring algorithm can reduce computations at each iteration when compared to
the Newton-Raphson method, while the Newton-Raphson method can take less
number of iterations compared to the quasi-scoring algorithm. Also, the quasi-
scoring algorithm may preserve the property of robustness to poor starting values
of Fisher scoring, while the Newton-Raphson method may diverge if the starting
point is too far from the root of S(φ). Here, it is highly recommended to start an
iterative procedure with the quasi-scoring algorithms in the first few steps and then
change to Newton-Raphson method later. In this way, the quasi-scoring algorithm
can reduce computations and generate a find a good starting value for iterations
using the Newton-Raphson method.
All iterative methods do have their drawbacks. First, they may not converge
well due to poor starting values. The common approach is to use different starting
values and hope the algorithm will converge. Second, the resulting approximated
solutions φ̃ = (β̃ , σ̃2v , σ̃
2 T
e) of S(φ) = 0 may not be within the parameter space of
φ = (β, σ2v , σ
2
e)
T . For example, σ̃2 2v ≤ 0 or σ̃e ≤ 0. This inaccuracy may be caused by
the different ranges of the parameters. One may improve the accuracy by scaling
the parameters σ2v and σ
2
e so that they all have approximate the same range as β .
85
5.3.5 Estimation of φ: Pseudo Maximum Likelihood Method
Pseudo maximum likelihood method was first introduced by Samart and
Chambers (2014) as an appropriate modifications to the maximum likelihood
method of variance components estimation in presence of linkage errors. Based on
the above analysis shown in 5.3.3, we can see that the maximum likelihood method
may not be easy to use in practice. This is mainly caused by the difficulty of calcu-
lating the derivatives of l(φ; ỹ i,X i,C i) due to the complexity of Σi. The matrix Σi
depends on not only the variance components σ2, σ2v e but also the regression coeffi-
cient β . The basic idea of the pseudo maximum likelihood method is to reduce the
computation complexity by assuming that Σi is fixed in β , and depends on σ
2
v and
σ2e only.
Whenψ is known, the log-likelihood function of φ based on data {ỹ i,X i,C i, i =
1, . . . ,m} is given by
m
−1
∑{
l(φ; ỹ ,X ,C ) = n ln(2π) + ln |Σ |+ (ỹ −Q X β)TΣ −
}
1
i i i i i i (ỹ i −QiX iβ) .
2
i=1
Under the assumption that Σi is fixed in β , the first derivatives of l(φ; ỹ ,X ,C )
with respect to β , σ2v and σ
2
e are given by
m
∂l(φ; ỹ ,X ,C ) ∑
= XT T −1i Qi Σi (ỹ i −QiX iβ),∂β
i=1
m
∂l(φ; ỹ ,X ,C ) −1
∑{ }
= 1T Σ−1n i 1n − (ỹ i −Q X β)TΣ−1 T −12 i i i i i 1n 1i n Σ (ỹ −Qi i i iX iβ) ,∂σv 2 i=1
m
∂l(φ; ỹ ,X ,C ) 1 ∑{ ( ) }
= − tr Σ−1i − (ỹ −Q X β)TΣ−1i i i i Σ−1i (ỹ i −QiX iβ) . (5.20)∂σ2e 2 i=1
The pseudo maximum likelihood estimate (PMLE) of φ is obtained by solving
86
the following estimating functio(ns: )T
∂l(φ) ∂l(φ) ∂l(φ)
S̃(φ) = , , = 0
∂β ∂σ2 ∂σ2v e
Note that the resulting solution of S̃(φ) = 0 is referred as the PMLE estimator
of ψ because the estimating equations S̃(φ) is derived under the assumption Σi is
fixed in β . However, Σi is actually a function of β , σ
2
v and σ
2
e . Therefore, the
estimating equations S̃(φ) = 0 cannot be solved analytically. Again, we can use
either Newton-Raphson algorithm or quasi-scoring algorithm to solve the estimating
equations iteratively.
From (5.20), we can obtain the second derivatives of l(φ):
∂2
m
l(φ; ỹ ,X ,C ) ∑
= − XTQTΣ−1QiX i,
∂β∂βT i i i
i=1
m
∂2l(φ; ỹ ,X ,C ) ∑
= − XTi QT −1 T −12 i Σi 1n 1n Σ (ỹi i i i −QiX iβ),∂β∂σv i=1
m
∂2l(φ; ỹ ,X ,C ) ∑
= − XTQTΣ−1 −1i i i Σi (ỹ i −QiX iβ),∂β∂σ2e i=1
m {
∂2l(φ; ỹ ,X ,C ) 1 ∑
= − − 1T Σ−11 1T Σ−11
2 2 ni i ni ni i n
(5.21)
∂σv∂σv 2
i
i=1 }
+ 2(ỹ T −1 T −1 T −1i −Q{iX iβ) Σi 1n 1n Σi 1n 1n Σi (ỹ i −QiX iβ) ,
∂2l(φ; ỹ ,X ,C ) 1 ∑
i i i i
m
= − − 1T Σ−1Σ−11
∂σ2∂σ2 2 ni i i
ni
v e i=1 }
+ 2(ỹ −Q X β)TΣ−1Σ−1i {i i i i 1 1
T Σ−1ni ni i (ỹ i −QiX iβ) ,
2 m∂ l(φ; ỹ ,X ,C ) ∑ ( )
= −1 − tr Σ−1Σ−1
∂σ2e∂σ
2
e 2
i i
i=1 }
+ 2(ỹ i −QiX β)TΣ−1 −1i i Σi Σ−1i (ỹ i −QiX iβ) .
The Newton-Raphson for approximating the pseudo maximum likelihood es-
87
timator of φ = (β, σ2v , σ
2
e) has the following iteration equations:
φ(t+1) = φ(t) + J̃−1(φ(t))S̃(φ(t)) (5.22)
where φ(t) is the approximation of φ at iteration t, and J̃ (φ) is the negative value of
the gradient of S̃(φ). The components of J̃ (φ) can be obtained by taking the negative
values of the second derivatives o[f l(φ; ỹ ,X ,C ) given by equations in](5.21).
By using the fact that E (ỹ i −Q X β)Ti i M (ỹ i −QiX iβ)|X,C = tr (MΣi)
and that E(ỹ i|X i,C i) = QiX iβ , we can obtain the following results from (5.21):[ ] m
−∂
2l(φ; ỹ ,X ,C ) ∑
E |X,C = XTQTΣ−1i i i QiX[ T ] i
,
∂β∂β
i=1
∂2− l(φ; ỹ ,X ,C )E [ |X,C = 0 ,∂β∂σ2 pv ]
∂2− l(φ; ỹ ,X ,C )E [ |X,C = 0p, (5.23)∂β∂σ2e ]
2 ∑m
−∂ l(φ; ỹ ,X ,C ) | 1E X,C = 1T Σ−11 1T Σ−1
[ ∂σ2∂σ2 ] 2 ni i ni n
1 ,
i i ni
v v i=1
∂2
m
− l(φ; ỹ ,X ,C )
∑
E | 1X,C = 1T Σ−1Σ−11 ,
[ ∂σ2 nv∂σ2 ni i i ie ] 2 i=1
2 m
−∂ l(φ; ỹ ,X ,C ) | 1
∑ ( )
E X,C = tr Σ−1Σ−1 ,
∂σ2∂σ2 i ie e 2 i=1
where 0p is a p-order vector of zeros.
Thus, the quasi-scoring algorithm for approximating the pseudo maximum
likelihood estimator of φ = (β, σ2v , σ
2
e) has the following iteration equations:
φ(t+1) = φ(t) + {Ĩc(φ(t))}−1S̃(φ(t))
88
where φ(t) is the estimate of φ at iteration t, and Ĩc(φ) is given by
∑

X −1iQiΣi QiX i 0p 0p 
m
1  Ĩc(φ) =  0T 1T −1 T −1 T −1 −1  . (5.24)2 p n Σi i 1n 1 Σi ni i 1n 1i n Σii=1
− − ( i
Σi 1ni
0T 1T Σ 1Σ 11 tr Σ−
)
1Σ−1p ni i i ni i i
5.4 Proofs
5.4.1 Proof of (5.2)
Theorem: When the linear mixed model (5.1), the linkage error model (4.3) and
the mixture model (2.4) are used as the three components of the general integrated
model, the conditional distribution of ỹ i and v i given X i and C i can be derived
under the assumption that v i and ei are independent of X i and C i given X̃ i. That
is,      ỹ i QiX iβ| indX  i,C i ∼ N    Σi Ũ iGi,  . (5.25)
v i 0 G Ũ
T
i i Gi
Here, Q = (qi ni Nii jj′)j=1,j′=1 is the matrix of matching probabilities, and Σi = K i +
Ũ TiGiŨ i + Ri, where K i = (k
i ni,ni
jt)j=1,t=1 is a ni × ni diagonal matrix with diagonal
entry kijj equal to
∑Ni
kijj = q
i
jj′(1− qi T Tjj′)xij′ββ xij′ , j = 1, . . . , ni, i = 1, . . . ,m.
j′=1
Proof: Under the LAR assumption that P (L|ỹ ,X ,C ) = P (L|C ), we can obtain
the following from the mixture model described in (2.4):
E[Li|ỹ i,X i,C i] = E[Li|C i] = Qi. (5.26)
89
Recall that Qi depends on C i but not X i. Therefore,
E(Li|X i,C i) = E[E(Li|ỹ i,X i,C i)|X i,C i] = E[Qi|X i,C i] = Qi. (5.27)
Under the linkage error model that X̃ i = LiX i, by applying the result in (5.27), it
is not difficult to obtain that
[ ]
E [X̃ iβ |X i,C i] = E [[LiX iβ |X i,C i] = E] [Li|X[i,C i]X iβ =]Q( iX[iβ, ])T
V ar X̃ iβ |X i,C i = E [X̃ iββ
TX̃Ti |X i,C i]− E X̃ iβ |X i,C i E X̃ iβ |X i,C i
= E X̃ ββTX̃Ti i |X i,C i −QiX iββTXT Ti Qi .
Now, we partition X̃ T T iT iTi and Qi into ni row vectors x̃i1, . . . , x̃in andi ∑q1 , . . . , qn ,i
respectively, where q iT ij = (qj1, . . . , q
i
jN ). By∑using the fact that x̃ =
Ni i T
i ij j′=1
ljj′xij′
under the lin[kage model ]and that q iTX = Ni i Tj i j′=1 qjj′xij′ , we can achieve the (j, t)
entry of V ar X̃ iβ |X i,C i , that is,
[ ] [ ] ( )
V ar X̃ β |X ,C = E X̃ ββT T T T Ti i i i X̃ |X i,C i − QiX iββ X Q
j,t [ i ] i ij,t j,t
= E x̃(T T iTijββ x̃it|X i),C i −(qj X iββTX ∑ ∑ )
T
i q
i
t 
Ni N
T ∣
i ∣
= E li T T i Tjj′xij′ ββ l ∣ tt′xit′ ∣X i,C i( j′=1 ) ( t′∑ =1 )N Ti ∑Ni
− qi xT ββT i T[ jj′ ij′ qjt′xit′j′=∑1 t′=1N ∣ ]i ∑Ni ∣
= E li T T i ∣jj′xij′ββ xit′ltt′∣X i,C i
∑ j′=1 t′=1Ni ∑Ni
− qi xT ββT ijj′ ij′ X it′qtt′
j∑′=1 t∑′=1Ni Ni ( [ ] )
= E li li i i T Tjj′ tt′|C i − qjj′qtt′ xij′ββ xit′ ,
j′=1 t′=1
90
for j = 1, . . . , ni and t = 1, . . . , ni.
Since lijj′s are conditionally independent given C i with P (l
i
jj′ = 1|C i) = qijj′
under the mixture model for j = 1, . . . , ni and j
′ = 1, . . . , Ni, then
[ ] 
 i ′qjj′ if j = j and t = t
′
E li ijj′ltt′ |C i =  .qijj′qitt′ otherwise
Therefo[re, the (j, t) ]off-diagonal entry (i.e., when j 6= t) and the (j, j) diagonal
entry of V ar X̃ iβ |X i,C i are given by
[ ] ∑Ni ∑Ni ( )
V ar X̃ β |X ,C = qi i i i T Ti i i jj′qtt′ − qjj′qtt′ xij′ββ xit′ = 0 (j 6= t),
[ ]j,t j∑′=1 (t′=1Ni [ ] )
V ar X̃ β |X ,C = E li i i i T Ti i i jj′ljj′ |C i − qjj′qjj′ xij′ββ xij′
j,j
∑j′=1Ni ∑( [ ] )
+ E li ijj′ljt′|C i − qi i T Tjj′qjt′ xij′ββ xit′
j∑′=1 t(′ 6=j′Ni )
= qijj′ − qi i T Tjj′qjj′ xij′ββ xij′
∑j′=1Ni ∑( )
+ qi qi − qi i T Tjj′ jt′ jj′qjt′ xij′ββ xit′
j∑′=1 t′ 6=j′Ni ( )
= qi i Tjj′ 1− qjj′ xij′ββTxij′
j′=1
= kijj.
Based on the above analysis,
[ ] [ ]
E X̃ iβ |X i,C i = QiX iβ, V ar X̃ iβ |X i,C i = K i. (5.28)
From the general linear mixed model, we have
[ ] ( ) ( )
E ỹ i|X̃ i = X̃ iβ, V ar ỹ Ti|X̃ i = Ũ iGiŨ i +Ri, Cov ỹ i, v i|X̃ i = Ũ iGi (5.29)
91
Under the assumption that v i and ei is independent of X i and C i given X̃ i, ỹ i is
independent of X i and C i given X̃ i. By combining (5.28) and (5.29), we have
[ ( ) ] [ ( ) ]
E [ỹ i|X i,C i] = E [E ỹ i|X̃]i,X i,C i |X i,C i = E E ỹ i|X̃ i |X i,C i
= E X̃(iβ |C[ i = QiX iβ,] )
V ar (ỹ i|X i,C i) = V a[r E(ỹ i|X̃ i,X i,C i) |X i,C i]
+ E V(ar [ỹ i|X̃ i,X] i,C i |X) i,C i[ ( ) ]
= V ar (E ỹ i|X̃ i, )|X i,C i[ + E V ar ỹ i|X̃ i |]X i,C i
= V ar X̃ iβ |X i,C i + E Ũ iGiŨ Ti +Ri|X i,C i
= K i + Ũ iG Ũ
T
i i +Ri
= Σi, ( [ ] [ ] )
Cov (ỹ i, v i|X i,C i) = Co[v E(ỹ i|X̃ i,X i,C i , E) v i|X̃ i,]X i,C i |X i,C i
+ E C(ov [ỹ i, v i|X̃] i,X[i,C i ]|X i,C i )
= Co[v E(ỹ i|X̃ i , E) v i|X̃ i]|X i,C i
+ E C(ov ỹ i, v i|X̃ i |)X i,C i[ ]
= Cov X̃ iβ, 0|X i,C i + E Ũ iGi|X i,C i
= Ũ iGi.
Therefore, the conditional distribution of ỹ i andv i given X i and  
C i is
ỹ i QiX βind i   Σi Ũ iGi|X i,C i ∼ N ,  .
v i 0 GiŨ
T
i Gi
92
5.4.2 Proof of (5.11)
(∑ )Ni
E(x̃Tijβ |X ,C ) = E li xTi i jj′ ij′β |X i,C i
∑ j′=(1Ni )
= E lijj′|X i,C i xTij′β
j∑′=1Ni
= E(li ijj′ |cjj′)xij′β
j∑′=1Ni
= qi T(jj′xij′β,j′=1 ∑ )Ni
V ar(x̃Tijβ |X i,C i) = V ar li Tjj′xij′β |X i,C i
∑ j′=1Ni
= V ar(lijj′|X i,C i)(xTij′β)2
j∑′=1Ni
= V ar(li i T 2jj′|cjj)(xij′β)
j∑′=1Ni
= qi(jj′(1− q
i T 2
jj′)(xij′β) .
j′=1 ∑ )Ni ∑Ni
Cov(x̃Tijβ, x̃
T
itβ |X i,C i) = Cov lijj′xT i Tij′β, ltt′xit′β |X i,C i
∑∑j′=1 t′=1Ni Ni ( )
= Cov li T i Tjj′xij′β, ltt′xit′β |X i,C i
j∑′=1 t′=1Ni ∑Ni
= Cov(li i T Tjj′ , ltt′|C i)(xij′β)(xit′β)
j′=1 t′=1
= 0.
93
5.4.3 Proof of (5.12)
Assuming vi and eij is conditionally independent of X i and C i given X̃ i,
then we have E(ỹij|x̃ij,X i,C i) = E(ỹij|x̃ij), V ar(ỹij|x̃ij,X i,C i) = V ar(ỹij|x̃ij),
Cov(ỹij, ỹit|x̃ij, x̃it,X i,C i) = Cov(ỹij, ỹit|x̃ij, x̃it), and E(vi|x̃ij,X i,C i) = E(vi|x̃ij).
By applying the law of total conditional expectation, we have
∑Ni
E(ỹij|X i,C i) = E [E(ỹij|x̃ij,X i,C i)|X ,C Ti i] = E[x̃ijβ |X i,C i] = qijj′xTij′β.
j′=1
By applying the law of total conditional variance, we have
V ar(ỹij|X i,C i) = V ar (E(ỹij|x̃ij,X i,C i)|X i,C i) + E (V ar(ỹij|x̃ij,X i,C i)|X i,C i)
= V ar ((E(ỹij|x̃ij)|X)i,C i)(+ E (V ar(ỹij|x̃)ij)|X i,C i)
= V ar x̃Tijβ |X i,C i + E σ2v + σ2e |X i,C∑ iNi
= qi (1− qi T 2 2 2jj′ jj′)(xij′β) + σv + σe .
j′=1
By applying the law of total conditional covariance, we have
Cov(ỹij, ỹit|X i,C i) = Cov (E(ỹij|x̃ij, x̃it,X i,C i), E(ỹit|x̃ij, x̃it,X i,C i)|X i,C i)
+ E [Cov(ỹij, ỹit|x̃ij, x̃it,X i,C i)|X i,C i]
= Cov (E(ỹij|x̃ij), E(ỹit|x̃it)|X i,C i)
+ E [Co(v(ỹij, ỹit|x̃ij, x̃it)|)X i,C i[] ]
= Cov x̃Tβ, x̃Tij itβ |X i,C i + E σ2v |X i,C i
= 0 + σ2v
= σ2v ,
94
Cov(ỹij, vi|X i,C i) = E[Cov(ỹij, vi|x̃ij,X i,C i)|X i,C i]
+ Cov (E(ỹij|x̃ij,X i,C i), E(vi|x̃ij,X i,C i)|X i,C i)
= E[Cov(ỹij, vi|x̃ij)|X(i,C i] + Cov (E)(ỹij|x̃ij), E(vi|x̃ij)|X i,C i)
= E[σ2v |X i,C i] + Cov x̃Tijβ, 0|X i,C i
= σ2v .
5.4.4 Proof of (5.15)
Based on the general integrated model, the conditional distribution of (y i, vi)
given X i and C iis  ỹ i 

QiX iβ
 
 Σi 1n σ2i v|X ,C ∼ N , i i  .
vi 0 1
T σ2n v σ
2
i v
It is also not difficult to obtain that V ar(vi|ỹ i,C i) = σ2v − σ21T −1 2v n Σi i 1n σv .i
The conditional mean squared error (CMSE) of θ̂BPi is equal to
[ ]
CMSE(θ̂BP BP 2i ) = E (θ̂(i − θi) |ỹ i,X i,C i) ( [ ])2
= V ar θ̂BPi − θi|ỹ i,X i,(C i + E θ̂
BP
i − θi|ỹ i),X i,C i2
= V ar (θ |ỹ ,X ,C ) + θ̂BPi i i i i − E [θi|ỹ i,X i,C i]
= V ar (θi|ỹ i,X i,C i) + 0
= V ar(x̄iβ + vi|ỹ i,X i,C i)
= V ar(vi|ỹ i,X i,C i)
= σ2v − σ21T −1 2v n Σi i 1n σi v .
95
The mean squared error (MSE) of θ̂BPi is equal to[ ]
MSE(θ̂BPi ) = E (θ̂
BP
{ i[ − θ
2
i) ]}
= E E (θ̂BP[ i − θ
2
i)] |ỹ i,X i,C i
= E [CMSE(θ̂
BP
i ) ]
= E σ2 − σ21T Σ−1v v n i 1n σ2i i v
= σ2v − σ2 Tv1n E(Σ−1i )1 2n σv .i i
5.4.5 Proof of (5.16) and (5.17)
When the mixture model parameter ψ is known, the log-likelihood function of
φ based on observed data {ỹ i,X i,C i, i = 1, . . . ,m} is given by
∑m1 { }
l(φ;y,X,C ) = − ni ln(2π) + ln |Σ T −1i|+ (ỹ i −QiX iβ) Σi (ỹ i −QiX iβ) .2
i=1
Since the first derivative of kijj with respective to the kth element of β , βk, is
equal to
∂ki ∑Nijj
= 2 qi i Tjj′(1− qjj′)(xij′β)x iij′k := djj,k, (5.30)∂βk
j′=1
for k = 1, . . . , p, where xij′k is the kth element of xij′ . Define Di,k as a ni × ni
matrix with diagonal entries di ∂K ijj,k. Then = Di,k, the first derivatives of Σ∂β i =k
K + σ21 1T + σ2i v n Ii ni e n arei
∂Σi ∂K ∂Σ ∂σ
21 1T 2i i v ni n T ∂Σi i ∂σeIn= = D , = = 1 1 , = ii,k n n = In , (5.31)∂βk ∂β ∂σ2 ∂σ2
i i 2 2 i
k v v ∂σe ∂σe
and
∂β
= δk. (5.32)
∂βk
96
where δk is defined as the kth column of identity matrix I p.
By using results from (5.31) and (5.32), the first partial derivatives of the
likelihood function l(φ; ỹ ,X ,C ) with respect to parameters βk, σ
2
v and σ
2
e are given
by:
∂l(φ; ỹ ,X ,C )
∂βk
m {
−1
∑ ∂ ln |Σ Ti | − ∂β= 0 + XT T −1i Qi Σi (ỹ i −QiX iβ)2 ∂β ∂β
i=1 k k
−1 }∂Σ
+ (ỹ −Q X β)T ii i{ i ( (ỹ)i −
∂β
QiX iβ)− (ỹ −Q T −1i iX iβ) Σ QiX i
1 ∑ ∂β
i
k ∂βk
m
− ∂Σi= tr Σ−1 − 2δTXTQTΣ−1i k i i i (ỹ i −QiX iβ)2 ∂β
i=1 k }
− ∂Σi(ỹ i −Qi{X iβ)
TΣ−1 Σ−1i i (ỹ i −QiX iβ)∑ ( ∂β)km1
= − tr Σ−1i D T T T −1i,k − 2δ2 k
X i Qi Σi (ỹ i −QiX iβ)
i=1 }
− (ỹ −Q X β)TΣ−1D Σ−1i i i i i,k i (ỹ i −QiX iβ) ; (5.33)
∂l(φ; ỹ ,X ,C )
∂σ∑2vm { }
−1 ∂ ln |Σ | ∂Σ
−1
i
= 0 + + (ỹ i −QiX iβ)T i (ỹ −Q X β)
2 { ( ∂σ2 ) ∂σ2 i i i∑i=1 v vm }1 ∂Σi ∂Σi
= − tr Σ−1 − (ỹ −Q T −1 −1i iX iβ) Σ Σ (ỹ i −QiX iβ)
2 i 2 i∑{ ( ∂σv ) ∂σ
2 i
i=1 v
m }
= −1 tr Σ−1i 1n 1Tn − (ỹ i −QiX iβ)TΣ−11 T −1i i i n 12 i n
Σ
i i
(ỹ i −QiX iβ) ; (5.34)
i=1
97
∂l(φ; ỹ ,X ,C )
∂σ∑2em { }
−1 ∂ ln |Σi | − T ∂Σ
−1
= 0 + + (ỹ i QiX β)
i (ỹ −Q X
{ ( 2 ) i 2 i i i
β)
2 ∑ ∂σ ∂σi=1 e em }
= −1 ∂Σi ∂Σitr Σ−1i − (ỹ i −QiX iβ)TΣ−1 −12 i Σi (ỹ i −Q2 iX iβ)2 ∑{ ( )∂σi=1 e ∂σem }
= −1 tr Σ−1i − (ỹ i −Q T −1 −1iX iβ) Σi Σi (ỹ i −QiX iβ) . (5.35)2
i=1
From (5.30), we can get
∂di mjj′ ∑,k
= 2 qijj′(1− qijj′)z iij′tzij′k := djj,kt, for integers 1 ≤ k, t ≤ p.∂βt i=1
Define Di,kt as a diagonal matrix with diagonal entries d
i
jj,kt, then
∂Di,k
= Di,kt. (5.36)
∂βt
Again, by using (5.31)-(5.35), and (5.36), we can obtain the second derivatives of
the log-likelihood function l(φ; ỹ ,X ,C ). That is,
∂2l(φ; ỹ ,X ,C )
∂β∑∂σ2k vm { ( )
−1 ∂Σ
−1 T
= tr i
∂β
1 1T + 2 XTQTΣ−1n n i i i 1 1
T Σ−1n n i (ỹ i −QiX iβ)2 ∂β i i i i
i=1 k
∂βk
−1 }
− ∂Σ2(ỹ i −Q{iX β)
T i T −1
i ( 1n 1n Σi (ỹ i∑ ∂β i ikm )
−QiX iβ)
−1 − −1∂Σi= tr Σ −1i Σi 1 1T + 2δTXTn QTΣ−11 1T Σ−1(ỹ −Q X β)2 ∂β i ni k i i i} ni ni i i i ii=1 k
T −1∂Σi+ 2(ỹ −1i −Q{iX iβ) Σi Σi 1 1
T −1
n n Σi (ỹ i −QiX iβ)∑ ( ∂β
i i
k
m
1 )
= − − tr Σ−1i D −1i,kΣi 1 Tn 1n + 2δTkXT T −1 T −12 i i i
Qi Σi 1n 1n Σi (ỹ i −QiX} i i i
β)
i=1
+ 2(ỹ −Q X β)TΣ−1 −1 T −1i i i i Di,kΣi 1n 1n Σi (ỹ i −QiX iβ)i i
98
∂2l(φ; ỹ ,X ,C )
∂βt∑∂βkm { ( )1 ∂ tr Σ−1D ∂Σ−1
= − i i,k − 2δT T T i
2 ∂β k
X i Qi (ỹ i −QiX iβ)
i=1 t
∂βt
∂β ∂βT
+ 2δTXTQTΣ−1Q X + XTQTΣ−1 −1k i i i i i i i i Di,kΣi (ỹ i −QiX iβ)∂βt ∂βt
−1
− ∂Σ(ỹ i −QiX β)T i D Σ−1i i,k i (ỹ i −QiX iβ)∂βt
− − T −1∂Di,k(ỹ −1i QiX iβ) Σi Σi (ỹ i −QiX iβ)∂βt
−1 }
− (ỹ −Q X β)T(Σ−1
∂Σi − − T −1 −1 ∂βi i{ i i Di,k ) (ỹ i QiX iβ) + (ỹ i QiX iβ) Σi Di,kΣ QiX i∑ ∂β it ∂βtm ∂ tr Σ−1D −1
= −1 i i,k − T T T ∂Σ2δ ikX i Qi (ỹ i −QiX iβ)2 ∂βt ∂βi=1 t
T
+ 2δTXTQTΣ−1
∂β ∂β
k i i i QiX + 2 X
T
i i Q
T −1 −1
i Σi Di,kΣi (ỹ i −QiX iβ)∂βt ∂βt
∂Σ−1− 2(ỹ −Q X β)T i D −1i i i i,kΣi (ỹ i −QiX iβ})∂βt
− (ỹ −Q X β)TΣ−1∂Di,k −1i i{ i ( i Σi (ỹ i −QiX iβ)
1 ∑ ∂βtm )− ∂Σ−1i −1∂Di,k T T T −1∂Σi= tr Di,k + Σi + 2δkX −12 ∂β ∂β i Qi Σi Σi (ỹ i −QiX iβ)∂β
i=1 t t t
+ 2δTXTQT −1 T T T −1 −1k i i Σi QiX iδ t + 2δ t X i Qi Σi Di,kΣi (ỹ i −QiX iβ)
− T −1∂Σi+ 2(ỹ Q X β) Σ Σ−1D Σ−1i i i i i i,k i (ỹ i −Q}iX iβ)∂βt
− ∂Di,k(ỹ −Q T −1i i{X iβ) Σi Σ
−1
i (ỹ i −QiX iβ)
1 ∑ ( ∂βtm )
= − tr −Σ−1D Σ−1 −1i i,t i Di,k + Σi Di,kt + 2δTXTQTΣ−1 −1k i i i Di,tΣi (ỹ i −QiX iβ)2
i=1
+ 2δTXTQT −1k i i Σi QiX iδ t + 2δ
T
t X
T
i Q
T −1 −1
i Σi Di,kΣi (ỹ i −QiX iβ)
+ 2(ỹ i −QiX iβ)TΣ−1 −1i Di,tΣi Di,kΣ−1i (ỹ i −}QiX iβ)
− (ỹ i −QiX iβ)TΣ−1i D −1i,ktΣi (ỹ i −QiX iβ)
99
∂2l(φ; ỹ ,X ,C )
∂β 2k∑∂σem { ( −1) T
= −1 ∂Σi ∂βtr + 2 XT T −1 −1i Qi Σi Σi (ỹ −Q X} i i i
β)
2 ∂β
i=1 k
∂βk
− ∂Σ
−1
2(ỹ i −Q{iX β)
T i
i ( Σ
−1
i (ỹ i)−QiX iβ)∂βkm
−1
∑ ∂Σi
= − tr Σ−1 Σ−1 + 2δTXTQTΣ−1 −1i i k i i i Σi (ỹ i −Q} i
X iβ)
2 ∂β
i=1 k
∂Σi
+ 2(ỹ T −1 −1 −1i −Q{iX iβ) Σi Σi Σi (ỹ i −QiX iβ)∑ ( ∂βkm )
= −1 − tr Σ−1D Σ−1 + 2δTXTQTΣ−1Σ−1i i,k i k i i i i (ỹ i −Q} i
X iβ)
2
i=1
+ 2(ỹ T −1 −1 −1i −QiX iβ) Σi Di,kΣi Σi (ỹ i −QiX iβ)
∂2l(φ; ỹ ,X ,C )
∂σ2 2v∑∂σvm { ( ) }∂ tr Σ−11 1T −1
= −1 i ni ni − 2(ỹ −Q X β)T ∂Σi 1 1T Σ−1(ỹ −Q X β)
2 { ( i i i n i i i∂σ2 2 i ni ii=1 v ) ∂σvm
−1
∑
− −1∂Σi= tr Σi Σ−11n 1T2 ∂σ2 i i ni
i=1 v }
∂Σi
+ 2(ỹ i −Q{iX iβ)
TΣ−1 −1 T −1
∑ i
Σi 1n 1n Σi (ỹ i −QiX iβ)
( ∂σ2
i i
v
m
1 − )= − tr −Σ 11 1T Σ−1i n n i 1n 1T2 i i i ni
i=1 }
+ 2(ỹ T −1 T −1 T −1i −QiX iβ) Σi 1n 1i n Σi i 1n 1n Σi (ỹ i −QiX iβ)i i
∂2l(φ; ỹ ,X ,C )
∂σ2v∑∂σ2em { ( ) }1 ∂ tr Σ−1 −1
= − i − 2(ỹ −Q X β)T ∂Σii i i Σ−1i (ỹ{ (2 ) 2 i
−QiX iβ)
2 ∑ ∂σv ∂σi=1 vm }
−1 − −1∂Σi −1 − T −1∂Σi= tr Σi Σ + 2(ỹ i Q −1 −1{ i
X iβ) Σ Σ Σ (ỹ i −QiX iβ)
2 ∑ ( ∂σ
2 i i 2 i i
i=1 v ) ∂σvm }
= −1 − tr Σ−11 1T Σ−1 + 2(ỹ −Q X β)TΣ−1 Ti n n i i i i i 1n 1n Σ−1i Σ−1i (ỹ i −QiXi i i i iβ)2
i=1
100
∂2l(φ; ỹ ,X ,C )
∂σ2e∑∂σ2em { ( ) }1 ∂ tr Σ−1− i − ∂Σ−1= 2(ỹ −Q X β)T i Σ−1(ỹ −Q X β)
2 ∑i=1m { ∂σ(
2 i
e ) i i 2 i i i i∂σe }
−1= − tr Σ−1∂ΣiΣ−1 − T ∂Σii i + 2(ỹ i QiX iβ) Σ−1 −1 −1i Σi Σi (ỹ i −QiX{ 2 2 } i
β)
2 ∑ ( ∂σe) ∂σi=1 em
−1= − tr Σ−1 −1 T −1 −1 −1
2 i
Σi + 2(ỹ i −QiX iβ) Σi Σi Σi (ỹ i −QiX iβ)
i=1
5.4.6 Proof of (5.19)
ind
Since ỹ i|X i,C i ∼ N(QiX iβ,Σi), then it is not difficult to obtain that
[ ]
E(ỹ i −QiZ iβ |X i,C i) = 0, E (ỹ i −QiX iβ)(ỹ −Q X β)Ti i i |X i,C i = Σi. (5.37)
By using basic properties of trace, we can prove that for any ni × ni matrix M =
M (X,C,φ,ψ), which may depends on X and C but not on ỹ,
[ ]
E (ỹ i −Q X β)Ti i M (ỹ i −QiX iβ)|X,C = tr (MΣi) , (5.38)
since
[ ]
E (ỹ i −[Q
T
iX( iβ) M (ỹ i −QiX iβ)|X,C ) ]
= E [tr ((ỹ −Q X β)
T
i i i M (ỹ i −QiX iβ)) |X,C]
= E (tr [M (ỹ i −QiX iβ)(ỹ −Q X β)
T
i i i |X,C])
= tr (E M
T
[(ỹ i −QiX iβ)(ỹ i −QiX iβ) |X,C])
= tr ME (ỹ Ti −QiX iβ)(ỹ i −QiX iβ) |X,C
= tr (MΣi) .
101
By using the fact (5.37) and (5.38), we can obtain the following results:
( )
−∂
2l(φ;y,X,C )
E |X,C
∑{∂βt∂βkm1 ( − − − )= tr −Σ 1 1 1i Di,tΣi Di,k + Σi Di,kt2
i=1
+ 2δTXTQTΣ−1k i i i Di,tΣ
−1
i E[ỹ i −QiX iβ |X,C ]
+ 2δTk[X
TQT −1 T T T −1 −1i i Σi QiX iδ t + 2δ t X i Qi Σi Di,kΣi E[ỹ i −QiX]iβ |X,C ]
+ 2E[ (ỹ −Q X β)
TΣ−1D −1 −1i i i i i,tΣi Di,kΣi (ỹ i −QiX i]β})|X,C
− E (ỹ i{−QiX iβ)
TΣ−1D −1i i,ktΣi (ỹ i −QiX iβ)|X,C
1 ∑m (
= tr −Σ−
)
1D Σ−1D + Σ−1D + 2δTXTQTΣ−1i i,t i i,k i }i,kt2 k i i i
QiX iδ t
i=(1 ) ( )
+ 2 tr Σ−1D Σ−1{i i,t i Di,k − tr Σ
−1
i Di,kt
m
1 ∑ − ( − )}= 2δTkXT T 1 1i Qi Σi QiX iδ t + tr Σi Di,tΣ−1( ) i
Di,k ,
2
i=1
−∂
2l(φ;y,C )
E |X,C
∑∂{β 2k∂σvm1 ( − − )= − tr Σ 1 1 T
2 i
Di,kΣi 1n 1i ni
i=1
+ 2δT T Tk[X i Qi Σ
−11 1T Σ−1i n n E[ỹ −Q X β |X,C ]i i i i i i ]}
+ 2E (ỹ{i −Q X β)
T
i i Σ
−1D Σ−1 T −1i i,k i 1n 1n Σi (ỹ i −QiX iβ)|X,C∑ im1 ( − − )
i ( }
= − tr Σ 1D Σ 11 1T + 2 tr Σ−1D Σ−
)
1 T
i i,k i ni ni i i,k i
1n 1
2 i ni
i=1
m
1 ∑ (
= tr Σ−
)
1
i D Σ
−1 T
i,k
2 i
1n 1i ni
∑i=1m1
= 1T Σ−1 −1
2 ni i
Di,kΣi 1n ,i
i=1
102
( )
−∂
2l(φ;y,X,C )
E |X,C
∑{∂βk∂σm1 (
2
e )
= − tr Σ−1D Σ−1 + 2δTXTQTΣ−1 −1i i,k i k i i i Σi E[ỹ i −QiX i|X,C ]2
i=[1 ]}
+ 2E (ỹ{i −QiX
T −1 −1 −1
iβ) Σi Di,kΣi Σi (ỹ i −QiX iβ)∑m ( ) ( )}
|X,C
1
= − tr Σ−1D Σ−1 + 2 tr Σ−1 −1i i,k2 i i
Di,kΣi
1 ∑i=1m (
= tr Σ−1D −
)
1
( i i,k
Σi ,2
i=1 )
2
−∂ l(φ;y,X,C )E { |X,C
1 ∑ ∂σ
2 2
v∂σ(vm )
= − tr Σ−1i 1n 1Tn Σ−1i 1n 1T2 i i i ni
i=[1 ]}
+ 2E (ỹ{i −Q X β)
TΣ−1i i i 1n 1
T −1 T −1
∑ i n
Σi 1n 1n Σi (ỹ i −QiX iβ)|X,Ci i
m ( ) i ( )}1
= − tr Σ−1i 1n 1Tn Σ−1 T −1 Ti 1n 1n + 2 tr Σi 1n 1n Σ−1 T2 i i i i i i i
1n 1i ni
1 ∑i=1m ( − )= tr Σ 11 1Ti n n Σ−11 1T2 i i i ni ni∑i=1m1
= 1T Σ−1 T −1
(2 ni i
1n 1i n Σi i 1n ,i
i=1 )
2
−∂ l(φ;y,X,C )E { |X,C∑ ∂σ2∂σ2v em1 (
= − tr Σ−11 T −
)
1
i n 1 Σ2 i ni i
i=[1 ]}
+ 2E (ỹ{i −QiX iβ)
TΣ−1i 1n 1
T
n Σ
−1 Σ−1i i (ỹ −Q X β)|X,C∑ i i
i i i
m
1 ( }
= − tr Σ−
) ( )
11 1T −1 −1 T −1
2 i
ni n Σi i + 2 tr Σi 1n 1i n Σi i
1 ∑i=1m ( )
= tr Σ−1i 1
T −1
n 1 Σ
2 i ni i∑i=1m1
= 1T −1 −1
2 n
Σi Σi i 1n ,i
i=1
103
( )
−∂
2l(φ;y,X,C )
E |X,C
∑{∂σ2e∂σm1 (
2
e )
= − tr Σ−1Σ−1
2 i i
i=[1 ]}
+ 2E (ỹ −Q X T −1 −1 −1{i i iβ) Σi Σi Σi (ỹ i −QiX iβ)|X,Cm
1 ∑ ( − − ) ( )}= − tr Σ 1Σ 1 + 2 tr Σ−1Σ−1
2 i i i i
1 ∑i=1m ( )
= tr Σ−1Σ−1
2 i i
.
i=1
5.4.7 Proof of (5.20) and (5.21)
When ψ is known, the log-likelihood function of φ based on data {ỹ i,C i, i =
1, . . . ,m} is given by
m
1 ∑{ }
l(φ; ỹ ,X ,C ) = − ni ln(2π) + ln |Σ |+ (ỹ −Q X β)T −1i i i i Σi (ỹ i −QiX iβ) .
2
i=1
Assuming that the matrix Σi = K + σ
2
i v1n 1
T
n + σ
2
i i e
In is fixed in β , we havei
∂Σi ∂Σi T ∂Σi= 0, = 1n 1n , = In ,∂β ∂σ2 i i 2 iv ∂σe
and
∂Σ−1i = −Σ−1∂Σi −1 −1 T −1
∂σ2 (i Σ = −Σ∂σ2 i) i 1n 1n Σi i i ,v v
∂ ln |Σi| −1∂Σi ( − )= tr Σi = tr Σ 11n 1T ,∂σ2v ∂σ2 i i niv
∂Σ−1i = − ∂ΣiΣ−1(i Σ
−1 = −Σ−1Σ−1,
∂σ2 ∂σ2 i i ie e )
∂ ln |Σi| − ∂Σi ( − )= tr Σ 1 = tr Σ 1i i .∂σ2 2e ∂σe
Using the above results, the first derivative of l(φ; ỹ ,X ,C ) with respect to
104
parameters β , σ2 and σ2v v are given by
∂l(φ; ỹ ,X ,C )
∂β
1 ∑m { }
= − −XTQTi i Σ−1 T −1i (ỹ i −QiX iβ)− (ỹ i −QiX iβ) Σi QiX i
∑2 i=1m
= XTQTi i Σ
−1
i (ỹ i −QiX iβ),
i=1
∂l(φ; ỹ ,X ,C )
∂σ∑2vm { }
−1 ∂ ln |Σi| ∂Σ
−1
= + (ỹ i −QiX iβ)T i (ỹ −Q X β)
2 ∂σ2 ∂σ2
i i i
∑i=1m1 { ( v ) v }
= − tr Σ−1i 1 Tn 1n − (ỹ i −QiX β)T −1i Σi 1n 1T −12 i i i n
Σ (ỹ −Q X β)
i i i i i
1 ∑i=1m {
= − 1T −1 T −
}
1 T −1
n Σi 1n − (ỹ i −Q Xi i i iβ) Σi 1n 1n Σi (ỹ −Q Xi i i i iβ) ,2
i=1
∂l(φ; ỹ ,X ,C )
∂σ
1 ∑
2
e
m { }
− ∂ ln |Σ
−1
i| ∂Σ
= + (ỹ i −QiX iβ)T i (ỹ i −QiX iβ)
2 2 2∑ ∂σi=1 { ( em1 − )
∂σe }
= − tr Σ 1i − (ỹ i −QiX iβ)TΣ−1Σ−12 i i
(ỹ i −QiX iβ) .
i=1
The second derivatives of l(φ; ỹ ,X ,C ) with respect to parameters β , σ2v and
σ2v are given by
∂2
m
l(φ; ỹ ,X ,C ) ∑
= − XTQTΣ−1i i i QiX i,∂β∂βT ∑i=12 m∂ l(φ; ỹ ,X ,C ) T T ∂Σ−1= X Q i (ỹ i −QiX iβ)
∂β∂σ2 i iv ∑ ∂σ
2
i=1 v
m
= − XTQTΣ−11 1T Σ−1i i i n n i (ỹ i −Qi i iX iβ),
i=1
105
∂2l(φ; ỹ ,X ,C )
∑∂β∂σ2em ∂Σ−1
= XT T ii Qi (ỹ i −QiX2 iβ)
i=1∑ ∂σem
= − XTQTΣ−1i i i Σ−1i (ỹ i −QiX iβ),
i=1
∂2l(φ; ỹ ,X ,C )
∂σ2v∂σ
2
v
m {
−1
∑ ∂Σ−1T i − − T ∂Σ−1= 1n 1n (ỹ i QiX iβ) i 1 1T Σ−1n (ỹ −Q X β)2 i ∂σ2 i ∂σ2} i ni i i i ii=1 v v
− ∂Σ
−1
(ỹ i −Qi{X iβ)
TΣ−1 T ii 1n 1n (ỹ i −Q X β)∑ i i
i i
∂σ2v
m
−1= − 1T Σ−1n i 1n 1T Σ−112 i i ni i ni
i=1
+ (ỹ −Q X β)TΣ−1 T −1i i i i 1n 1n Σi 1 Tn 1n Σ−1i i i i i (ỹ i −QiX iβ)}
+ (ỹ T −1 T −1 T −1i −Qi{X iβ) Σi 1n 1∑ i n
Σi 1n 1n Σi (ỹ i −QiX iβ)i i i
m
= −1 − 1T Σ−11 1T Σ−1ni i n 12 i ni i ni
i=1 }
+ 2(ỹ −Q X β)TΣ−11 1T −1i i i i n n Σi 1 1T −1i i ni n Σi i (ỹ i −QiX iβ) ,
∂2l(φ; ỹ ,X ,C )
∂σ2v∑∂σ2em {1 ∂Σ−1 −1
= − 1T in 1n − −
∂Σ
(ỹ i QiX
T i T −1
iβ) 1n 1 Σ (ỹ i −QiX iβ)
2 i ∂σ2 i ∂σ2 i ni i
i=1 e e
−1 }
− (ỹ −Q X β)TΣ−1 T ∂Σii i{ i i 1n 1n (ỹ i −QiXi i 2 iβ)∑ ∂σem1
= − − 1T −1 −1 T −1 −1n Σi Σi 1n + (ỹ i −QiX iβ) Σi Σi 1 1T −1n n Σi (ỹ i −QiXi i } i i i
β)
2
i=1
+ (ỹ i −Q X β)TΣ−11 1T Σ−1 −1i i n Σ (ỹ −Q X β)
∑{ i i ni i i
i i i
m }
−1= − 1T Σ−1n i Σ−1i 1n + 2(ỹ i −QiX β)TΣ−1Σ−11 1T −1i i i i i n2 i n
Σ
i i
(ỹ i −QiX iβ) ,
i=1
106
∂2l(φ; ỹ ,X ,C )
∂σ2e∑∂σ2em { ( )1 ∂Σ−1 ∂Σ−1
= − tr i − (ỹ T ii −QiX iβ) Σ−1(ỹ2 } 2 i i
−QiX iβ)
2 ∂σe ∂σi=1 e
−1
− − ∂Σ(ỹ i Q X β)TΣ−1 ii{ i i ∂σ2em
−1
∑ ( )
= − tr Σ−1Σ−1i i + (ỹ −Q X β)TΣ−1Σ−1 −1i i i i i Σi (ỹ i −QiX iβ)2
i=1 }
+ (ỹ i −Qi{X iβ)
TΣ−1i Σ
−1Σ−1i i (ỹ i −QiX iβ)∑m ( ) }
= −1 − tr Σ−1Σ−1 + 2(ỹ −Q T −1 −1 −1i i i iX iβ) Σi Σi Σi (ỹ i −QiX iβ) .2
i=1
5.4.8 Proof of (5.23)
| i∼ndSince ỹ i X i,C i N(QiX iβ,Σi), then for any ni × ni matrix M which does
not depend on ỹ, we have
[ ]
E(ỹ i −QiX iβ |X i,C i) = 0n , E (ỹ i −QiX iβ)TM (ỹ i −QiX iβ)|X i,C i = tr (MΣi i) ,
where 0n is a ni × 1 vector of zeros.i
Using the above results, we can get
[ ]
∂2
m
− l(φ; ỹ ,C )
∑
E |X T T −1
[ T i
,C i = X Q Σ
∂β∂β ] i i i
QiX i,
i=1
m
−∂
2l(φ; ỹ ,C ) ∑
E |X ,C = XTQTΣ−11 1T Σ−1i i i i i n n i E [ỹ i −QiX iβ |X i,C[ 2 ] i i i
] = 0p,
∂β∂σv i=1
∂2
m
− l(φ; ỹ ,C )
∑
E |X i,C i = XTQTΣ−1Σ−1E [ỹ i −QiX iβ |X i,C i] = 0p,
∂β∂σ2 i i i ie i=1
107
[ ]
2
−∂ l(φ; ỹ ,C )E |X i,C i
∑∂σ{2v∂σ2vm1
= − 1Tn Σ−1 T −12 i i
1n 1i n Σi 1i ni
i=[1 ]}
+ 2E (ỹ −Q X β)TΣ−1{i i i i 1n 1
T Σ−11 1T −1
i ni i ni n
Σ (ỹ
i i i
−QiX iβ)|X i
∑m ( )}
,C i ,
1
= − 1T Σ−11 1T −1 −1 T −1 T
{ ni i ni n
Σi 1n + 2 tr Σi i i 1n 1 Σ2 i ni i
1n 1i ni
∑i=1m }1
= − 1T −1 T −1 T −1 T −1
2 n
Σi 1 1i ni n Σi i 1n + 21i n Σi 1n 1i i n Σ 1i i ni
∑i=1m1
= 1T Σ−11 1T Σ−11 ,
[ ni i n2 i ni i nii=1 ]
−∂
2l(φ; ỹ ,C )
E |X i,C i
∑∂σ{2 2v∂σem1
= − 1T Σ−1Σ−1
2 ni i i
1ni
i=[1 ]}
+ 2E (ỹ −Q X β)TΣ−1Σ−1{i i i i i 1
T −1
n 1n Σi (ỹ i −Q Xi i i iβ)|X i,C i ,∑m1 − − ( )}= − 1Tn Σ 1i Σ 1i 1n + 2 tr Σ−1Σ−1 T2 { i i i i
1n 1}i ni
1 ∑i=1m
= − 1T Σ−1n i Σ−11 + 21T Σ−1Σ−1i n n i i 12 i i i ni
i=1
m
1 ∑
= 1T Σ−1Σ−1n i i 1 ,[ n2 i ii=1 ]
2
−∂ l(φ; ỹ ,C )E |X ,C
∑∂σ2∂σ2
i i
e e
m {
1 ( )
= − tr Σ−1i Σ−12 i
i=[1 }
+ 2E (ỹ −Q X β)TΣ−1Σ−1Σ−
]
1
{i i i i i i (ỹ i −QiX}iβ)|X i,C i∑m1 (
= − tr Σ−1Σ−
) ( )
1
i i + 2 tr Σ
−1Σ−1
2 i i
i=1
m
1 ∑ ( )
= tr Σ−1Σ−1 .
2 i i
i=1
108
Chapter 6: Monte Carlo Simulation Study
6.1 Introduction
In this section, we design a Monte Carlo simulation study to compare finite
sample performances of different estimators of regression coefficient β in simple
linear and logistic models in the presence of linkage errors. Four different estimators
are evaluated: naive estimator β̂N that ignores linkage errors, proposed estimator
β̂F that incorporates linkage errors, and two of its computational simpler versions
β̂M and β̂M2. These estimators can be derived by solving a set of corresponding
estimating equations ( See Table 6.1 ), where QMi and Q
M2
i are simplified versions
of design matrix Qi = (q
i ni,Ni
jj′)j=1,j′=1, i = 1, . . . G. All the entries except the largest
one are set to zero on each row in QMi , while all the entries except the first two
largest are set to zero on each row inQM2i . In our simulation, we assume that linkage
errors only exist within blocks, but not across blocks. The conditional independence
assumption is made. That is, the agreement on one matching field is independent
from that on others.
Simulation results for both equal and unequal scenarios are presented. Files
Fy and Fx are of the same size (i.e., the number of units) within each block in the
equal scenario while they are different in the unequal scenario. However, even in
109
the equal scenario, we allow block sizes to vary across blocks. In each scenario, two
sets of simulation conditions are considered in order to compare the performances
of different estimators under different levels of linkage errors. Results for these
two different scenarios are shown in section 4.1 and 4.2, respectively. We consider
G = 100 blocks and R = 100 independent simulation replications throughout each
section.
In section 6.4, we conduct another Monte Carlo simulation to investigate the
difference between the standard and simplified jackknife methods in estimating the
variance of the estimators of β. The simulation is performed under two different
scenarios: the equal scenario and the unequal scenario. In the equal scenario, sim-
ulation is done for a simple logistic model. In the unequal scenario, simulation is
done for a simple linear model.
Table 6.1: Estimating equations used for four different estimators of regression
coefficient β in simple linear and logistic models. Notation is followed from Chapter
2.
Est. L∑inear Model L∑ogistic Model
β̂N ∑G Ti=1X i {ỹ i −X β} = 0 ∑G Ti i=1X i (ỹ i − g(X i, β)) = 0
β̂ ∑GF i=1(Q X T G T∑ i i
) (ỹ i −QiX iβ) = 0 i=1(QiX i) (ỹ i −Qig(X i, β)) = 0
β̂ G
∑ ( )
(QM TM i=1 i X i) (ỹ
M G M
i −Qi X iβ) = 0 ∑i=1(Qi X i)T (ỹ Mi −Qi g(X i, β) = 0
β̂ G
)
(QM2X )TM2 i=1 i i (ỹ i −QM2i X iβ) = 0
G M2 T M2
i=1(Qi X i) ỹ i −Qi g(X i, β) = 0
exp(x β)
Note: g(X , β) = col ij . Q = (qi )ni,Nii 1≤j≤Ni 1+exp(x β) i jj′ij j=1,j′=1, i = 1, . . . G is the design
matrix, and QM and QM2i i are its simplified versions. All the entries except the
largest one are set to zero on each row in QMi , while all the entries except the first
two largest are set to zero on each row in QM2i .
110
6.2 The Equal Scenario
Simulation Conditions: The number of records in each block i, ni, across
different simulation rep∑lications varies from 10 to 40 in case 1, and from 20 to 40 in
case 2. Then there are G 2 2i=1 ni potential links in total with ni potential links in block
i ranging from 100 to 1600 in case 1 and from 400 to 1600 in case 2. The number of
matching fields, K, across different simulation replications varies between 8 and 10
in case 1, and between 6 and 10 in case 2. Across different replications, probability
of agreement on matching field k among matches, mk, and among mismatches, uk,
take values from interval (0.55, 0.95) and (0.10, 0.50), respectively, in case 1, whereas
they take values from interval (0.55, 0.85) and (0.20, 0.50) in case 2. In general,
linkage errors are less likely to occur under case 1 than under case 2 when combining
files Fy and Fx, since it has smaller block sizes, more matching fields, and larger
probabilities of agreement among matches and smaller probabilities of agreement
among mismatches. Hence, we would expect to have better estimates in case 1 than
those in case 2. A summary of simulation conditions for case 1 and case 2 is shown
in Table 6.2.
Simulation Steps: (1) Data Generation: N values of x in Fx and ỹ in
Fy are generated based on the selected model and simulated regression coefficient
β. A comparison vector c can be generated for each record pair based on their
true matching status, the number of matching fields K, probabilities of agreement
on matching fields among matches {mk, k = 1, . . . , K}, and among mismatches
{uk, k = 1, . . . , K}. Note that only the records within the same blocks are compared.
111
Table 6.2: Simulation conditions for case 1 and case 2 under the equal scenario.
Symbol Description Case 1 Case 2
lower upper lower upper
limit limit limit limit
G number of blocks 100 100
R number of simulations 100 100
x univariate covariate x ∼ N(0, 1) x ∼ N(0, 1)
β regression coefficient 0 1 0 1
σ2e regression variance σ
2 = 2 2 2e
ni size of block i 10 ∑1− β σe = 1− β40 20 ∑ 40
N size of file Fy and Fx N =
G
i=1 n
G
i N = i=1 ni
K number of matching fields 8 12 6 10
mk probability of agreement on 0.55 0.95 0.55 0.85
field k for a match
uk probability of agreement on 0.10 0.50 0.20 0.50
field k for a mismatch
Note: the condition for σ2 is used for simulation of linear regression on linked data
only.
The observed data for the following linkage step and statistical analysis includes X
in Fx, ỹ in Fy, and comparison vector C .
(2) Record Linkage: A two class mixture model is fitted to observed com-
parison vectors c using the expectation maximization (EM) algorithm. All the
parameters in the mixture model are estimated. These parameters consist of
weights of class, π, probabilities of agreement on matching fields among matches,
{mk, k = 1, . . . , K}, and among mismatches, {uk, k = 1, . . . , K}. The probability of
a record pair (j, j′) within block i being a link, qijj′ , is the same as the probability of
its corresponding vector cijj′ belonging to class M . It can be estimated by applying
Bayes’ Theorem and can be used to partition the record pairs into designated links
and non-links.
(3) Parameter Estimation: For the naive estimator, it is essential to determine
112
designated links. The designate link to a record j within block i in Fy is a record
j′ within the same block in F ix whose corresponding linkage probability qjj′ is the
largest among {qijt, t = 1, . . . , ni}. In our case, it is possible a record j′ in Fy is
linked to two or more records in Fx since one-to-one assignment is not enforced. For
our proposed estimators, the design matrix Q Mi, Qi and Q
M2
i for block i need to
be constructed based on the estimated linkage probabilities {qijj′ , i = 1, ..., G, j =
1, ..., ni, j = 1, ..., ni}. Qi = (qi )ni,ni M M2jj′ j=1,j′=1, i = 1, . . . m, Qi and Qi are simplified
versions of Qi. All the entries except the largest one are set to zero on each row in
QMi , while all the entries except the first two largest are set to zero on each row in
QM2i . Then the four estimators β̂N , β̂F , β̂M and β̂M2 can be estimated by solving
the estimating equations shown in Table 6.1.
(4)Variance Estimation: Jackknife is used to estimate bias, variance and mean
squared error of each estimator of β. Jackknife replicates are generated by leaving
out data of one block from the two files at a time. Hence, there are G = 100 jackknife
replicate in total. For each jackknife replicate, step 2 and step 3 are performed and
estimates of β are re-evaluated. The jackknife estimates of the bias, variance and
mean squared error of an estimator can be obtained by aggregating G replicate
estimates of β. And a 95% confidence interval can be obtained for each estimate of
β.
Step (1) to (4) are performed for R = 100 simulation runs.
Performance Evaluation: The performance of the four estimators can be
evaluated by the average absolute deviation (AAD) and average squared deviation
(ASD) over all the simulation runs. The formulas for AAD and ASD of an estimator
113
β̂ are shown below. ∑R
r=1 |β̂(r) − β|AAD(β̂) = ∑ ,RR (β̂(r) − β)2
ASD(β̂) = r=1 ,
R
where β̂(r) is value of β̂ calculated based on simulation r. We can also measures
improvement of an estimator β̂ over β̂N with respect to AAD and ASD by relative
percent improvement (RPI). The formulas are shown below.
AAD(β̂N)− AAD(β̂)
RPIAAD(β̂) = × 100%,
AAD(β̂N)
ASD(β̂N)− ASD(β̂)
RPIASD(β̂) = × 100.%
ASD(β̂N)
To learn more about the properties of these estimators, Monte Carlo estimates
of sampling mean, bias, relative bias, variance, relative variance, mean square error,
relative mean square error of each estimator are obtained. Suppose R independent
replicates are generated, and β̂(r) is the estimate of β computed based on replicate
r, r = 1, . . . R. Then the Monte Carlo estimate of sampling mean, variance, and
mean square error of β̂ are given by
∑R1 ∑R
Ê(β̂) = β̂(r)
1
,( ) B̂ias(β̂) = β̂
(r) − β,
R R
r=1 r=1
R R 2 R
1 ∑ 1 ∑ 1 ∑
V̂ (β̂) = β̂(r) − β̂(r) , M̂SE(β̂) = (β̂(r) − β)2.
R− 1 R R
r=1 r=1 r=1
And the Monte Carlo standard deviations of the estimated mean, bias, mean square
error are given by
sd(β̂)
SE(Ê(β̂)) = SE((B̂ias(β̂)))= √ ,R
sd (β̂ − β)2
SE(M̂SE(β̂)) = √ .
R
114
6.2.1 Linear regression with linked data
In each simulation run, values of a scalar independent variable X are randomly
and independently generated from N(0, 1), and the corresponding values of Y are
given by
yij = xijβ + e, i = 1, . . . , G, j = 1, . . . , ni.
where the regression coefficient β is randomly selected from a uniform distribution
in [0, 1], and the random errors  are randomly and independently sampled from
N(0, σ2e) with σ
2 = 1− β2e .
Scatter plots of β̂N , β̂F , β̂M , β̂M2 estimates versus true values of β are shown in
Fig 6.1. The true values of β is on the x-axis and the estimated value of β is on the
y-axis. A 45 degree straight line is plotted in red, and a fitted straight line is plotted
in blue. If an estimator performs well, the data points should gather closely around
red line and the blue line should be close to the red line. Based on the results, we can
see that naive estimator β̂N usually underestimate β under both of the two cases.
This phenomenon is even more obvious under case 2. This is because the linkage
errors in the linked data weakened the correlation between y and x, which introduce
a bias toward zero when estimating the slope of the regression line. All of our
proposed estimators correct this bias, with full estimator and max2 estimator being
the most efficient, and max estimator being the least efficient. The max estimator
β̂M seems to overestimate β a little bit when the true value of β increases, but it
still behaves much better than β̂N especially in Case 2. In general, our proposed
estimators behave better than the naive estimator based the visual results. This is
115
Figure 6.1: Simulation results for a simple linear regression under case
1 and case 2 in the equal scenario: Scatter plots of naive, full, max and
max2 estimates versus true values of regression coefficient β. Diagonal
lines with slop 1 are plotted in red. Fitted lines are plotted in blue.
probably because they take account of the linkage errors in the linked data.
Fig 6.2 shows heat maps of 100 absolute deviations and squared deviations
of each estimators from true values of β in a linear model under two cases. The
darker the color is, the smaller the absolute deviation is. We can clearly see that our
proposed estimators performs much better than the naive estimator, especially in
case 2. Table 6.3 shows the average absolute deviations (AAD) and average squared
deviations (ASD) of our proposed estimators for β, as well as the relative percent
improvement (RPI) over the naive estimator under Case 1 and Case 2 in the equal
scenario. Values of AAD and ASD are shown in black and values of RPI are shown in
blue. Under both cases, β̂N has the smallest values of AAD and ASD among the four
estimators, implying it performs the worst in estimating the regression coefficient in
116
Figure 6.2: Simulation results for linear regression under case 1 and case
2 in the equal scenario: Heat map of absolute deviations (top 2) and
squared deviations (bottom 2) for estimates of regression coefficient β.
a simple linear model; β̂M2 performs better, but not as well as β̂F and β̂M2. We can
also see that values of AAD and ASD increase under Case 2 when compared to Case
1. This is as expected because Case 2 has more difficult simulation conditions (less
matching fields, larger block sizes, small probabilities of agreement among matches
and larger probabilities of agreement among mismatches) and then linkage errors
are more likely to occur. However, our proposed estimators improved more over
the naive estimator in Case 2 then in Case 1, indicated by the larger values of RPI
under Case 2. It shows that our proposed estimator would be especially useful when
linkage errors are more likely to occur.
117
Table 6.3: Simulation results for linear regression under case 1 and case 2 in the
equal scenario: Average absolute deviations (AAD) and average squared deviations
(ASD) of naive, full, max and max2 estimators of regression coefficient β. The
percent relative improvement (PRI) of the proposed estimators over naive estimator
is shown in blue.
Estimator Case 1 Case 2
AAD ASD AAD ASD
β̂N 0.0601 0.0070 0.2211 0.0670
β̂F 0.0166 0.0005 0.0286 0.0015
72.43% 92.33% 87.08% 97.82%
β̂M 0.0435 0.0027 0.0664 0.0060
27.63% 61.92% 69.95% 91.00%
β̂M2 0.0176 0.0006 0.0308 0.0017
70.65% 91.85% 86.05% 97.48%
6.2.2 Logistic regression with linked data
In this section, we want to compare the performances of different estimators of
regression coefficient β in a simple logistic model. Simulation for logistic regression
basically follows the same steps as those for linear regression. The only difference
is in the generation of values of x and y. In each simulation run, values of a scalar
independent variable x are randomly and independently selected from N(0, 1), and
the corresponding values of y are given by
exp(βxij)
P (yij = 1|xij) = g(xij) = , i = 1, . . . ,m, j = 1, . . . , ni,
1 + exp(βxij)
where the regression coefficient β is randomly selected from a uniform distribution
in [0, 1].
Fig 6.3 shows scatter plots of naive, full, max and max2 estimates and true
values of β in a simple logistic regression on 100 simulation runs under Case 1
118
and Case 2 in the equal scenario. Heat map for absolute deviations and squared
deviations for each estimator on 100 simulation runs is in Fig 6.4, and values of
average absolute deviations (AAD) and average squared deviations (ASD) of our
proposed estimators and their relative percent improvement (RPI) are displayed in
Table 6.4.
Results shown in Fig 6.3, Fig 6.4 and Table 6.4 for logistic regression on
linked data are similar to those for linear regression. Again, a bias toward zero is
introduced to the naive estimator. The proposed estimators correct this bias, with
max2 estimator and full estimator being the most efficient, and max estimator being
the least efficient. The increased relative percent improvements from Case 1 to Case
2 implied again that our proposed estimators improve more over the naive estimator
when the linkage error are more likely to occur.
Table 6.4: Simulation results for logistic regression under case 1 and case 2 in the
equal scenario: Average absolute deviations (AAD) and average squared deviations
(ASD) of naive, full, max and max2 estimators of regression coefficient β. The
percent relative improvement (PRI) of the proposed estimator over naive estimator
is shown in blue.
Estimator Case 1 Case 2
AAD ASD AAD ASD
β̂N 0.0811 0.0112 0.2320 0.0811
β̂F 0.0527 0.0045 0.0755 0.0098
35.01% 59.78% 66.61% 87.90%
β̂M 0.0681 0.0080 0.1215 0.0269
16.02% 28.60% 47.65% 66.78%
β̂M2 0.0517 0.0043 0.0796 0.0106
36.24% 61.28% 65.70% 86.99%
In order to further evaluate the performances of these four estimators, another
set of 100 simulation runs for logistic regression under case 1 in equal scenario is
119
Figure 6.3: Simulation results for logistic regression under case 1 and
case 2 in the equal scenario: Scatter plot of naive, full, max and max2
estimates versus true values of regression coefficient β. Diagonal lines
with slop 1 are plotted in red. Fitted lines are plotted in blue.
performed with β fixed at 0.5. Box plot of deviations and relative deviations of
different estimators from the true value of β is shown in Fig 6.5. Table 6.5 gives
the Monte Carlo estimate of bias, relative bias, mean square error (MSE), relative
mean square error, length and coverage of nominal 95% confidence intervals of β
for each method. The standard errors of these estimates are shown in blue. The
negative values of bias and relative bias of naive estimator implies it underestimate
values of β, and the other three estimators correct this bias, with max2 estimator
and full estimator being the most efficient. The correctness of bias and relative bias
also lead to the decrease of mean square error and relative mean square error. In
terms of mean square error and relative mean square error, the max2 estimator β̂M2
performs the best, followed closely by the full estimator β̂F . The relative efficiency
120
Figure 6.4: Simulation results for logistic regression under case 1 and
case 2 in the equal scenario: Heat map for absolute deviations (top 2)
and squared deviations (bottom 2) of estimates of regression coefficient
β.
of each proposed estimator to the naive estimator with respect to mean square error
is given in Table 6.6. We can also see that the coverage rates of confidence intervals
produced by max2 and full estimators and their jackknife variances are very close to
their desired nominal level, while those produced by naive estimator is lower than
the desired nominal level.
6.3 The Unequal Scenario
In this section, we attempt to compare performances of different estimators
under the unequal scenario where the number of records in the same blocks are
different in Fx and Fy. Our parameter of interest is the regression coefficient β in
a simple linear model. Two sets of simulation conditions are considered. They are
similar to those used for the equal scenario, but slightly different. The number of
observations of y in block i of file Fy, ni, is different from the number of observations
of x in the same block of file Fx, Ni, i = 1, . . . ,m. ni ranges from 10 to 20 under
121
Figure 6.5: Simulation results for logistic regression under case 1 in
the unequal scenario: Box plot of deviations and relative deviations of
different estimators from the true value of β over 100 simulation runs.
Figure 6.6: Simulation results for logistic regression under case 1 in the
equal scenario: Plot of Monte Carlo estimate of bias, relative bias, vari-
ance, relative variance, mean square error (MSE), relative mean squared
error(RMSE), length and coverage of nominal 95% confidence intervals
of regression coefficient β by different methods over 100 simulation runs.
Value of β is set to 0.5 for each simulation.
122
Table 6.5: Simulation results for logistic regression under case 1 in the equal scenario:
Monte Carlo estimate of bias, relative bias (R.Bias), variance, relative variance
(R.Var), mean square error (MSE), relative mean squared error(R.MSE), length and
coverage rate (C.R.) of nominal 95% confidence intervals of regression coefficient β
by different methods over 100 simulation runs. Value of β is set to 0.5 for each
simulation. The corresponding estimated standard deviation is show in blue.
Bias R.Bias Var R.Var MSE R.MSE Length C.R.
Naive -0.0731 -0.1462 0.0044 0.0175 0.0097 0.0387 0.1869 63%
0.0066 0.0132 0.0011 0.0045 0.0022 0.0485
Full -0.0096 -0.0192 0.0046 0.0185 0.0047 0.0187 0.2429 93%
0.0068 0.0136 0.0007 0.0027 0.0048 0.0256
Max 0.0430 0.0859 0.0042 0.0167 0.0060 0.0239 0.2388 90%
0.0065 0.0129 0.0009 0.0035 0.0051 0.0302
Max2 -0.0007 -0.0014 0.0037 0.0148 0.0037 0.0147 0.2259 95%
0.0061 0.0122 0.0005 0.0019 0.0034 0.0219
Table 6.6: Simulation results for logistic regression under case 1 in the equal scenario:
Relative efficiency (RE) of proposed estimators to naive estimator with respect to
mean square error.
Naive Full Max Max2
MSE 0.0097 0.0047 0.0060 0.0037
RE 0.4821 0.6179 0.3798
Case 1 and from 20 to 30 under Case 2. The ratio of ni and Ni is denoted by ri,
which varies from 1.5 to 3, and Ni, is set to Ni = bniric. With this setup, linkage
errors would be more likely to occur compared to the equal size scenario. For the
details of the simulation conditions, see Table 6.7.
6.3.1 Linear regression with linked data
Fig 6.7 displays scatter plots of naive, full, max and max2 estimates versus the
true values of β in a simple linear model under Case 1 and Case 2 in the unequal
scenario. Similar to the results for the equal scenario, the naive estimator underes-
timate the regression coefficient, indicated by the obvious discrepancy between the
123
Table 6.7: Simulation conditions for Case 1 and Case 2 under the unequal scenario.
Symbol Description Case 1 Case 2
lower upper lower upper
limit limit limit limit
G number of blocks 100 100
R number of simulations 100 100
x univariate covariate x ∼ N(0, 1) x ∼ N(0, 1)
β regression coefficient 0 1 0 1
σ2 regression variance σ2e e = 1− β2 σ2e = 1− β2
ni size of block i in Fy 10 20 20 30
ri ratio of sizes of block i in Fy 1.5 3 1.5 3
and Fx
Ni size of block i in Fx Ni = bniric Ni = bniric
K number of matching fields 8 12 6 10
mk probability of agreement on 0.55 0.95 0.55 0.85
field k for a match
uk probability of agreement on 0.10 0.50 0.20 0.50
field k for a mismatch
red line and the blue line. Our proposed estimators, especially the full estimator,
correct the bias. The fitted blue line for the full estimator almost coincide with the
diagonal red line. Fig 6.8 shows the heat map of absolute deviations and squared
deviations of each estimator over the 100 simulation runs. The simulation runs are
sorted in a decreasing order of values of full estimator. The darker the color is,
the smaller the absolute deviation is. Table 6.8 gives the average absolute devia-
tions (AAD) and average squared deviations (ASD) of our proposed estimator and
its relative percent improvement over the naive estimator. Results show that our
proposed estimators, especially full estimator and max2 estimator, performs better
than the naive estimator, especially in the case where linkage errors are more likely
to occur.
In order to further evaluate the performances of these four estimators, another
124
Figure 6.7: Simulation results for linear regression under case 1 and case
2 in the unequal scenario: Scatter plot of naive, full, max and max2
estimates versus true values of regression coefficient β in a simple linear
model on 100 simulation runs. Diagonal lines with slop 1 are plotted in
red. Fitted lines are plotted in blue.
125
Figure 6.8: Simulation results for linear regression under case 1 and
case 2 in the unequal scenario: Heat map of absolute deviations and
squared deviations for estimates of regression coefficient β in a simple
linear model on 100 simulation runs.
set of 100 simulation runs for linear regression under case 1 of unequal scenario is
performed with β fixed at 0.5. Box plot of deviations and relative deviations of
different estimators from the true value of β is shown in Fig 6.9. Table 6.9 gives the
Monte Carlo estimate of bias, relative bias, mean square error (MSE), relative mean
square error(RMSE), length and coverage of nominal 95% confidence intervals of β
for each method. The standard errors of these estimates are shown in blue. The
negative values of bias and relative bias of naive estimator implies it underestimate
values of β, and the other three estimators correct this bias, with full estimator being
the best. Our proposed estimators also efficiently decreased the variance by about
50% when compared to the naive estimator. We can also see that the coverage rate
of confidence intervals produced by full estimators and their jackknife variances is
126
Table 6.8: Simulation results for linear regression under case 1 and case 2 in the
unequal scenario: Average absolute deviations (AAD) and average squared devia-
tions (ASD) of naive, full, max and max2 estimators of regression coefficient β in
a simple linear model over 100 simulation runs. The percent relative improvement
(PRI) of the proposed estimator over naive estimator is shown in blue.
Estimator Case 1 Case 2
AAD ASD AAD ASD
β̂N 0.0573 0.0061 0.2780 0.1056
β̂F 0.0192 0.0006 0.0366 0.0023
66.51% 89.88% 86.84% 97.80%
β̂M 0.0456 0.0028 0.0665 0.0062
20.55% 53.06% 76.08% 94.09%
β̂M2 0.0201 0.0007 0.0402 0.0026
64.96% 88.45% 85.53% 97.52%
very close to its desired nominal level, with only 1 percent off, while those produced
by other estimators are lower than the desired nominal level. In terms of mean
square error and relative mean square error, the full estimator β̂F performs the
best, followed by the max2 estimator β̂M2 and max estimator β̂M . The relative
efficiency of each proposed estimator to the naive estimator with respect to mean
square error is given in Table 6.10.
6.4 Comparison of the Standard and Simplified Jackknife Methods
Inspired by Jiang, Lahiri and Wan (2005), we proposed to use Jackknife
method to estimate variance of each estimator of β. A Jackknife replicate i is
constructed by leaving out data from blocks i, i = 1, . . . , G in file Fy and Fx.
The estimate of mixture model parameters ψ = {π,m1, . . . ,mK , u1, . . . , uK} are re-
estimated at each replicate data, and then are used to estimate the probability of
127
Figure 6.9: Simulation results for linear regression under case 1 in the
unequal scenario: Box plot of deviations and relative deviations of dif-
ferent estimates from the true value of β over 100 simulation runs.
Figure 6.10: Simulation results for linear regression under case 1 in the
unequal scenario: Plot of Monte Carlo estimate of bias, relative bias,
variance, relative variance, mean square error (MSE), relative mean
squared error(RMSE), length and coverage of nominal 95% confidence
intervals of regression coefficient β by different methods over 100 simu-
lation runs. Value of β is set to 0.5 for each simulation.
128
Table 6.9: Simulation results for linear regression under case 1 in the unequal sce-
nario: Monte Carlo estimate of bias, relative bias(R.Bias), variance, relative variance
(R.Var), mean square error (MSE), relative mean squared error(RMSE), length and
coverage rate (C.R.) of nominal 95% confidence intervals of regression coefficient
β by different methods over 100 simulation runs. Value of β is set to 0.5 for each
simulation. The corresponding estimated standard deviation is show in blue.
Bias R.Bias Var R.Var MSE R.MSE Length C.R.
Naive -0.0785 -0.1569 0.0028 0.0112 0.0089 0.0357 0.1149 35%
0.0053 0.0106 0.0012 0.0049 0.0025 0.0479
Full 0.0061 0.0122 0.0011 0.0043 0.0011 0.0044 0.1225 94%
0.0033 0.0066 0.0001 0.0006 0.0023 0.0239
Max 0.0501 0.1003 0.0016 0.0064 0.0041 0.0164 0.1294 69%
0.0040 0.0080 0.0005 0.0020 0.0031 0.0465
Max2 0.0117 0.0235 0.0014 0.0056 0.0015 0.0061 0.1240 88%
0.0037 0.0075 0.0002 0.0009 0.0027 0.0327
Table 6.10: Simulation results for linear regression under case 1 in the unequal
scenario: Relative efficiency (RE) of proposed estimators to naive estimator with
respect to mean square error.
Naive Full Max Max2
MSE 0.0089 0.0011 0.0041 0.0015
RE 0.1244 0.4607 0.1713
linkage and designate the record pairs as links and non-links. Then β̂−i, the esti-
mate of β for replicate i, can be obtained based on replicate data and ψ̂−i. And
the jackknife variance of an estimator of β can be obtained by aggregating these
G replicate estimates of β. This way, the estimated variance V̂ cannot only cap-
ture the variability caused by linkage errors but also cover the variability caused by
expectation maximization algorithm
However, the above jackknife method of estimating variance is time consum-
ing because of the complexity of process. The expectation maximization algorithm
is operated G + 1 times estimate mixture model parameters during the entire pro-
129
cess, 1 time on the full data to obtain ψ̂, G times on all jackknife replicates to get
{ψ̂−i, i = 1, . . . , G}. Also, the design matrix Q used in the estimating functions
need to be re-constructed for each jackknife replicate since it depends on the mix-
ture model parameters. We wonder whether the jackknife method can be simplified
to decrease the computation time without losing much accuracy. Based on some of
our findings that (1) the expectation maximization algorithm takes most part of the
computation time during the entire process; (2) there’s no big difference between
the estimate of mixture model parameters on the full data and on the jackknife
replicates, we propose a simplified version of the above jackknife method. Instead of
re-estimating mixture model parameters ψ for each jackknife replicate, we use the
estimate obtained from the full data ψ̂ during the entire process. Let V̂0 denote the
estimated variance of an estimate obtained from the simplified jackknife method.
Next, we conduct a Monte Carlo simulation study to show that the difference be-
tween these two estimated variances V̂ and V̂0 is quite small.
Again, the simulation is performed under two different scenarios: the equal
scenario and the unequal scenario. In the equal scenario, simulation is done for a
simple logistic model under Case 1 with simulation conditions as shown in Table
6.2. In the unequal scenario, simulation is done for a simple linear model under
Case 1 with simulation conditions as shown in Table 6.7. Under both scenarios, the
true value of the regression coefficient β is set to 0.5. For each of the R simulation
runs, two different jackknife methods are used to estimate variance of an estimate.
130
6.4.1 Equal Scenario
Figure 6.11 shows the box plot of the relative difference of V̂0 and V̂ of each
estimate of β. First, most of the relative differences among R simulation runs are
above 0, showing that V̂ is usually greater than V̂0. This is expected since V̂0 ignores
the variability in estimating mixture model parameters while V̂0 does not. Also, we
can clearly see the relative difference of V̂ and V̂0 is large for the naive estimator but
small for our propose estimators, especially for the full estimator β̂F , which is within
0.1. We wonder whether the absolute relative difference of V̂0 and V̂ is smaller than
a positive constant L. We would like to test the hypotheses
H0: d = C.
H0: d < C
where d >= 0 is the absolute relative difference of V̂0 and V̂ .
A one sample t test may be considered. However, the histogram of the absolute
relative difference, shown in Figure 6.12, looks strongly skewed, suggesting lack
of normality. We would therefore to use the Wilcoxon signed rank test, a non-
parametric statistical hypothesis test. The test statistic W+is the sum of ranks of
the positive difference of d and C. That is,
∑R
W+ = |d− C| × I{sign(d− C) = 1}
i=1
Under the null hypothesis, W+ has mean
R(R + 1)
µW+ =
4
131
and standard deviation √
R(R + 1)(2R + 1)
σW+ =
24
Under the null hypothesis, the distribution of the signed rank statistic W+ converges
to a normal distribution when the number of replicate R becomes large. Then we
can use the normal probability calculation (with continuity correct) to approximate
P-value for W+. The P-value is equal to
( ) ( )
+ +
P (X < W+
W − µW+ W − µW+
W+ ) = P Z <= = Ψ
σW+ σW+
The Wilcoxon signed rank statistics and their corresponding p values of the
hypotheses test for different choices of C are shown in Table 6.11. Based on the
result, there is strong evidence that the absolute relative difference between V̂ and V̂0
is less than 0.02 for full estimator, and less than 0.04 for all the proposed estimators.
There’s no evidence that the absolute relative difference is within 0.05 for the naive
estimator.
Table 6.11: Simulation results for logistic regression under case 1 in the equal sce-
nario: Table of test statistic and p values of one tailed Wilcoxon signed rank test.
The alternative hypothesis is that the absolute relative difference between the two
jackknife variances of the estimate of β is less than C.
C 0.01 0.02 0.03 0.04 0.05
Estimator w p.value w p.value w p.value w p.value w p.value
Naive 4954 1.00 4787 1.00 4618 1.00 4412 1.00 4104 1.00
Full 3440 1.00 1367 0.00 654 0.00 244 0.00 45 0.00
Max 4234 1.00 3452 1.00 2578 0.57 1936 0.02 1388 0.00
Max2 4327 1.00 3073 0.97 2196 0.13 1439 0.00 864 0.00
132
Figure 6.11: Simulation results for logistic regression under case 1 in the
equal scenario: Box plot of relative difference between the two jackknife
variances of each estimate of β. The true value of β is set to 0.5.
Figure 6.12: Simulation results for logistic regression under case 1 in
the equal scenario: Relative frequency histogram of absolute relative
difference between the two jackknife variances of each estimate of β.
The black line is the kernel density estimate.
133
Figure 6.13: Simulation results for linear regression under case 1 in the
unequal scenario: Box plot of relative difference between the two jack-
knife variances of each estimate of β. The true value of β is set to 0.5.
6.4.2 Unequal Scenario
Similar results are obtained for the linear regression under case 1 in the unequal
scenario. The box plot of relative differences of V̂ and V̂0 is displayed in Figure 6.13,
the histogram of absolute relative differences is shown in Figure 6.14, and the results
for the hypothesis test is given in Table 6.12. Based on the result, there is strong
evidence that the absolute relative difference of V̂ and V̂0 is within 0.05 for all the
proposed estimators, but not for the naive estimator.
6.4.3 Conclusions
Based on the results of the one-sided hypothesis test, we concludes that the ab-
solute relative difference between the variances obtained from the standard jackknife
134
Figure 6.14: Simulation results for linear regression under case 1 in the
unequal scenario: Relative frequency histogram of absolute relative dif-
ference between the two jackknife variances of each estimate of β. The
black line is the kernel density estimate.
Table 6.12: Simulation results for linear regression under case 1 in the unequal
scenario: Table of test statistic and p values of one tailed Wilcoxon signed rank
test. The alternative hypothesis is that the absolute relative difference between the
two jackknife variances of the estimate of β is less than C.
C 0.01 0.02 0.03 0.04 0.05
Estimator w p.value w p.value w p.value w p.value w p.value
Naive 5034 1.00 4994 1.00 4895 1.00 4767 1.00 4519 1.00
Full 4918 1.00 4354 1.00 3534 1.00 2773 0.80 1911 0.02
Max 4713 1.00 3887 1.00 3066 0.97 2343 0.27 1860 0.01
Max2 4866 1.00 4225 1.00 3443 1.00 2631 0.64 2019 0.04
method and from the simplified jackknife method are small for our proposed estima-
tors, but not for the naive estimator, under both the equal scenario and the unequal
scenario. Hence, we would recommend to use the simplified jackknife method to
estimate variance for any of the proposed estimators if one would like to pursue less
computation time.
135
Chapter 7: Future Research
Our research has initiated some new ideas and points to several directions for
future research.
A key assumption of the general methodology for small area estimation is that
the records from multiple sources can be partitioned into small areas without error,
and small areas coincide with blocks. In reality, however, this may not hold in
different situations:
(1) The variable specifying the membership of small areas is not present in all
files to be linked. Thus, it is impossible to partition the records in each file into
small areas before the record linkage process.
(2) Even when the records in multiple files can be divided into small areas
successfully, the number of record pairs within some small areas could be large,
and thus blocking within small areas may be a reasonable option in order to reduce
computational burden. In this case, our general methodology works when a small
modification to the linkage error model is made. The matrix of matching status
indicators Li = diag(Li1, . . . ,LiG ) turns block-diagonal, assuming that the recordsi
in small area i are partitioned into Gi blocks without error.
(3) It is also possible that small areas are nested within blocks. In this case, the
136
model is likely to introduce correlation across small areas. Our general methodology
works for point estimation because global parameters like the regression coefficients
and variance components will be estimated properly as long as we have a large
number of blocks. As for variance estimation, a new method is required since the
current jackknife methods used in the dissertation requires the measurements to
independent across small areas.
Our research is limited to performing statistical analysis on data from two
different files. Specifically, we consider the case where the variable of interest and
its predictors are observed separately for two samples of population units. When
developing the current methodology, we see the potential of extending it to an even
more general case, where observations on the variable of interest and some of its
predictors are recorded in one file and observations on the rest of its predictors are
stored in other multiple files. The basic idea is to use a system of linkage error models
and a system of mixture models. The validity of the idea need to be investigated in
the future.
Our proposed methodology requires the measurements to be independent
across blocks or small areas. This is mainly due to the assumption of the jackknife
methods we used for estimating bias, variance, and mean squared errors. Recently,
Jiang and Mahmoud proposed a Monte-Carlo-assisted approach to mean squared
error estimation of a small area estimate, which allows correlation across small areas.
It can be a potential tool to improve our research.
The dissertation is focused on the classical method of small area estimation
using data from multiple files. The classical unit-level models are used for describing
137
the relationship between the study variable and auxiliary variables, and the mixture
model is used for the purpose of record linkage. In the literature of small area esti-
mation, Hierarchical Bayesian approaches have been suggested due to the following
advantages:
(1) It is straightforward to take into account all sources of variation.
(2) The MCMC techniques have made it computationally feasible and easy to
estimate the model.
(3) The Bayesian approaches allow the use of the one-to-one matching assump-
tion, so that we do not need to be concerned about the one-to-many and many-to-one
linkage problem that usually occurs when classical methods are used.
In the future, we would like extend our research to use Bayesian methods for
small area estimation using data from multiple files.
In this dissertation, Monte Carlo simulations are used to provide preliminary
evidence supporting the validity of our general methodology. In the future, we would
like to apply the classical and Bayesian methods of statistical analysis using data
from multiple files to address some real issues. Poverty mapping and nonresponse
adjustment are two possible applications. We may have very limited information
about the individuals in poverty or nonrespondents from the sampling frame, but
more valuable information about them can be obtained if we can link the survey
and administrative data. As the amount of information increases, more advanced
models can be built to help us understand their behavior and further improve the
accuracy of poverty estimates or efficiency of weight adjustment. For example,
weighting class adjustment method is commonly used for nonresponse adjustment
138
when relatively few variables are available. If additional variables can be obtained
from record linkage, response propensity models, using logistic regression, can be
applied to predict the likelihood of response versus nonresponse, and then provide
a weighting factor. When it comes to poverty mapping, the more advanced models
can better predict the poverty status of an individual that was not sampled in the
survey, and further provide a more reliable poverty estimate for each small area.
139
Bibliography
[1] Alvey, M. and Jamerson, N. (1997). Record linkage techniques-1997. Proceed-
ings of an International Workshop and Exposition (1997 March), pp. 20-21.
[2] Armstrong, J.B. and Mayda, J.E. (1993) Model-based estimation of record
linkage error rates, Survey Methodology, 19, 137-147.
[3] Becker, M.P. and Yang, I. (1998) Latent class marginal models for cross-
classifications of counts, Sociological Methodology, 28, 293-325.
[4] Battese, G. E., Harter, R. M. and Fuller, W. A. (1988). An error-components
model for prediction of county crop areas using survey and satellite data. Jour-
nal of the American Statistical Association, 83, 28-36.
[5] Belin, T.R. Evaluation of sources of variation in record linkage through a fac-
torial experiment, Survey Methodology, 19 (1993) 13-29.
[6] Belin, T.R and Rubin, D.B (1995) A method for calibrating false-match rates
in record linkage, Journal of the American Statistical Association, Vol. 90, No.
430, 694-707.
[7] Binder, D.A. (1983) On the variances of asymptotically normal estimators from
complex surveys, International Statistical Review, 51, 279-292.
[8] R. Chambers, Regression analysis of probability-linked data, Statisphere (2009),
4.
[9] Chambers, R., Chipperfield, J.O., Davis, W. and Kovacevic, M. (2009b). ”In-
ference based on estimating equations and probability-linked data”. In: Centre
for Statistical and Survey Methodology, Working Paper Series. Wollongong:
University of Wollongong, p. 38.
140
[10] Chambers, R. and Kim, G. (2016). Secondary analysis of linked data in (K.,
Harron, H., Goldstein and C., Dibben, Eds) Methodological developments in
data linkage, pp. 83-108,Chichester: Wiley.
[11] Chipperfield, J.O., Bishop, G.R. and Campbell, P. (2011). Maximum likelihood
estimation for contingency tables and logistic regression with incorrectly linked
data, Survey Methodology, vol. 37, pp. 13-24.
[12] Chipperfield, J.O., Chambers, R. (2015). ”Using the bootstrap to analyse bi-
nary data obtained via probabilistic linkage”. Journal of Official Statistics, vol.
31, pp. 397-414.
[13] Dasylva, A. (2014). ”Design-based Estimation with Record-Linked Admin-
istrative Files”. In: Statistics Canada. Beyond traditional survey tak-
ing: adapting to a changing world: Proceedings of the 2014 International
Methodology Symposium, 29-31 October 2014, Ottawa, Canada. Ottawa:
Statistics Canada; 2014.http://www.statcan.gc.ca/sites/default/files/
media/14265-eng.pdf
[14] Dempster, A., Laird, N., and Rubin, D. (1977). ”Maximum Likelihood from In-
complete Data via the EM Algorithm”. Journal of the Royal Statistical Society
Series B, 39, pp. 1-38.
[15] Fellegi, I.P. and Sunter, A.B. (1969) A theory for record linkage, Journal of the
American Statistical Association, Vol. 64 1183-1210.
[16] Fortini, M., Liseo, B., Nuccitelli, A., and Scanu, M. (2000), On Bayesian record
linkage, Bayesian Methods with Applications to Science, Policy, and Official
Statistics: Selected Papers from ISBA 2000: The Sixth World Meeting of the
International Society for Bayesian Analysis, E. I. George ed., 155-164.
[17] Fortini, M., Nuccitelli, A., Liseo, B., and Scanu, M. (2002), Modelling issues
in record linkage: A Bayesian perspective, Proc. Amer. Statist. Assoc. Survey
Research Meth. Sec.,
[18] Gill, R. (1986), On Estimating Transition Intensities of a Markov Process
with Aggregate Data of a Certain Type: “Occurrences but No Exposures”,
Scand. Jour. Statist., 13, 113-134.
[19] Gill, L. E. (1997), OX-LINK: The Oxford Medical Record Linkage System
Demonstration of the PC Version, in Record Linkage Techniques - 1997, Proc.
International Workshop Exposition, Federal Committee on Statistical Method-
ology, Office of Management of the Budget, page 491.
141
[20] Goldstein, H., Harron, K. and Wade, A. (2015). The analysis of record-linked
data using multiple imputation with data value priors, Statistics in Medicine,
vol. 21, pp. 1485-1496.
[21] Thomas R. Belin, Evaluation of sources of variation in record linkage through
a factorial experiment, Survey Methodology, 19 (1993) 13-29.
[22] A.P. Dempster, N.M. Laird, and D.B. Rubin, Maximum likelihood from in-
complete data via the EM algorithm (with discussion), Journal of the Royal
Statistical Society, Ser. B, 39 (1977) 1-38.
[23] Gomatam, S., and Larsen, M.D. (2004), Record linkage and counterterrorism,
Chance 17, 25-29.
[24] Harron, K., Goldstein, H. and Dibben, C., eds. (2016). Methodological Devel-
opments in Data Linkage. John Wiley & Sons, Hoboken, NJ.
[25] Herzog, T.N., Scheuren, F.J. and Winkler, W.E. (2007) Data Quality and
Record Linkage Techniques, Springer, New York, NY.
[26] Herzog, T. N., Scheuren, F., and Winkler, W.E., (2010), Record Linkage, in
(D. W. Scott, Y. Said, and E.Wegman, eds.) Wiley Interdisciplinary Reviews:
Computational Statistics, New York, N. Y.: Wiley, 2 (5), September/October,
535-543
[27] Hof, M. H. P. and Zwinderman, A.H. (2012) Methods for analyzing data from
probabilistic linkage strategies based on partially identifying variables, Statistics
in Medicine,31, 42314242, DOI: 10.1002/sim.5498.
[28] Howe, G.R. (1981) A generalized iterative record linkage computer system for
use in medical follow-up studies, Computers and Biomedical Research, 14, pp.
327-340.
[29] Jaro, M. A. (1989). Advances in record linkage methodology to matching the
1985 census of Tampa, Florida, Journal of the American Statistical Association,
84, pp. 414-420.
[30] Jiang, J. (2007), Linear and Generalized Linear Mixed Models and Their Ap-
plications. New York: Springer.
[31] Jiang, J., Lahiri, P. and Wan, S-W. (2002) A unified jackknife theory for em-
pirical best prediction with M-estimation, Annals of Statistics, 30, 1782-1810.
142
[32] Kandari, N. and Lahiri, P. (2016), Prediction of a function of misclassified
binary data, Statistics in Transition new series, Vol. 17, No. 3, 429-447.
[33] Kim, G. and Chambers, R. (2012a). Regression analysis under incomplete link-
age. Computational Statistics and Data Analysis, 56, no. 9, pp. 27562770.
[34] Kim, G. and Chambers, R. (2012b). Regression analysis under probabilistic
multi-linkage. Statistica Neerlandica, 66, no. 1, pp. 6479.
[35] Kim, G. and Chambers, R. (2013). Bias reduction for correlated linkage error.
Working Papers Series. Wollongong: NIASRA, University of Wollongong.
[36] Krewski, D., Dewanji, A., Wang, Y., Bartlett, S., Zielinkski, J.M. and Mallick,
R. (2001). Regression Analysis with Linked Data. Survey Methodology, 31(1),
13-22.
[37] Lahiri, P. and Larsen, M.D. (2005) Analysis with linked data, Journal of the
American Statistical Association, 100, No. 469, 222-230.
[38] Lane, J. (2010). Linking administrative and survey data.In P. V. Marsden & J.
D. Wright (Eds.), Handbook of survey research (p. 659-680). Bingley: Emerald.
[39] Larsen, K. (2005), Generalized Naive Bayes Classifiers, SIGKDD Explorations,
7 (1), 2005, 76-81,doi¿10.1145/1089815.1089826.
[40] Larsen, M.D. (1999a), Multiple imputation analysis of records linkage using
mixture models, Proc. Statist. Soc. Canada Survey Meth. Sec., 65-71.
[41] Larsen, M.D. (1999b), Predicting the residency status for administrative records
that do not match Census records,Administrative Records Research Memoran-
dum Series, #20,Bureau of the Census, U.S. Department of Commerce.
[42] Larsen, M.D. (2002), Comment on hierarchical Bayesian record linkage, Proc.
Sec. Bayesian Statist. Sci., American Statistical Association meeting, New York
City, NY, CDROM: 1995-2000.
[43] Larsen, M. D. (2004), Record linkage using finite mixture models, Applied
Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives,
A. Gelman and X-L. Meng, eds. 309-318.
[44] Larsen, M.D. and Rubin, D.B. (2001) Iterative automated record linkage using
mixture models, Journal of the American Statistical Association, Vol. 96, 32-34.
143
[45] Livingston, E. H., and Ko, C. Y. (2005), Effect of diabetes and hypertension
on obesity-related mortality, SURGERY 137, 16-25.
[46] Lohr, S. L., and Rao, J.N,K. (2009), Jackknife estimation of mean squared
error of small area predictors in nonlinear mixed models, Boimetrika 96, 457-
468, 16-25.
[47] McCutcheon, A. L. (1987), Latent class analysis, Sage Publications, Inc.: New-
bury Park, CA; London.
[48] McGlinchy, M. (2004), A Bayesian record linkage methodology for multiple
imputation of missing links, Proc. Amer. Statist.Assoc. Survey Research Meth.
Sec., Alexandria, VA, CDROM.
[49] McLachlan, G. and Peel, D. (2000), Finite Mixture Models,1st ed., Wiley-
Interscience.
[50] Meng, X.L. and Rubin, D.B. (1993) Maximum likelihood estimation via the
ECM algorithm: a general framework, Biometrika, Vol. 80 267-278.
[51] Neter, J., Maynes, E. and Ramanathan, R. (1965) The Effect of Mismatching
on the Measurement of Response Error, Journal of the American Statistical
Association, 60.
[52] Newcombe, H. B., Kennedy, J. M., Axford, S. J. and James, A. P.(1959),
Automatic linkage of vital records, Sceince 130,954-959.
[53] Newcombe, H.B., and Kennedy, J. M. (1962). Record Linkage:Making Maxi-
mum Use of the Discriminating Power of Identifying Information Communica-
tions of the Association for Computing Machinery, 5, 563-567.
[54] D’Orazio, M., Zio, M.D., and Scanu, M. (2006) Statistical Matching: Theory
and Practice, John Wiley & Sons.
[55] Rao, J.N.K. (2002) Estimating equations for the analysis of survey data using
poststratification information, Sankhyā, 64, 364-378.
[56] Rässler, S. (2002) Statistical matching: a frequentist theory, practical ap-
plications, and alternative Bayesian approaches, Lecture Notes in Statistics,
Springer.
144
[57] Sadinle, M. and Fienberg, S. (2013). A generalized Fellegi-Sunter framework for
multiple record linkage with application to homicide record-systems Journal of
the American Statistical Association, 108, pp. 385-397.
[58] Rao, J.N.K. and Molina, I. (2015), Small Area Estimation, 2nd ed., Wiley.
[59] Särndal, C-E, Swensson, B., and Wretman, J. (1992) Model Assisted Survey
Sampling, Springer-Verlag.
[60] Samart, K. and Chambers R. (2014) Linear regression with nested errors using
probability-linked data, Australian and New Zealand Journal of Statistics,56(1),
27-46.
[61] Scheuren, F.J. and Winkler, W.E. (1993) Regression analysis of data files that
are computer matched, Survey Methodology, 19, 39-58.
[62] Scheuren, F.J. and Winkler W.E. (1997). Regression Analysis of Data that are
Computer Matched - Part ii. Survey Methodology, 23(2):157-165.
[63] Schnell, R. (2013), Linking Surveys and Administrative Data, German Record
Linkage Center Working Paper Series, No. WP-GRLC-2013-03.
[64] Steorts, R. C., Hall, R. and Fienberg, S. E. (2015). A Bayesian Approach
to Graphical Record Linkage and De-Duplication. Journal of the American
Statistical Association, in press.
[65] Tancredi, A., and Liseo, B. (2011) A hierarchical Bayesian approach to record
linkage and population size problems. Annals of Applied Statistics, 5, 1553-1585.
[66] Tancredi, A. and Liseo, B.(2015). Regression analysis with linked data: prob-
lems and solutions, Statistica.
[67] Tancredi, A., Steorts, R.C., Liseo, B. (2017). A Bayesian approach for dedupli-
cation, record linkage and inference with linked data. Working paper, MEMO-
TEF, Sapienza Università di Roma.
[68] Thibaudeau, Y. (1993) The discrimination power of dependency structures in
record linkage, Survey Methodology, 19, 31-38.
[69] Wang, J. and Donnan, P. (2002). Adjusting for missing record-linkage in out-
come studies. Journal of Applied Statistics, Aug; 29(6):873-884.
145
[70] Winkler, W. E. (1988), Using the EM algorithm for weight computation in
the Fellegi-Sunter model of record linkage, Proc. Amer.Statist. Assoc. Survey
Research Meth. Sec., 667- 671.
[71] Winkler, W. E. (1989), Near automatic weight computation in the Fellegi-
Sunter model of record linkage, Proc. Bureau of the Census Annual Research
Conference 5, 145-155.
[72] Winkler, W.E. (1990) String comparator metrics and enhanced decision rules in
the Fellegi-Sunter model of record linkage, Proceedings of the Section on Survey
Research Methods, American Statistical Association (1990) 354-359.
[73] Winkler, W. E. (1993), Improved decision rules in the Fellegi-Sunter model of
record linkage, Proc. Amer. Stastist. Assoc. Survey Research Meth. Sec., 274-
279.
[74] Winkler, W. E. (1994), Advanced methods for record linkage, Amer. Statist.
Assoc. Survey Research Meth. Sec., 467-472.
[75] Winkler, W. E. (1995), Matching and record linkage, in Business Survey Meth-
ods, Cox, B. G., Binder, D. A., Chinnappa, B. N., Christianson, A., Colledge,
M. J., and Kott, P. S. eds., Wiley, New York, 355-384.
[76] Winkler, W. E. (1995), Editing Discrete Data, Proceedings of the Section on
Survey Research Methods, American Statistical Association, 108-113.
[77] Winkler, W. E. (2007). Examples of easy-to-implement, widely used methods
of masking for which analytic properties are not justified. Technical report,
Statistical Research Division, U.S. Bureau of the Census, Washington, DC.
[78] Winkler, W. E. (2014). Matching and Record Linkage. Wiley Interdisciplinary
Reviews. Computational Statistics, 6, 313-325.
146