ABSTRACT Title of Document: DIAGNOSTICS FOR MULTIPLE IMPUTATION BASED ON THE PROPENSITY SCORE. Jia Wang, MPH, 2010 Directed By: Assistant Professor Guangyu Zhang, Department of Epidemiology and Biostatistics Abstract: Multiple imputation (MI) is a popular approach to handling missing data, however, there has been limited work on diagnostics of imputation results. We propose two diagnostic techniques for imputations based on the propensity score (1) compare the conditional distributions of observed and imputed values given the propensity score; (2) fit regression models of the imputed data as a function of the propensity score and the missing indicator. Simulation results show these diagnostic methods can identify the problems relating to the imputations given the missing at random assumption. We use 2002 US Natality public-use data to illustrate our method, where missing values in gestational age and in covariates are imputed using Sequential Regression Multiple Imputation method. DIAGNOSTICS FOR MULTIPLE IMPUTATION BASED ON THE PROPENSITY SCORE By Jia Wang Thesis submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Master of Public Health 2010 Advisory Committee: Assistant Professor Guangyu Zhang, Chair Assistant Professor Tongtong Wu Assistant Professor Xin He ? Copyright by Jia Wang 2010 ii Table of Contents Table of Contents .......................................................................................................... ii List of Tables ............................................................................................................... iii List of Figures .............................................................................................................. iv Chapter 1: Introduction ................................................................................................. 1 1.1 Missing Data Mechanisms .......................................................................... 1 1.2 Existing Approaches to Missing Data......................................................... 2 1.3 Multiple Imputation .................................................................................... 3 Chapter 2: Diagnostic Method for Multiple Imputation ............................................... 6 2.1 Existing Diagnostics for Multiple Imputation ............................................ 6 2.2 Diagnostics Based on the Propensity Score ................................................ 7 Chapter 3: Simulation Study ......................................................................................... 9 Chapter 4: Application to 2002 US Natality Public-Use Data ................................... 15 4.1 Data Source: the 2002 Natality Public-Use Data ...................................... 15 4.2 Variable of Interests: Gestational Age (DGESTAT) ................................ 15 4.3 Multiple Imputation using Sequential Regressions .................................. 17 4.4 Diagnostic Procedures for Imputations of Gestational Age ..................... 19 Chapter 5: Discussion ................................................................................................ 21 Appendix...????????????????????????????.. 24 Bibliography ............................................................................................................... 45 iii List of Tables Table 1: Summary of mean function, correct, overfitted and incorrect imputation model?.. .?????????????????????????????. 24 Table 2: Summary of mean function, true propensity function, percentage of missing of Y, correct and overfitted propensity model. ........................................................... 25 Table 3: Regression of Y on the propensity score and the missing indicator. ............ 26 Table 4: Correlation matrix among all continuous variables. ..................................... 27 Table 5: Spearman correlation coefficients and p-values. .......................................... 27 Table 6: Categorical variables included in the imputation model. ............................. 28 Table 7: Continuous, count or mixed variables included in the imputation model. ... 31 Table 8: Variables included in the propensity model and percent missing. ............... 32 Table 9: Point estimates (Standard Errors) and p-values of linear regression coefficients for model of gestational age. ................................................................... 33 iv List of Figures Figure 1: Scatter plots of true propensity score versus estimated propensity score. Propensity score is estimated by fitting correct model. .............................................. 34 Figure 2: Scatter plots of true propensity score versus estimated propensity score. Propensity score is estimated by fitting overfitted model. .......................................... 35 Figure 3: Scatter plots of propensity score from correct model versus propensity score from overfitted model. ................................................................................................ 36 Figure 4: Distribution of completed Y versus X1/X2. ................................................. 37 Figure 5: Histograms of observed Y and imputed Y. ................................................. 39 Figure 6: Distribution of completed Y versus propensity score. ................................ 41 Figure 7: Plots of gestational age versus propensity score. ........................................ 44 1 Chapter 1: Introduction Missing data is a ubiquitous problem in the analysis of survey data. Missing data for individual variables can occur due to nonresponse for sensitive or difficult items (e.g. income measures), mistakes in responding to survey questions (e.g. incorrect skips) or nonresponse to complete phases of a multi-phase survey (e.g. refusal of medical examination in NHANES). Two potential problems with the analysis of incomplete data are: (1) loss of information or power due to missing data; and (2) potential bias due to systematic differences between observed data and the unobserved data (Barnard and Meng, 1999). 1.1 Missing Data Mechanisms There are three types of missing data mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). MCAR occurs if the probability of missingness is the same for all units and missing occurs completely at random. In other words, a missing response is independent of both observed and missing values (Rubin 1976; Little and Rubin, 2002). A second condition is MAR, where missingness depends on only the observed characteristics of a participant, but not on the missing values themselves (Rubin 1976; Little and Rubin, 2002). Lastly, MNAR mechanism implies that missingness is related to the unobserved values of the variables with missing data. In such situation, the probability of missingness varies and cannot be characterized by available predictors (Rubin 1976; Little and Rubin, 2002). 2 1.2 Existing Approaches to Missing Data There are several approaches to handling missing data. The most simple and convenient method is complete case analysis, by which only individuals with complete information on all variables are included in the statistical analysis. Available case analysis (pairwise deletion) includes all available data under analysis, instead of removing the entire cases that have missing values on any of the variables. However, the inference may base on different subjects for different estimators. The main drawback of these two methods is that MCAR assumption must be held. Otherwise, they may lead to biased results. Maximum likelihood estimation method obtains the maximum likelihood parameter estimates of interest by maximizing the observed data likelihood given that the missing values are MAR. The disadvantages of this method are that it requires fairly sophisticated computations and they are specific to the model being applied. Imputation procedures are techniques for assigning plausible values to missing data. Imputation techniques range from the simplest mean imputation to multiple imputation (MI), all of which produce a completed data set that can be analyzed using standard complete data software procedures. Furthermore, unlike maximum likelihood estimation that is problem-specific and may require totally different and complicated computational procedures to for different models, using the same imputation approach to handling missing data on public-use datasets provides consistency across different scientific questions (Parker and Schenker,2007). Single imputation methods, including mean imputation, regression imputation, hot-deck imputation, stochastic imputation, can be viewed as precursors of multiple imputation. For single imputation, only one plausible value is imputed for each missing observation. 3 The main disadvantage of this method is that the imputed values are treated as if they were true values, so that it fails to account for the added uncertainty due to the assignment of a plausible, yet not actual, value for each missing value. Therefore, the parameter estimate and variance may be underestimated. 1.3. Multiple Imputation The ideas of multiple imputation for missing data were first proposed by Donald Rubin in 1977 and now a variety of statistical software packages have capabilities to conduct MI. Multiple imputation method meets two requirements to develop accurate parameter estimates and variances: (1) the imputation should be model-based in a way that the distributions of the variables and the relationships among the variables can be captured; (2) the imputation method should account for the uncertainty in the imputed values. The multiple repetitions of imputation procedures enable the estimation of variance that is added due to imputing missing values in the data set. Multiple imputation, an extension of the single imputation method, comprises three steps: (1) each missing value is replaced by a set of K>1 plausible values to generate K complete data sets (Sinharay and Russell, 2001). The critical component of this step is the imputation model selection, which is defined by a set of variables available to the imputation process and the distributional assumptions; (2) each of K complete data sets are then analyzed using standard statistical analyses. The results are K point estimates and their corresponding estimated variances; (3) the results from the K completed data sets are combined to create parameter estimates and standard errors. Estimates of population parameter are computed using an average of the parameter estimates of l=1, ?, K co mpleted data set from step 2 4 ? = 1K ?l Kl=1 , where ?l =estimate of ? from the completed data set l=1, ? , K. The corresponding variance for ? is estimated by a simple combination of the average of the K variance estimates and the variance of the K point estimates. var(? ) =U + K+1K ? B, where U = within-imputation variance= 1K var(?l )Kl=1 , B=between-imputation variance= 1K?1 (?l ? ? )Kl=1 . Rubin (1987) showed that the efficiency of an estimate in MI analysis is approximately (1 + ?K)?1 , where ? is the fraction of missing values and K is the number of multiple repetitions of the imputation process. For example, consider a dataset with 25% missing values, K=5 imputations gives 95.2% efficiency. Virtually all of the desirable efficiency can be achieved by using K=5 to K=10 independent repetitions of the imputation process. The success of MI depends on two required assumptions. First, an important step in generating multiple imputations is to assume an imputation model, which is defined by a set of variables and the distributional assumption. The selection of variables to include in the imputation model directly affects the quality of imputations. A general rule of thumb is to incorporate as many as possible the available data and the possible variables correlated with the analysis variables, and at the same time keep the model building and fitting feasible (Barnard and Meng, 1999). Usually, the set of variables included in the imputation model for an MI analysis is much larger and broader in scope than the variables required for the analytic model. Failure to include one or more variables in the 5 imputation model can yield less accurate imputed values. Except for the selection of variables, in order to generate imputations, one must assume a probability model on the complete data. This multivariate model must preserve the associations among the many variables included in the imputation model. A variety of algorithms like Markov Chain Monte Carlo (MCMC) method, or Sequence Regression Model, can be used to generate the imputations. Second, MI assumes that the missing data are missing at random (MAR), that is, the probability that an observation is missing only depends on the observed values, but not on the missing value (Rubin, 1976). Let Y be a data matrix, Ymis be the missing part of Y and Yobs be the observed part of Y. Suppose M is a missing data indicator matrix of the same dimension of Y, where the elements are zero or one depending on whether the corresponding elements of Y are observed or missing. MAR implies that P M Y = P M Yobs . In principle, it is impossible to test the assumption of MAR without additional data collection, since information that would be used to make such a test is unavailable (Abayomi, Gelman and Levy 2008). Therefore, due to the belief that imputed values are merely the guesses of the unobserved data, which are unknown, few attempts have been made to check the quality of imputed data. We propose a diagnostic method based on the propensity score to check the quality of multiple imputations described in Section 2, and conduct a simulation study to show how this diagnostic method can serve as a reliable method for assessing the problems relating to the imputation model given the assumption of MAR in Section 3. Then we apply this diagnostics method to imputations of gestational age in 2002 US Natality public-use data in Section 4. We conclude the thesis in Section 5. 6 Chapter 2: Diagnostic Method for Multiple Imputation 2.1 Existing Diagnostics for Multiple Imputation Abayomi, Gelman and Levy (2008) developed diagnostics for random imputations, based on two arguments: (a) imputations can be checked by using a standard of reasonability: the differences between observed and missing values and the distribution of the completed data as a whole can be checked to see whether they make sense in the context of the problem being studied; (b) imputations are typically generated by using models that are fitted to observed data, and the fit of these models can be checked. They first checked if there were unusual patterns that might suggest problems with imputation (e.g. the histogram of the completed data of a variable was bimodal because the imputed data markedly differed from the observed data). Next, they compared and flagged the difference between the distributions of observed and imputed data values. Finally, they checked the fit of the observed data to the imputation model that was used to create the imputations (check whether the pattern of residuals versus expected values was random). The non-random pattern of residual plots may flag the problems in terms of the violation of the missingness assumptions, and thus the imputation model. One limitation of their methods is that it only works well if there are dramatic differences between the imputed and observed data. However, differences in distributions do not necessarily suggest a problem with the imputations or a violation of MAR, because such differences might be explained by other variables in the data set. We will illustrate this issue in detail in the simulation study. 7 2.2 Diagnostics Based on the Propensity Score Propensity score is a conditional probability of assignment to a particular group given a vector of observed covariates (Rubin, 1983). The key attribute of the propensity score is that adjustment for the scalar propensity score is sufficient to remove bias due to all observed covariates (Rubin, 1983). Therefore, the basic idea of our diagnostic method is that for multiple imputation assuming MAR, the observed data and the imputed data have the same conditional distribution given on the propensity score. If the two distributions differ, it suggests that the imputation results are questionable. Let (Y, X1, X2, ?, X p) be a vector of variables with Y having missing values and X1, X2, ?, X p fully observed variables. Let m denote a missing indicator with m=1 when Y is missing and m=0 when Y is observed. The propensity score, or the probability of missingness, is denoted as P. logit (P)= logit (P (m=1) | X1, X2, ?, X p)) The missingness of Y only depends on X1, X2, ?, X p. Thus, the missing data mechanism is MAR. We construct a set of imputations by using multiple imputation procedure (PROC MI in SAS 9.1), and then apply the diagnostic method to the imputed data sets. The estimates of the propensity score can be obtained by fitting a logistic regression model of m on X1, X2, ?, X p, yielding the predicted values of m. The first diagnostics is to compare the distributions of the imputed and observed values against the propensity score. We look for differences in the conditional distributions, which suggest the inaccuracy of the imputations because the missingness of Y is not a random sample of the original data given on the same propensity score. 8 We then fit regression models of Y (Yobs+Yimp) as a function of the propensity score (P) and the missing indicator (m). An insignificant association between m and Y implies that the missingness of Y is independent of the values of this variable after adjusting for the propensity score. If the missingness can completely be explained by the propensity score, it indicates the MAR assumption holds true and the imputation model used to generate imputations enables to preserve the associations among all available variables in the data set. The following is the flowchart of our diagnostic method. No Ye s No Yes s Step 1: Create multiple imputation based on the imputation model. Step 2: Estimate the propensity score by fitting logistic regression model of the missing indicator. Step 3: Plot Y (Yobs+Yimp) versus the propensity score. Step 4: Fit a regression model of Y (Yobs+Yimp) on the propensity score and the missing indicator. Low quality of Yimp High quality of Yimp Low quality of Yimp High quality of Yimp Diff between Yobs & Yimp? Significant effect of m? 9 Chapter 3: Simulation Study We illustrate our method by a simulation study. We generate a dataset with 500 subjects. Table 1 and 2 show the models we use to generate the data, to create the missing values, to estimate propensity score, and to impute missing data. The following procedures describe how we conduct the simulation with data set 1 as an example. (1) Generate a dataset with a sample size of 500 and three variables (Y, X1, X2) based on the following model: Y = 1 + X1 + 2 ? X2 + ?, where X1, X2, and ? all follow standard normal distribution with mean 0 and variance 1. In addition, create another ten variables X3, X4,?, X 12 in the data set, all of which follow either a normal or a uniform distribution. (2) Generate missing values of Y from the response propensity model: logit (P (m=1|X1, X2)) = ?0 + ? 1*X1+ ? 2*X2. The different percentage of missing values depends on the arbitrary assignments of parameters ? 0, ? 1, and ? 2. The missingness of Y only depends on the values of X1 and X2, thus, the missing mechanism is MAR. (3) Impute missing values and develop 5 completed data sets by fitting a correct model of Y given X1 and X2 using PROC MI procedure in SAS9.1, denoted as [X1, X2]. (4) Impute missing values and develop 5 new completed data sets by using an overfitted model of Y given X1, X2 ,?, X 12, denoted as [X1, X2, ?, X 12] 10 (5) Impute missing values and develop another 5 completed data sets by fitting an incorrect model of Y only given X1, denoted as [X1]. (6) Estimate two sets of propensity scores by fitting a correct logistic regression model of m given X1 and X2 and by fitting a overfitted regression model of m given X1, X2 ,?, X 12. (7) Plot the true propensity scores versus two sets of estimated propensity scores from Step 6 and plot the estimated propensity scores from the correct model against the ones from the overfitted model. (8) Plot three sets of Ys from Step 3-5 against the estimated propensity scores from the correct model (red: Yobs; blue: Yimp). Compare the distribution of the imputed Ys and observed Ys. In addition, true values of Ys (from step 1) versus the propensity score are plotted as well. (9) Plot the observed and imputed Ys from step 3 and 5 against one of covariates (X1). Then plot the histograms of Y to compare the distributions of Y at two levels of missingness as Abayomi et al. (2008) did. (10) Fit a linear regression model of Y from step 3, 4 and 5 on the estimated propensity score and m. The results from the 5 completed data sets are combined to create parameter estimates and standard errors using PROC MIANALYZE procedure in SAS 9.1. Diagnostic methods are applied not only across percentage of missingness (from low to high), but also across three different mean functions of Y. Repeat step 2 to step 10, except changing the propensity model, correct imputation model, overfitted and incorrect imputation model correspondingly as shown in Table 1 and 2. 11 The simulation study compares the conditional distributions of the observed and imputed Ys given the propensity score when the imputations are created from correct, overfitted or incorrect imputation model. Theoretically the similarities of the patterns between the observed and imputed Ys suggest the high quality of imputations. Furthermore, fitting regression models of Y (created by correct, overfitted or incorrect model) on the propensity score and m will quantitatively check the effectiveness of our diagnostic method. The purpose of using overfitted imputation model to generate imputations is to test the robustness of our diagnostic method. Practically, there is no way to know the correct imputation model, a determinant factor affecting the quality of imputations. Overfitted imputation model can be a common situation in real imputation process, because by following the general rule we usually incorporate as many as possible the variables that might be associated with Y. Estimating the propensity score by fitting the overfitted model is also due to the fact that the correct propensity model is always unknown in practice. Therefore, the scatter plots of propensity scores and a regression model of Y on the propensity score from the overfitted model and m are used to test the reliability of our diagnostic method. The scatter plots of the true propensity score versus the estimated propensity score from the correct propensity model and the overfitted propensity model are shown in Figure 1 and Figure 2 respectively, blue points when Ys are missing and red when Ys are observed. Figure 3 shows the scatter plots of the propensity score from the correct model versus the ones from the overfitted model. In each scatter plot, although there is a larger variation among the points in Figure 2 and 3 because of the noise in the overfitted model, 12 all of the points almost lie on a 45-degree straight line. It suggests that the estimated propensity scores either from the correct or the overfitted model are reliable to reflect the true probability of missingness of Y. The results from this simulation study support our statement that the graphical diagnostics proposed by Abayomi et al. (2008) has its own limitation. Figure 5 shows the histograms of the completed data of Y (from Step 3 and 5) at two levels of missingness. These histograms illustrate that the distributions of Y can be different between observed and imputed values. The distributions of observed and imputed Ys against one of the covariates in the imputation model are plotted in Figure 4. In these scatter plots, some deviations between observed and imputed Ys do exist under MAR. Such differences can come from the effects of other variables on the missingness of Y in the data set. For example, in Figure 4 (when ?y =1+X1+2*X2), due to the effect of X1, the larger values of Y are more likely to be missing than the smaller values of Y. Therefore, such differences between observed and imputed data can not necessarily flag the potential problems relating to the imputation model. Our graphical diagnostics can avoid this problem by adjusting for the propensity score. The conditional distributions of Y given the propensity score can actually remove the overall effects of the covariates in the data set. Figure 6 plots three sets of Ys versus the estimated propensity score. These bivariate scatter plots, including the smooth curves, present the comparisons of conditional distributions of the imputed and the observed Ys given the propensity score. Observed data are shown in red and imputed data in blue. Our diagnostics can be applied to each of 5 completed data sets. In this paper, we only show the plots for a single randomly chosen completed data set. There are obvious differences 13 between the distributions of the observed and imputed Ys conditional on the propensity score, when the missing values are imputed using incorrect imputation model (Figure 6(d)). In contrast, when the correct (Figure 6(a)) or overfitted imputation models (Figure 6(c)) are used to generate MIs, there are only slight deviations between two curves. Additionally, these patterns of distribution are similar to the true distributions when we plot the true Ys against the propensity score (Figure 6(e), blue: m=1; red: m=0). When the estimated propensity scores from the overfitted model are applied in the scatter plot (Figure 6(b)), they show the similar results. Results of fitting linear regression model of completed Y on propensity score and m are shown in Table 3. When the correct or overfitted imputation model is used to impute missing values, from low (16%) to high (64%) percentage of missing values across three mean functions of Y and no matter if the propensity scores are estimated from correct or overfitted models, the effect of the missing indicator is insignificant, while the propensity score has a significant association with Y. However, when the incorrect imputation model is implemented to create imputations, in most of cases in this simulation study, both the propensity score and the missing indicator have significant effects on the values of Y. This simulation study is empirical evidence that our graphical diagnostic approach to checking the imputation model is robust. It can be functioned as indirect method to identify potential problems relating to the imputation model. Obvious deviations between distributions of observed and imputed values conditional on the propensity score do occur when the imputation model that is used to generate imputations fails to preserve the associations of all important variables. Then this model would be flagged because of the marked differences. 14 In addition to the graphic presentation of observed and imputed data, we fit a regression model of the completed Y as a function of the propensity score and the missing indicator. The results suggest that we can assess the goodness of imputations by examining the relationship of m with Y. Given the assumption of MAR, the statistically significant effect of m on Y, after adjusting for the propensity score, indicates a deficiency in the imputation model, which fails to preserve all the associations among the variables with the dependent variable. Thus the model underlying the inaccurate imputations should be suspected. The results from Figure 6 (c) and Table 3 confirm the notion that the inclusion of as many as possible the variables in the imputation model can improve the imputations, even though it might be overfitted. When missing values are imputed by using the overfitted model, there is no significant difference in the conditional distribution given propensity score between observed and imputed Y. Moreover, the insignificant effect of m indicates the sufficiency of imputation model to capture the associations among all variables. This simulation study illustrates where and how our diagnostics can serve as effective method for assessing the imputation model that is used to generate the imputed data given the assumption of MAR. In both steps of diagnostic procedures, the graphical display and statistical analysis based on the propensity score can flag the inaccurate imputation model. 15 Chapter 4: Application to 2002 US Natality Public-Use Data 4.1 Data Source: the 2002 Natality Public-Use Data We apply our method to 2002 US Natality public-use dataset produced by the National Center for Health Statistics (NCHS). The NCHS collects Natality data from Standard Certificate of Live Birth for all living births in the United States every year and releases them to the public. The 1989 version of US Standard Certificate of Live Birth provides a wide variety of information on maternal and infant health characteristics, including information on general items, occurrence, residence, prenatal, child, mother, pregnancy history, father, medical and health data (NCHS, 2002). The 2002 public-use Natality data consists of 4,027,376 live births within the United States to residents and non-residents. Our study sample includes a subset of 2002 US Natality data. We randomly select 40,274 newborns, 19,730 females (48.99%) and 20,544 males (51.01%). 4.2 Variable of Interests: Gestational Age (DGESTAT) The high incidence rate and consequences of preterm births make it necessary to correctly determine the important factors that affect preterm delivery in order to establish guidelines for monitoring and treatment plans for expectant mothers who are most susceptible to preterm labor (Hammad, 2009 ). However, missing data and inaccurate information on gestational age have affected the utility of the US Natality public-used datasets (Parker and Schenker, 2007). The period of gestation is defined as beginning with the first day of the last normal menstrual period (LMP) and ending with the day of the birth (NCHS, 2002). In 2002 Natality file, gestational age information contains four parts: (a) computed using date of 16 birth of child and last normal menses; (b) imputed from LMP date; (c) the clinical estimate; or (d) unknown when there is insufficient data to impute or no valid clinical estimate (NCHS, 2002). The primary measure (Part (a)) used to determine the gestational age of the newborn is the interval between the first day of the mother?s LMP and the date of birth. It is subject to error due to reasons including imperfect maternal recall or misidentification of the LMP because of post conception bleeding, delayed ovulation, or intervening early miscarriage (Martin, et al., 2003). The clinical estimate is used in three situations: (1) if the LMP date is not reported; (2) when the computed gestation is outside the 17-47 code range; (3) normal weight births come with apparently short gestations and very-low-birth weight births reported to be full term. There are 4.6 percent of the births in 2002 Natality data based on the clinical estimate of gestation. The NCHS also publishes the imputed weeks of gestation for records with missing day of LMP when there is a valid month and year. Although LMP-based gestational ages are edited for obvious inconsistence with the infant?s plurality and birth weight, reporting problems for this item persist and may occur more frequently among some subpopulations and among births with shorter gestations (Alxandra & Allen, 1996). Some research is ongoing to address these data deficiencies. In order to avoid dealing with the intricacies of misspecified gestational ages, we set gestational age (DGESTAT in the data set) to missing if computed gestation is different from its clinical estimate by more than 2 weeks or they are replaced with the clinical estimations or the imputed gestational age created by the NCHS. After these alterations, there are 18.69% missing values for DGESTAT among all subjects in the final dataset used in the analysis. 17 4.3 Multiple Imputation using Sequential Regressions 2002 US Natality public-use data set consists of 213 variables, including the recoded ones. They have many types of variables, such as continuous (birth weight, age of mother, etc.), categorical (race of mother, marital status, etc.), count (number of prenatal visits), or mixed variables (number of cigars per day, number of drinks per week). Some of these variables have small percentages of missing values which need to be imputed as well. In addition, there are certain reasonable bounds for specific variables with missing values, which must be incorporated in the imputation process. For example, the imputed values for ?Age of Mother? must be greater than 10 and less than 54 and imputations for ?Age of Father? must be greater than 10. Because of the complex data structure of US Natality public-use file, we choose sequential regression multiple imputation (SRMI) method by using publicly available software (IVEware, available at http://www.isr.umich.edu/src/smp/ive) to handle both missing and implausible gestational ages, as well as missing values in the covariates. The basic strategy of SRMI is to create imputations through a sequence of multiple regressions on a variable by variable basis, varying the type of regression model by the type of variable being imputed. Covariates include all other variables observed or imputed for that individual (Raghunathan, et al., 2001). SRMI imputes the least missing variables before the most missing at each round of the procedure and then continued in a cyclical manner, each time overwriting previously drawn values, building interdependence among imputed values and exploiting the correlational structure among covariates (Raghunathan, et al., 2001). 18 The rule for the selection of variables in the imputation model is to include as many as possible the variables that are possibly correlated with the period of gestation. Therefore, the imputation model in our study includes variables from all 10 categories regarding the newborn and maternal characteristics mentioned above. Some of the variables are summed into one category to be used in the imputation model. These summary variables include: the total number of medical risk factors (MEDRK), the total number of obstetric procedures (OBSTET), the total number of complications of labor or delivery (LABOR), the total number of abnormal conditions of the newborn (NEWBN), and the total number of congenital anomalies (CONGN). Descriptive statistics for all variables included the imputation model are listed in Table 6 (categorical variables) and Table 7 (continuous, count and mixed variables). Three variables (Number of Live Birth, Now Living; Number of Live Birth, Now Dead; Number of Other Termination) and five summary variables are classified as categorical variables, because high percentages of value 0 of these variables can lead to unstable results in the SRMI procedures if they are treated as continuous or count variables. The Pearson correlation for continuous variables and Spearman correlation for five discrete variables (CORR procedure in SAS 9.1) are used to check the possible colinearity. Correlation matrixes are shown in Table 4 and Table 5 respectively. As can be seen is Table 5, two pairs of variables (Number of Live Births, Now Living (NLBNL) vs. Detailed Total Birth Order (DTOTORD), Number of Live Births, Now Living (NLBNL) vs. Detailed Live Birth Order (DLIVORD) are highly correlated (|r|=0.876, 0.994), which imply that there is possible colinearity between two variables. Thus, both DTOTRORD and DLIVORD are excluded from the imputation model. 19 4.4 Diagnostic Procedures for Imputations of Gestational Age We created M=5 SRMIs, repeating the process with 10 iterations (seed=2010). We assume MAR, and estimate the propensity scores by fitting a logistic regression of the missing indicator on all variables in the imputation model without any missing values. The variables used in the propensity model are presented in Table 8. We apply two steps of our diagnostic method to 5 sets of SRMIs as follows: (1) Plot the gestational age (DGESTAT) vs. propensity scores (red: observed values, blue: imputed values). (2) Fit a linear regression model of gestational age (DGESTAT) on the propensity score (P) and the missing indicator (m) for 5 imputed data sets. Then the results from 5 data sets are combined to create one parameter estimates and corresponding standard errors by using PROC MIANALYZE in SAS 9.1 as the methods described earlier. Figure 7 provides a snapshot of the distributions of the observed and imputed gestational age against the propensity score. There is no significant difference between red and blue curves and the patterns of two sets of points are quite similar, although the variation of imputed values is slightly smaller than that of observed values. Results from the regression model are summarized in Table 9. All p-values of m are greater than 0.05, in other words, m has insignificant effect on the values of gestational age after adjusting for the propensity score. It implies that the imputation model we created enables to preserve the associations among all variables with gestational age given MAR and, thus, the missingness of DGESTAT can totally explained by the propensity score. By applying two steps of checking procedures, we can conclude that the 20 imputations under our imputation model can sufficiently reflect the true distribution of gestational age for those newborns with missing values. 21 Chapter 5: Discussion Very little attention has been given to the development of diagnostic techniques for multiple imputation (Abayomi, Gelman & Levy, 2008). The aim of this research is to develop diagnostic method based on the propensity score to identify potential problems with the imputations. We propose two steps of diagnostic method for imputations: (1) comparisons of the distributions of observed and imputed data against propensity scores, which are used to reveal differences between the observed and imputed data; and (2) fitting regression model of completed data on the propensity score and the missing indicator. In addition, we apply our method to the 2002 US public-use Natality data published by the NCHS. In simulation study, when the missing values are imputed by using incorrect imputation model, there are apparent differences between the conditional distributions of the observed and imputed Ys given the propensity score, and a significant association is found between the values of Y and the missing indicator (P<0.05). In contrast, when the correct or overfitted imputation model is used to generate MIs, the distributions of the observed and imputed data conditional on the propensity score are similar, and the values of Y are independent of m (P>0.05). These results suggest we can flag potential problems with the underlying imputation model that is used to create imputations. Additionally, the propensity scores estimated from the correct or overfitted propensity model are proved to be reliable and will not affect the diagnostic results. A recent study conducted by Abayomi, et al. (2008) considered diagnostics for imputations in three steps as described in the introduction. Our simulation study confirms 22 the limitation of their graphical methods. They can work well only for the extreme departures between observed and imputed values. However, as shown in Figure 4 and Figure 5, deviations can be expected under MAR and they do not necessarily indicate problems with the imputation model. Such deviation is due to the effect of other variables in the dataset on the probability of missingness. The key property of our diagnostic method is that the adjustment for the propensity score is sufficient to remove the effects of all other covariates that contribute to the probability of missingness. Therefore, assuming MAR, our graphical display conditioning on the propensity score is more robust than the marginal distribution of the completed data or the conditional distribution given only one variable as described in Abayomi?s research. In application to 2002 US Natality file, the results show the similarities of the conditional distributions given propensity score between observed and imputed gestational age and the insignificant effect of the missing indicator on gestational age. All of the results suggest the high quality of the imputations we create, that is, the missingness of gestational age can be totally explained by the propensity score. In Natality file, gestation age is subject to two problems: missing data and implausible data. Therefore, the imputation for US Natality data is complicated by the uncertainty about which records need to be imputed due to implausible values (Parker and Schenker, 2007). We simplified this issue by setting the records with over two-week difference between computed and clinical estimate as missing data. Attempts have been made by Parker and Schenker (2007) to use multiple imputation technique for imputing missing and implausible gestational age values. Multiple imputation is an appropriate technique to handle missing data, which takes into account both the relationships among the variables 23 and the uncertainty added from the imputation, thus it can yield more valid statistical results relating to gestational age in future analytical studies. We use SRMI, which is an extension of MI in which the missing values of each variable are imputed conditionally on all the other variables in the data set and the types of regression models used depend on the type of variable being imputed. Moreover, it can incorporate restrictions to a relevant subpopulation for some variables and logical bounds for the imputed data. In addition to the imputation techniques, to improve the quality of imputations, our imputation model includes variables from 10 categories with respects to both the newborn and the maternal characteristics. Because of these efforts, our diagnostic method identifies the imputations we create with high quality. The findings in this study contribute to the ongoing search to identify reasonable and reliable diagnostic techniques to check the quality of multiple imputation. An important assumption of these diagnostics is the missing at random. Nevertheless, in this study, the MAR assumption cannot be approved. Another limitation in this study is the nonlinear relationship between the values of the dependent variable and the propensity score. The scatter plots in Figure 6 show the curvilinear, rather than linear, relationship between Y and the propensity score. The future research can extend our method by using smoothing spline to model the relationship between Y and the propensity score. Furthermore, a quantitative test can be employed to numerically compare the conditional distribution of the observed and the imputed data given the propensity score. 24 Appendix Table 1: Summary of mean function of Y, correct, overfitted and incorrect imputation model. Mean Function ?y = Imputation Model, Y= Correct Overfitted Incorrect 1+X1+2*X2 ?0 +?1*X1+?2*X2 [X1,X2] ?0 +?1*X1+?2*X2+ ?+? 12*X12 [X1,X2,? , X 12] ?0 +?1*X1 [X1] 1+2* X1+2*X2+3*X1X2 ?0 +?1*X1+?2*X2+?3* X1X2 [X1,X2, X1X2] ?0 +?1*X1+?2*X2+?+ ?12*X12+?13* X1X2 [X1,X2, ?, X 12,X1X2] ?0 +?1*X1+?2*X2 [X1,X2] 1+2* X1+2*X2+3*X12 ?0 +?1*X1+?2*X2+?3* X12 [X1,X2, X12] ?0 +?1*X1+?2*X2+?+ ?12*X12+?13* X12 [X1,X2, ?, X 12, X12] ?0 +?1*X1+?2*X2 [X1,X2] 25 Table 2: Summary of mean function of Y, true propensity score function, percentage of missing of Y, correct and overfitted propensity model. Mean Function ?y = True Propensity Function logit (P)= M (%) Propensity Model logit (P)=logit (m=1|X1, X2,?, X p) Correct Overfitted 1+X1+2*X2 -2+X1+X2 15.8 ?0 +?1*X1+?2*X2 ?0 +?1*X1+?2*X2+ ?+? 12*X12 -2+3*X1+X2 26.0 X1 48.6 1+X1+X2 64.0 1+2* X1+2*X2+3*X1X2 -2+2* X1+2*X2+3*X1X2 23.6 ?0 +?1*X1+?2*X2+?3* X1X2 ?0 +?1*X1+?2*X2+?+ ? 12*X12+?13* X1X2 1+3* X 1-3*X2-4*X1X2 54.2 1+2* X1+3*X2-4*X1X2 63.6 1+2* X1+2*X2+3*X12 -4+3* X1+X2+X12 16.2 ?0 +?1*X1+?2*X2+?3* X12 ?0 +?1*X1+?2*X2+?+ ? 12*X12+?13* X12 -1+5* X1+1*X2-2*X12 33.8 2+X1+X2-2*X12 59.0 26 Table 3: Regression of Y (Yobs+Yimp) on the propensity score (p-score) and the missing indicator (m): three imputation models are fitted to impute missing data, and two propensity models are used to estimate p-scores. Imputation Model Propensity Model Model 1*: Correct Model 2*: Overfitted Model 3*: Incorrect Model 4*: Overfitted Mean Function M% Variable ? p-value ? p-value ? p-value ? p-value Y=1+X1+2*X2+? 15.8% p-score 11.76 <0.0001 11.71 <0.0001 9.57 <0.0001 10.42 <0.0001 m 0.07 0.7456 0.10 0.6972 -1.33 0.0005 0.13 0.7153 26.0% p-score 4.56 <0.0001 4.52 <0.0001 3.68 <0.0001 4.23 <0.0001 m -0.09 0.7828 0.01 0.9738 -1.01 0.0050 0.07 0.8533 48.6% p-score 5.43 <0.0001 5.32 <0.0001 4.44 0.0007 4.78 <0.0001 m -0.02 0.9480 0.08 0.7777 0.10 0.7643 -0.05 0.8258 64.0% p-score 7.72 <0.0001 7.95 <0.0001 3.59 <0.0001 7.25 <0.0001 m 0.04 0.8304 0.09 0.7344 -0.83 0.0101 -0.04 0.8343 Y=1+2*X1+2*X2+3*X1*X2+? 23.6% p-score 10.31 <0.0001 10.27 <0.0001 5.59 <0.0001 10.01 <0.0001 m 0.15 0.7518 0.04 0.9363 -2.34 0.0044 0.15 0.7520 54.2% p-score -4.35 <0.0001 -4.33 <0.0001 1.20 0.4018 -4.24 <0.0001 m -0.07 0.9005 0.05 0.9401 1.78 0.0279 0.00 0.9993 63.6% p-score 0.73 0.3903 0.74 0.3909 2.53 0.0102 0.81 0.3583 m 0.05 0.9434 0.13 0.8560 1.84 0.0818 -0.06 0.9381 Y=1+2*X1+2*X2+3*X12+? 16.2% p-score 13.54 <0.0001 13.31 <0.0001 2.58 0.0095 13.25 <0.0001 m -0.18 0.8152 -0.29 0.7128 -3.16 0.0001 -0.14 0.8564 33.8% p-score 5.33 <0.0001 5.25 <0.0001 3.92 0.0018 5.13 <0.0001 m 0.08 0.9137 0.11 0.8747 0.09 0.9578 0.06 0.9308 59.0% p-score -5.74 <0.0001 -5.84 <0.0001 -2.89 0.0640 -5.55 <0.0001 m -0.01 0.9844 -0.05 0.9347 3.57 <0.0001 0.05 0.9417 Model 1: Imputations are created by fitting correct imputation model. Propensity score is estimated by fitting correct model Model 2: Imputations are created by fitting overfitted imputation model. Propensity score is estimated by fitting correct model Model 3: Imputations are created by fitting incorrect imputation model. Propensity score is estimated by fitting correct model Model 4: Propensity score is estimated by fitting overfitted propensity model. Imputations are created by fitting correct imputation model. 27 Table 4: Correlation matrix among all continuous variables. BIRWT DMAGE DFAGE FMAPS WTGAIN NPREVIS CIGAR DRINK MEDRK NEWBN LABOR OBSTET CONGN BIRWT 1.000 **0.068 **0.040 **0.278 **0.175 **0.107 **-0.092 -0.007 **-0.101 **-0.163 **-0.074 **0.029 **-0.046 DMAGE 1.000 **0.758 0.003 **-0.072 **0.107 **-0.062 **0.020 **0.052 0.007 **0.013 *0.012 -0.006 DFAGE 1.000 -0.009 **-0.063 **0.062 **-0.042 **0.024 **0.042 0.006 0.009 0.005 0.000 FMAPS 1.000 **0.026 **0.067 -0.009 0.005 **-0.062 **-0.231 **-0.131 -0.010 **-0.104 WTGAIN 1.000 **0.092 **-0.022 0.008 **-0.015 **-0.016 **0.034 **0.036 **-0.018 NPREVIS 1.000 **-0.044 **-0.041 -0.003 **-0.048 **-0.025 **0.032 **-0.023 CIGAR 1.000 **0.090 **0.041 **0.017 **0.016 *0.012 0.004 DRINK 1.000 0.004 0.005 0.004 -0.001 -0.001 MEDRK 1.000 **0.181 **0.176 **0.196 **0.042 NEWBN 1.000 **0.196 **0.084 **0.117 LABOR 1.000 **0.176 **0.039 OBSTET 1.000 **0.026 CONGN 1.000 * P<0.01 ** 0.01|r| under H0: Rho=0 NLBNL NLBND NOTERM DTOTORD DLIVORD NLBNL 1.0000 0.0691 <.0001 0.1451 <.0001 0.8756 <.0001 0.9937 <.0001 NLBND 1.0000 0.0425 <.0001 0.1492 <.0001 0.1691 <.0001 NOTERM 1.0000 0.5474 <.0001 0.1479 <.0001 DTOTORD 1.0000 0.8817 <.0001 DLIVORD 1.0000 28 Table 6: Categorical variables included in the imputation model. Variable Name Definition of Variable Category Definition of Categories Freq Percent RESTATUS Residents status 1 Residents 30106 74.75 2 Intrastate nonresidents 9230 22.92 3 Interstate nonresidents 884 2.19 4 Foreign residents 54 0.13 PLDEL3 Place of delivery 1 In hospital 39909 99.09 2 Not in a hospital 363 0.90 . Unknown or not stated 2 0.00 REGNRES Region of residence 0 Foreign residents 54 0.13 1 Northeast 6786 16.85 2 Midwest 8767 21.77 3 South 14813 36.78 4 West 9853 24.46 CITRSPOP Population size of city of residence 0 >=1,000,000 3500 8.69 1 Place of 500,000 to 1,000,000 1821 4.52 2 Place of 250,000 to 500,000 3085 7.66 3 Place of 100,000 to 250,000 3776 9.38 9 All other areas in the U.S. 28038 69.62 z Foreign residents 54 0.13 METRORES Metropolitan 1 Metropolitan county 33209 82.46 2 Nonmetropolitan county 7011 17.41 z Foreign residents 54 0.13 CNTRSPOP Population size of county of residence 0 >=1,000,000 10243 25.43 1 Place of 500,000 to 1,000,000 7528 18.69 2 Place of 250,000 to 500,000 6113 15.18 3 Place of 100,000 to 250,000 6330 15.72 9 All other areas in the U.S. 10006 24.84 z Foreign residents 54 0.13 MRACE3 Race of Mother 1 White 31861 79.11 2 Races other than White or Black 2554 6.34 3 Black 5859 14.55 MEDUC6 Education of mother 1 0 ? 8 years 2397 5.95 2 9 ? 11 years 6061 15.05 3 12 years 12324 30.60 4 13 ? 15 years 8651 21.48 5 16 years and over 10288 25.55 . Not stated 553 1.37 DMAR Marital status of mother 1 Married 26768 66.46 2 Unmarried 13506 33.54 . Unknown or not stated 0 0.00 MPLBIRR Place of birth of mother 1 Native born 30712 76.26 2 Foreign born 9476 23.53 . Unknown or not stated 86 0.21 29 Table 6 (cont.): Categorical variables included in the imputation model. Variable Name Definition of Variable Category Definition of Categories Freq Percent ADEQUACY Adequacy of care 1 Adequate 29464 73.16 2 Intermediate 7179 17.83 3 Inadequate 2018 5.01 . Unknown 1613 4.01 MPRE5 Month prenatal care began 1 1st trimester 33040 82.04 2 2nd trimester 5017 12.46 3 3rd trimester 1056 2.62 4 No prenatal care 374 0.93 . Unknown or not stated 787 1.95 DFRACE4 Race of father 1 White 28186 69.99 2 Races other White, Black or unknown 2107 5.23 3 Black 4333 10.76 . Unknown or not stated 5648 14.02 CSEX Sex 1 Male 20544 51.01 2 Female 19730 48.99 DPLURAL Plurality 1 Single 38940 96.69 2 Twin 1267 3.15 3 Triplet 62 0.15 4 Quadruplet 5 0.01 5 Quintuplet or higher 0 0.00 DELMETH5 Method of delivery 1 Vaginal (excludes vaginal after previous C-section) 28974 71.94 2 Vaginal birth after previous C- section 589 1.46 3 Primary C-section 6328 15.71 4 Repeat C-section 4130 10.25 . Not stated 253 0.63 NLBNL Number of live birth, now living 0 No live birth, now living 16166 40.14 1 One live birth, now living 13303 33.03 2 Two live births, now living 6574 16.32 3 Three live births, now living 2585 6.42 4 Four live births, now living 894 2.22 5 Five live births, now living 349 0.87 6 Six live births, now living 155 0.38 7 Seven live births, now living 86 0.21 8 Eight live births, now living 31 0.08 9 Nine live births, now living 21 0.05 10 Ten live births, now living 14 0.03 11 Eleven live births, now living 8 0.02 12 Twelve live births, now living 5 0.01 13 Thirteen live births, now living 2 0 . Not stated 81 0.2 30 Table 6 (cont.): Categorical variables included in the imputation model. Variable Name Definition of Variable Category Definition of Categories Freq Percent NLBND Number of live births, now dead 0 No live birth, now dead 39515 98.12 1 One live birth, now dead 549 1.36 2 Two live births, now dead 73 0.18 3 Three live births, now dead 15 0.04 4 Four live births, now dead 7 0.02 5 Five live births, now dead 1 0 6 Six live births, now dead 1 0 9 Nine live births, now dead 3 0.01 . Not stated 110 0.27 NOTERM Number of other termination 0 No other termination 30572 75.91 1 One other termination 6463 16.05 2 Two other terminations 2062 5.12 3 Three other terminations 682 1.69 4 Four other terminations 226 0.56 5 Five other terminations 76 0.19 6 Six other terminations 44 0.11 7 Seven other terminations 12 0.03 8 Eight other terminations 9 0.02 9 Nine other terminations 3 0.01 10 Ten other terminations 3 0.01 . Not stated 121 0.3 MEDRK1 Total number of medical risks 0 No medical risk 27767 68.95 1 One medical risk 9721 24.14 2 Two medical risks 1975 4.9 3 Three medical risks 399 0.99 4 Four medical risks 58 0.14 5 Five medical risks 15 0.04 6 Six medical risks 4 0.01 . Not stated 335 0.83 OBSTET2 Total number of abnormal conditions 0 No abnormal condition 2816 6.99 1 One abnormal condition 8062 20.02 2 Two abnormal conditions 17251 42.83 3 Three abnormal conditions 9686 24.05 4 Four abnormal conditions 2063 5.12 5 Five abnormal conditions 193 0.48 6 Six abnormal conditions 11 0.03 . Not stated 192 0.48 CONGN3 Total number of congenital anomalies 0 No congenital anomaly 39276 97.52 1 One congenital anomaly 341 0.85 2 Two congenital anomalies 38 0.09 3 Three congenital anomalies 6 0.01 4 Four congenital anomalies 3 0.01 . Not stated 610 1.51 31 Table 6 (cont.): Categorical variables included in the imputation model. Variable Name Definition of Variable Category Definition of Categories Freq Percent NEWBN4 Total number of newborn complications 0 No newborn complication 36994 91.86 1 One newborn complication 2545 6.32 2 Two newborn complications 341 0.85 3 Three newborn complications 60 0.15 4 Four newborn complications 8 0.02 . Not stated 326 0.81 LABOR5 Total number of labor complications 0 No labor complication 27159 67.44 1 One labor complication 10118 25.12 2 Two labor complications 2307 5.73 3 Three labor complications 368 0.91 4 Four labor complications 68 0.17 5 Five labor complications 12 0.03 6 Six labor complications 2 0 . Not stated 240 0.6 MEDRISK1: medical risk variables include anemia, cardiac disease, acute or chronic lung disease, etc. OBSTETRIC2: obstetric procedures include amniocentesis, electronic fetal monitor, induction of labor, etc. CONGNTL3: congenital anomalies include anencephalus, spina bifida, hydrocephalus, microcephalus, etc. NEWBORN4: newborn complications include anemia, birth injury, fetal alcohol, etc. LABCOMP5: labor complications include febrile, meconium, premature rupture of membrane, etc. Table 7: Continuous, count or mixed variables included in the imputation model. Variable Name Definition of Variable N N Miss Mean Std Dev Min Max Miss% Continuous variable DBIRWT Birth weight of child (gram) 40240 34 3297.12 604.87 227 6039 0.08% DMAGE Age of mother 40274 0 27.35 6.2 12 52 0.00% DFAGE Age of father 34936 5338 30.49 6.85 14 72 13.25% FMAPS Apgar score 31106 9168 8.91 0.75 0 10 22.76% WTGAIN Weight gain (lb) 32784 7490 30.89 13.8 0 98 18.60% Count variable NPREVIS Total number of prenatal visits 39239 1035 11.55 3.99 0 49 2.57% Mixed variable CIGAR Number of cigars/day 34318 5956 1.08 3.88 0 70 14.79% DRINK Number of drinks/week 34659 5615 0.04 0.71 0 84 13.94% 32 Table 8: Variables included in the propensity model and percent missing. Variable Definition of Variable Missing % Included Categorical Variables RESTATUS Residence status - Y PLDEL3 Place of delivery - Y REGNRES Region of residence - Y CITRSPOP Population size of city of residence - Y CNTRPOP Population size of county of residence - Y METRORES Metropolitan of residence - Y CSEX Sex of child - Y DPLURAL Plurality - Y MARCE3 Race of mother - Y MEDUC6 Education of mother 1.37% N DMAR Marital status of mother - Y MPLBIRR Place of birth of mother 0.21% N ADEQUACY Adequacy of care 4.01% N MPRE5 Month prenatal care began 1.95% N DFRACE Race of father 14.02% N DELMETH5 Method of delivery 0.63% N NLBNL Number of live births, now living 0.20% N NLBND Number of live births, now dead 0.27% N NOTERM Number of other termination 0.30% N MEDRK Total number of medical risks 0.83% N NEWBN Total number of newborn complications 0.81% N LABOR Total number of labor complications 0.60% N OBSTET Total number of abnormal conditions 0.48% N CONGN Total number of congenital anomalies 1.51% N Continuous Variables DBIRWT Birth weight of child 0.08% N DMAGE Age of mother - Y DFAGE Age of father 13.25% N FMAPS Apgar score 22.76% N WTGAIN Weight gain 18.60% N Count Variable NPREVIS Total number of prenatal visits 2.57% N Mixed Variables CIGAR Number of cigars/day 14.79% N DRINK Number of drinks/week 13.94% N 33 Table 9: Point estimates (Standard Errors) and p-values of linear regression coefficients for model of gestational age for complete cases, only variables without any missing values included in the propensity model. Imputation Propensity score Missing indicator ? (SE) p-value ? (SE) p-value 1 -1.94 (0.26) <0.0001 0.02 (0.03) 0.5193 2 -1.87 (0.26) <0.0001 0.01 (0.03) 0.7955 3 -1.95 (0.26) <0.0001 0.01 (0.03) 0.8120 4 - 1.99 (0.26) <0.0001 0.02 (0.03) 0.4176 5 - 1.97 (0.26) <0.0001 0.01 (0.03) 0.7483 Summary - 1.95 (0.26) <0.0001 0.01 (0.03) 0.6619 34 Figure 1: Scatter plots of true propensity score versus estimated propensity score. Propensity score is estimated by fitting correct model. Y=1+X1+2*X2+? m%=15.8% m%=26.0% m%=48.6% m%=64.0% Y=1+2*X1+2*X2+3*X1X2+? m%=23.6% m%=54.2% m%=63.6% Y=1+2*X1+2*X2+3*X12+? m%=16.2% m%=33.8% m%=59.0% m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 35 Figure 2: Scatter plots of true propensity score versus estimated propensity score. Propensity score is estimated by fitting overfitted model. Y=1+X1+2*X2+? m%=15.8% m%=26.0% m%=48.6% m%=64.0% Y=1+2*X1+2*X2+3*X1X2+? m%=23.6% m%=54.2% m%=63.6% Y=1+2*X1+2*X2+3*X12+? m%=16.2% m%=33.8% m%=59.0% m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Est P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 True P-Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 36 Figure 3: Scatter plots of propensity score from correct model versus propensity score from overfitted model. Y=1+X1+2*X2+? m%=15.8% m%=26.0% m%=48.6% m%=64.0% Y=1+2*X1+2*X2+3*X1X2+? m%=23.6% m%=54.2% m%=63.6% Y=1+2*X1+2*X2+3*X12+? m%=16.2% m%=33.8% m%=59.0% P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 P -S c ore(True) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 .7 0 .8 .9 1 .0 P -S c ore(O verfitted) 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1 37 Figure 4: Distribution of completed Y vs. X1. m% Y=1+X1+2*X2+? 15.8% Y vs. X1 26.0% Y vs. X1 48.6% Y vs. X1 64.0% Y vs. X1 (a)* [X1, X2] (b)* [X1] (c)* (a) The observed and imputed Ys are plotted versus X1. Regression lines are produced with I=R operand to show the relationship between Ys and X1. Correct imputation models are fitted to generate the imputations of Y. (b) The observed and imputed Ys are plotted against X1. Incorrect imputation models are fitted to create imputations of Y. (c) The true Ys are plotted against X1 at two levels of the missingness (red: m=0, blue: m=1). m 0 1 y1 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y1 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y1 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y1 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y1 -8 -7 -6 -5 -4 -3 -2 -1 0 1 3 4 5 6 7 8 9 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y1 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 4 5 6 7 8 9 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y1 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y1 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 x1 -4 -3 -2 -1 0 1 2 3 m 0 1 y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 x1 -4 -3 -2 -1 0 1 2 3 38 Figure 4 (Cont.): Distribution of completed Y vs. X2. m% Y=1+2*X1+2*X2+3*X1X2+? 23.6% Y vs. X2 54.2% Y vs. X2 63.6% Y vs. X2 (a)* [X1, X2, X1X2] (b)* [X1, X2] (c)* Y=1+2*X1+2*X2+3*X12+? 16.2% Y vs. X2 33..8% Y vs. X2 59.0% Y vs. X2 (a)* [X1, X2, X12] (b)* [X1, X2] (c)* (a) The observed and imputed Ys are plotted versus X1. Regression lines are produced with I=R operand to show the relationship between Ys and X1. Correct imputation models are fitted to generate the imputations of Y. (b) The observed and imputed Ys are plotted against X1. Incorrect imputation models are fitted to create imputations of Y. (c) The true Ys are plotted against X1 at two levels of the missingness (red: m=0, blue: m=1). m 0 1 y1 -20 -10 0 10 20 30 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 -20 -10 0 10 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 -20 -10 0 10 20 30 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 -20 -10 0 10 20 30 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y -20 -10 0 10 20 30 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 -20 -10 0 10 20 30 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 -10 0 10 20 30 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y -20 -10 0 10 20 30 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 -10 0 10 20 30 40 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 -10 0 10 20 30 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y -10 0 10 20 30 40 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 -10 0 10 20 30 40 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 -20 -10 0 10 20 30 40 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y -10 0 10 20 30 40 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 - 1 0 0 10 20 30 40 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y1 - 2 0 - 1 0 0 10 20 30 40 x2 -3 -2 -1 0 1 2 3 4 m 0 1 y - 1 0 0 10 20 30 40 x2 -3 -2 -1 0 1 2 3 4 39 Figure 5: Histograms of observed Y and imputed Y. m% Y=1+X1+2*X2+? 15.8% 26.0% 48.6% 64.0% (a)* [X1, X2] (b)* [X1] (c)* (a) Histogram with Kernel Curve is plotted to show the distribution of Ys (Top: Observed Ys; Bottom: Imputed Ys): correct imputation model is fitted to create imputations. (b) Histogram of Y (Top: Observed Ys; Bottom: Imputed Ys): incorrect imputation model is fitted to create imputations. (c) Histogram of Y at two levels of missingness (Top: m=0, Bottom: m=1). 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -8.4 -6.8 -5.2 -3.6 -2 -0.4 1.2 2.8 4.4 6 7.6 9.2 10.8 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y1 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -8.4 -6.8 -5.2 -3.6 -2 -0.4 1.2 2.8 4.4 6 7.6 9.2 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y1 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -8.4 -6.8 -5.2 -3.6 -2 -0.4 1.2 2.8 4.4 6 7.6 9.2 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -8.4 -6.8 -5.2 -3.6 -2 -0.4 1.2 2.8 4.4 6 7.6 9.2 10.8 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y1 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -11.6 -10 -8.4 -6.8 -5.2 -3.6 -2 -0.4 1.2 2.8 4.4 6 7.6 9.2 10.8 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y1 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -8.4 -6.8 -5.2 -3.6 -2 -0.4 1.2 2.8 4.4 6 7.6 9.2 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 P e r c e n t 0 -9.5 -7.5 -5.5 -3.5 -1.5 0.5 2.5 4.5 6.5 8.5 10.5 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 P e r c e n t 1 y1 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -9.5 -7.5 -5.5 -3.5 -1.5 0.5 2.5 4.5 6.5 8.5 10.5 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y1 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 P e r c e n t 0 -8.5 -6.5 -4.5 -2.5 -0.5 1.5 3.5 5.5 7.5 9.5 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 P e r c e n t 1 y 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -8.8 -7.2 -5.6 -4 -2.4 -0.8 0.8 2.4 4 5.6 7.2 8.8 10.4 12 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y1 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -10.4 -8.8 -7.2 -5.6 -4 -2.4 -0.8 0.8 2.4 4 5.6 7.2 8.8 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y1 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 0 -8.8 -7.2 -5.6 -4 -2.4 -0.8 0.8 2.4 4 5.6 7.2 8.8 0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 P e r c e n t 1 y 40 Figure 5 (Cont.): Histograms of observed Y and imputed Y. m% Y=1+2*X1+2*X2+3*X1X2+? 23.6% 54.2% 63.6% (a)* [X1, X2, X1X2] (b)* [X1, X2] (c)* Y=1+2*X1+2*X2+3*X12+? 16.2% 33.8% 59.0% (a)* [X1, X2, X12] (b)* [X1, X2] (c)* (a) Histogram with Kernel Curve is plotted to show the distribution of Ys (Top: Observed Ys; Bottom: Imputed Ys): correct imputation model is fitted to create imputations. (b) Histogram of Y (Top: Observed Ys; Bottom: Imputed Ys): incorrect imputation model is fitted to create imputations. (c) Histogram of Y at two levels of missingness (Top: m=0, Bottom: m=1). 0 5 10 15 20 25 30 35 P e r c e n t 0 -20.25 -14.25 -8.25 -2.25 3.75 9.75 15.75 21.75 27.75 0 5 10 15 20 25 30 35 P e r c e n t 1 y1 0 5 10 15 20 25 30 35 P e r c e n t 0 -20.25 -17.25 -14.25 -11.25 -8.25 -5.25 -2.25 0.75 3.75 6.75 9.75 0 5 10 15 20 25 30 35 P e r c e n t 1 y1 0 5 10 15 20 25 30 35 P e r c e n t 0 -20.25 -15.75 -11.25 -6.75 -2.25 2.25 6.75 11.25 15.75 20.25 24.75 0 5 10 15 20 25 30 35 P e r c e n t 1 y 0 5 10 15 20 25 30 35 P e r c e n t 0 -22 -18 -14 -10 -6 -2 2 6 10 14 18 22 26 0 5 10 15 20 25 30 35 P e r c e n t 1 y1 0 5 10 15 20 25 30 P e r c e n t 0 -14 -10 -6 -2 2 6 10 14 18 22 26 0 5 10 15 20 25 30 P e r c e n t 1 y1 0 5 10 15 20 25 30 35 P e r c e n t 0 -22 -18 -14 -10 -6 -2 2 6 10 14 18 22 26 0 5 10 15 20 25 30 35 P e r c e n t 1 y 0 20 30 40 50 P e r c e n t 0 -22 -18 -14 -10 -6 -2 2 6 10 14 18 22 26 0 10 20 30 40 50 P e r c e n t 1 y1 0 20 30 40 50 P e r c e n t 0 -12 -10 -8 -6 -4 -2 0 2 4 6 8 10 12 14 16 18 20 22 24 26 0 10 20 30 40 50 P e r c e n t 1 y1 0 10 20 30 40 50 P e r c e n t 0 -22 -18 -14 -10 -6 -2 2 6 10 14 18 22 26 0 10 20 30 40 50 P e r c e n t 1 y 0 5 10 15 20 25 30 P e r c e n t 0 -8 -4 0 4 8 12 16 20 24 28 32 36 40 0 5 10 15 20 25 30 P e r c e n t 1 y1 0 5 10 15 20 25 30 P e r c e n t 0 -14 -10 -6 -2 2 6 10 14 18 22 26 30 0 5 10 15 20 25 30 P e r c e n t 1 y1 0 5 10 15 20 25 30 P e r c e n t 0 -8 -4 0 4 8 12 16 20 24 28 32 36 40 0 5 10 15 20 25 30 P e r c e n t 1 y 0 5 10 15 20 25 30 35 0 -8.75 -3.75 1.25 6.25 11.25 16.25 21.25 26.25 31.25 36.25 0 5 10 15 20 25 30 35 P e r c e n t 1 y1 0 5 10 15 20 25 30 35 r c e n t 0 -16.25 -11.25 -6.25 -1.25 3.75 8.75 13.75 18.75 23.75 28.75 33.75 0 5 10 15 20 25 30 35 P e r c e n t 1 y1 0 5 10 15 20 25 30 35 P e r c e n t 0 -8.75 -3.75 1.25 6.25 11.25 16.25 21.25 26.25 31.25 36.25 0 5 10 15 20 25 30 35 P e r c e n t 1 y 0 5 10 15 20 25 30 35 P e r c e n t 0 -8.75 -3.75 1.25 6.25 11.25 16.25 21.25 26.25 31.25 36.25 0 5 10 15 20 25 30 35 P e r c e n t 1 y1 0 5 10 15 20 25 30 35 P e r c e n t 0 -16.25 -11.25 -6.25 -1.25 3.75 8.75 13.75 18.75 23.75 28.75 33.75 0 5 10 15 20 25 30 35 P e r c e n t 1 y1 0 5 10 15 20 25 30 35 P e r c e n t 0 -8.75 -3.75 1.25 6.25 11.25 16.25 21.25 26.25 31.25 36.25 0 5 10 15 20 25 30 35 P e r c e n t 1 y 41 Figure 6: Distribution of completed Y versus the propensity score. m% Y=1+X1+2*X2+? 15.8% 26.0% 48.6% 64.0% (a)* [X1, X2] (b)* [X1, X2] (c)* [X1,X2,?,X 12] (d)* [X1] (e)* (a) Correct imputation model and correct propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp) with smooth curves plotted to indicate a possible nonlinear relationship between Ys and the propensity score. (b) Correct imputation model and overfitted propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp). (c) Overfitted imputation model and correct propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp). (d) Incorrect imputation model and correct propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp). (e) True Ys are plotted versus the propensity scores from correct model at two levels of the missingness (red: m=0, blue: m=1) with smooth curves plotted. m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1.0 m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 1.0 m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 42 Figure 6 (cont.): Distribution of completed Y versus the propensity score. m% Y=1+2*X1+2*X2+3*X1X2+? 23.6% 54.2% 63.6% (a)* [X1, X2, X1X2] (b)* [X1, X2, X1X2] (c)* [X1, X2,?, X 12, X1X2] (d)* [X1, X2] (e)* (a) Correct imputation model and correct propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp) with smooth curves plotted to indicate a possible nonlinear relationship between Ys and the propensity score. (b) Correct imputation model and overfitted propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp). (c) Overfitted imputation model and correct propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp). (d) Incorrect imputation model and correct propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp). (e) True Ys are plotted versus the propensity scores from correct model at two levels of the missingness (red: m=0, blue: m=1) with smooth curves plotted. m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -2 0 -1 0 0 10 20 30 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y -20 -10 0 10 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -3 0 -2 0 -1 0 0 10 20 30 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -2 0 -1 0 0 10 20 30 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -20 -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 43 Figure 6 (Cont.): Distribution of completed Y versus the propensity score. m% Y=1+2*X1+2*X2+3*X12+? 16.2% 33.8% 59.0% (a)* [X1, X2, X12] (b)* [X1, X2, X12] (c)* [X1, X2,?,X 12, X12] (d)* [X1, X2] (e)* (a) Correct imputation model and correct propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp) with smooth curves plotted to indicate a possible nonlinear relationship between Ys and the propensity score. (b) Correct imputation model and overfitted propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp). (c) Overfitted imputation model and correct propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp). (d) Incorrect imputation model and correct propensity model: observed and imputed Ys are plotted versus the propensity score (red: Yobs, blue: Yimp). (e) True Ys are plotted versus the propensity scores from correct model at two levels of the missingness (red: m=0, blue: m=1) with smooth curves plotted. m 0 1 Y -10 0 10 20 30 40 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -10 0 10 20 30 40 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -1 0 0 10 20 30 40 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y -10 0 10 20 30 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -10 0 10 20 30 40 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -10 0 10 20 30 40 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -10 0 10 20 30 40 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -1 0 0 10 20 30 40 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y -20 -10 0 10 20 30 40 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y -10 0 10 20 30 40 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 m 0 1 Y - 10 0 10 20 30 40 Pr opensi t y Sco r e 0. 0 0. 1 0. 2 0. 3 0. 4 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 m 0 1 Y -10 0 10 20 30 40 Propensity Score 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Y -1 0 0 10 20 30 40 P ro p e n s ity S c o re 0 .0 0 .1 0 .2 0 .3 0 .4 0 .5 0 .6 0 .7 0 .8 0 .9 1 .0 m 0 1m 0 1 Y - 20 - 10 0 10 20 30 40 Pr opensi t y Sco r e 0. 0 0. 1 0. 2 0. 3 0. 4 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 m 0 1 Y - 10 0 10 20 30 40 Pr opensi t y Sco r e 0. 0 0. 1 0. 2 0. 3 0. 4 0. 5 0. 6 0. 7 0. 8 0. 9 1. 0 44 Figure 7: Plots of gestational age (observed+imputed) versus propensity score: imputations are created by using Sequential Regression Multiple Imputation method. Only variables without missing value are included in the propensity model. 1 4 2 5 3 G e s t a t i o n a l A g e 20 30 40 50 P r o p e n s i t y S c o r e 0 . 1 0 0 . 1 2 0 . 1 4 0 . 1 6 0 . 1 8 0 . 2 0 0 . 2 2 0 . 2 4 0 . 2 6 0 . 2 8 0 . 3 0 0 . 3 2 0 . 3 4 _ m u l t _ = 1 m 0 1 G e s t a t i o n a l A g e 20 30 40 50 P r o p e n s i t y S c o r e 0 . 1 0 0 . 1 2 0 . 1 4 0 . 1 6 0 . 1 8 0 . 2 0 0 . 2 2 0 . 2 4 0 . 2 6 0 . 2 8 0 . 3 0 0 . 3 2 0 . 3 4 0 . 3 6 _ m u l t _ = 4 m 0 1 G e s t a t i o n a l A g e 20 30 40 50 P r o p e n s i t y S c o r e 0 . 1 0 0 . 1 2 0 . 1 4 0 . 1 6 0 . 1 8 0 . 2 0 0 . 2 2 0 . 2 4 0 . 2 6 0 . 2 8 0 . 3 0 0 . 3 2 _ m u l t _ = 2 m 0 1 G e s t a t i o n a l A g e 20 30 40 50 P r o p e n s i t y S c o r e . 0 0 . 1 2 0 . 1 4 0 . 1 6 0 . 1 8 0 . 2 0 0 . 2 2 0 . 2 4 0 . 2 6 0 . 2 8 0 . 3 0 0 . 3 2 0 . 3 4 0 . 3 6 _ m u l t _ = 4 m 0 1 G e s t a t i o n a l A g e 20 30 40 50 P r o p e n s i t y S c o r e 0 . 1 0 0 . 1 2 0 . 1 4 0 . 1 6 0 . 1 8 0 . 2 0 0 . 2 2 0 . 2 4 0 . 2 6 0 . 2 8 0 . 3 0 0 . 3 2 0 . 3 4 0 . 3 6 _ m u l t _ = 3 m 0 1 45 Bibliography Abayomi, K., Gelman, A., & Levy, M. (2008). Diagnostics for multivariate imputations. Journal of the Royal Statistical Society, 57: 273-291. Alexander, G.R. & Allen, M.C. (1996). Conceptualization, measurement, and use of gestational age. I. Clinical and Public Health Practice. J Perinatal , 16:53?9. Barnard, J. & Meng, X. (1999). Application of multiple imputation in medical studies: From AIDS to NHANES. Statistical Methods in Medical Research, 8: 17-36. Hammad, H.T. (2009). Thesis: Identification of factors that relate to gestational age in term and preterm babies using 2002 National Birth Data. Little, R.J & Rubin, D.B. (1989). The analysis of social science data with missing values. Sociological Methods & Research, 18: 292-326. Little, R.J. & Rubin,D.B. (2002). Statistical Analysis with missing data. 2nd edn. New York: john Wiley & Sons. Little, R.J. (1988). A test of missing completely at random for multivariate data with missing values. Journal of the American Statistical Association, 83: 1198-1202. Martin, J.A., Hamilton, B.E., Sutton, P.D., Ventura, S.J., Menacker, F., & Munson, M.L. (2003). Births: Final data for 2002. National vital statistics reports, vol 52 no 10. Hyattsville, Maryland: National Center for Health Statistics. National Center for Health Statistics. (2002). Natality 2002. Hyattsville, MD: National Center for Health Statistics. Parker, J. & Schenker, N. (2007). Multiple imputation for national public-use datasets and its possible application for gestational age in United States Natality files. Paediatric & Perinatal Epidemiology, 21: 97-105. Raghunathan, T.E., Lepkowski, J.M., Hoewyk, J.V., & Solenberger, P. (2001). a multivariate technique for multiple imputing missing values using a sequence of regression models. Suvery Methodology, 27: 85-95. Rosenbaum, P.R. & Rubin, D.B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70: 41-55. Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. New York: Wiley. Rubin, D.B. (1976). Inference and Missing Data. Biometrika, 63: 581-592. Sinharay, S., Stern, H.S., & Russell, D. (2001). The use of multiple imputation for the analysis of missing data. Psychological methods, 6: 3317-329.