ABSTRACT

Title of Dissertation: STATISTICAL ESTIMATION METHODS IN VOLUNTEER PANEL WEB SURVEYS

Sunghee Lee, Ph.D., 2004

Dissertation Directed By: Professor Richard Valliant, Joint Program in Survey Methodology

Data collected through Web surveys, in general, do not adopt traditional probability-based sample designs. Therefore, the inferential techniques used for probability samples are not guaranteed to be correct for Web surveys without adjustment, and estimates from these surveys are likely to be biased. However, research on the statistical aspects of Web surveys is lacking relative to other aspects of Web surveys. Propensity score adjustment (PSA) has been suggested as an alternative for statistically surmounting inherent problems, namely nonrandomized sample selection, in volunteer Web surveys. However, there has been minimal evidence for its applicability and performance, and the implications are not conclusive. Moreover, PSA does not take into account problems arising from the uncertain coverage of sampling frames in volunteer panel Web surveys.

This study attempted to develop alternative statistical estimation methods for volunteer Web surveys and to evaluate their effectiveness in adjusting for biases arising from nonrandomized selection and unequal coverage in volunteer Web surveys. Specifically, the proposed adjustment used a two-step approach: first, PSA was used to correct for nonrandomized sample selection, and second, calibration adjustment was used to correct for uncertain coverage of the sampling frames.

The investigation found that the proposed estimation methods showed potential for reducing selection and coverage bias in estimates from volunteer panel Web surveys. The combined two-step adjustment reduced not only bias but also mean square error to a greater degree than either individual adjustment. While the findings from this study may shed some light on Web survey data utilization, there are additional areas to be considered and explored. First, the proposed adjustment decreased bias but did not completely remove it. Second, the adjusted estimates showed larger variability than the unadjusted ones. Finally, the adjusted estimator is no longer linear, and an appropriate variance estimator has not yet been developed; naively applying the variance estimator for linear statistics greatly overestimated the variance, understating the efficiency of the survey estimates.

STATISTICAL ESTIMATION METHODS IN VOLUNTEER PANEL WEB SURVEYS

By Sunghee Lee

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2004

Advisory Committee:
Research Professor Richard Valliant, Chair
Research Professor J. Michael Brick
Research Associate Professor Michael P. Couper
Professor Partha Lahiri
Professor Robert Mislevy
Professor Trivellore E. Raghunathan

© Copyright by Sunghee Lee 2004

Acknowledgements

Collection of data in the Behavioral Risk Factor Surveillance Survey was funded in part through grant no. UR6/CCU517481-03 from the National Center for Health Statistics to the Michigan Center for Excellence in Health Statistics.

Table of Contents

Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: Web Survey Practice and Its Errors
  2.1 Types of Web Surveys
  2.2 Cyber Culture and Web Surveys
  2.3 Web Usage by Demographic Characteristics and Web Surveys
  2.4 Web Survey Errors
    2.4.1 Coverage Error
    2.4.2 Sampling Error
    2.4.3 Nonresponse Error
    2.4.4 Measurement Error
Chapter 3: Statement of Purpose and Work
Chapter 4: Application of Traditional Adjustments for Web Survey Data
  4.1 Introduction
  4.2 Data Source
    4.2.1 Web Survey Data
    4.2.2 Current Population Survey Data
    4.2.3 Variables of Interest and Covariates
  4.3 Nonresponse Error Adjustment
    4.3.1 Sample-level Ratio-raking Adjustment
    4.3.2 Multiple Imputation
  4.4 Coverage Error Adjustment
  4.5 Discussion
Chapter 5: Propensity Score Adjustment
  5.1 Introduction
  5.2 Treatment Effect in Observational Studies
    5.2.1 Theoretical Treatment Effect
    5.2.2 Inherent Problems of Treatment Effect Estimation in Observational Studies
  5.3 Bias Adjustment Using Auxiliary Information
    5.3.1 Covariates for Bias Adjustment
    5.3.2 Balancing Score
    5.3.3 Propensity Score
      5.3.3.1 Bias Reduction by Propensity Scores
      5.3.3.2 Assumptions in Propensity Score Adjustment
      5.3.3.3 Modeling Propensity Scores
    5.3.4 Other Adjustment Methods for Bias Reduction
  5.4 Methods for Applying Propensity Score Adjustment
    5.4.1 Matching by Propensity Scores
    5.4.2 Subclassification by Propensity Scores
    5.4.3 Covariance/Regression Adjustment by Propensity Scores
Chapter 6: Alternative Adjustments for Volunteer Panel Web Survey Data
  6.1 Problems in Volunteer Panel Web Surveys
  6.2 Adjustment to the Reference Survey Sample: Propensity Score Adjustment
  6.3 Adjustment to the Target Population: Calibration Adjustment
  6.4 Theory for Propensity Score Adjustment and Calibration Adjustment
    6.4.1 Stratification Model
    6.4.2 Regression Model
Chapter 7: Application of the Alternative Adjustments for Volunteer Panel Web Surveys
  7.1 Introduction
  7.2 Case Study 1: Application of Propensity Score Adjustment and Calibration Adjustment to 2002 General Social Survey Data
    7.2.1 Construction of Pseudo-population and Sample Selection for Simulation
    7.2.2 Propensity Score Adjustment
    7.2.3 Results of Propensity Score Adjustment
      7.2.3.1 Performance of Propensity Score Adjustment
        7.2.3.1.A Bias and Percent Bias Reduction
        7.2.3.1.B Root Mean Square Deviation and Percent Root Mean Square Deviation Reduction
        7.2.3.1.C Standard Error
      7.2.3.2 Effect of Covariates in Propensity Score Models
      7.2.3.3 Discussion
    7.2.4 Calibration Adjustment
    7.2.5 Results of Calibration Adjustment
      7.2.5.1 Performance of Calibration Adjustment
        7.2.5.1.A Root Mean Square Error and Percent Root Mean Square Error Reduction
        7.2.5.1.B Bias and Percent Bias Reduction
        7.2.5.1.C Standard Error and Percent Standard Error Reduction
      7.2.5.2 Discussion
  7.3 Case Study 2: Application of Propensity Score Adjustment and Calibration Adjustment to 2003 Michigan Behavioral Risk Factor Surveillance Survey Data
    7.3.1 Construction of Pseudo-population and Sample Selection for Simulation
    7.3.2 Adjustments
      7.3.2.1 Propensity Score Adjustment
      7.3.2.2 Calibration Adjustment
    7.3.3 Results of Adjustments
      7.3.3.1 Comparison of Adjusted Estimates
      7.3.3.2 Performance of Adjustments on Error Reduction
    7.3.4 Performance of Different Propensity Score Models and Calibration Models
    7.3.5 Variance Estimation
      7.3.5.1 Variance Estimation for Propensity Score Adjustment
      7.3.5.2 Variance Estimation for Calibration Adjustment
    7.3.6 Discussion
Chapter 8: Conclusion
Appendices

List of Tables

Table 4.1. Full Sample and Unadjusted Respondent Estimates of Percentages and Means
Table 4.2. Population and Unadjusted Full Sample Estimates
Table 7.1. Distribution of Age, Gender, Education and Race of GSS Full Sample, GSS Web Users and Harris Interactive Survey Respondents
Table 7.2. P-values of the Auxiliary Variables in Logit Models Predicting y_blks (Warm Feelings towards Blacks) and y_vote (Voting Participation in the 2000 Presidential Election)
Table 7.3. Propensity Score Models and Their Covariates by Variable
Table 7.4. Simulation Means of Estimates by Different Samples before Adjustment
Table 7.5. Reference Sample and Unadjusted and Propensity Score Adjusted Web Sample Estimates for y_blks and y_vote
Table 7.6. Comparison of Population Values, Reference Sample Estimates and Web Sample Estimates for y_blks and y_vote
Table 7.7. Distribution of Age, Gender, Education and Race of BRFSS Full Sample, BRFSS Web Users and Harris Interactive Survey Respondents
Table 7.8. List of Covariates Used for Propensity Modeling
Table 7.9. Propensity Score Models and P-values of Covariates for Different Dependent Variables
Table 7.10. Population Values, Reference Sample Estimates and Web Sample Estimates for HBP, SMOKE and ACT
Table 7.11.A. Error Properties of Reference Sample and Web Sample Estimates for Proportion of People with High Blood Pressure
Table 7.11.B. Error Properties of Reference Sample and Web Sample Estimates for Proportion of People Who Smoked 100 Cigarettes or More
Table 7.11.C. Error Properties of Reference Sample and Web Sample Estimates for Proportion of People Who Do Vigorous Physical Activities
Table 7.12. Distribution of Weights for All Adjustments over All Simulations
Table 7.13. Least Square Means of Percent Root Mean Square Error Reduction, Percent Bias Reduction and Percent Standard Error Increase by Propensity Score Adjustment Status, Calibration Adjustment Status and Their Interactions
Table 7.14. Results of Analysis of Variance on Percent Root Mean Square Error Reduction, Percent Bias Reduction and Percent Standard Error Increase by Propensity Score Adjustment Models, Calibration Adjustment Models and Their Interactions
Table 7.15. Least Square Means of Percent Root Mean Square Error Reduction, Percent Bias Reduction and Percent Standard Error Increase by Propensity Score Adjustment Models and Calibration Adjustment Models
Table 7.16. Estimated Standard Errors and Simulation Standard Errors of Propensity Score Adjusted Web Sample Estimates
Table 7.17. Coverage Rates of 95% Confidence Intervals by Standard Error Estimated with v.ds and v.naive

List of Figures

Figure 2.1. Classification of Web Surveys
Figure 4.1. Protocol of Pre-recruited Probability Panel Web Surveys
Figure 4.2. Distributions of Covariates for Full Sample and Unadjusted Respondents
Figure 4.3. 95% Confidence Intervals of Deviations of Respondent Estimates from Full Sample Estimates
Figure 4.4. Distributions of Covariates for CPS and Unadjusted Full Sample
Figure 4.5. 95% Confidence Intervals of Deviations of Full Sample Estimates from CPS Comparison Estimates
Figure 6.1. Volunteer Panel Web Survey Protocol
Figure 6.2. Proposed Adjustment Procedure for Volunteer Panel Web Surveys
Figure 7.1. Relationship between the Distributions of the Different Web Sample Estimates and the Reference Sample Estimates for y_blks (Warm Feelings towards Blacks)
Figure 7.2. Relationship between the Distributions of the Different Web Sample Estimates and the Reference Sample Estimates for y_vote (Voting Participation)
Figure 7.3. Distributions of the Web Estimates by Different Propensity Score Adjustments
Figure 7.4. Relationship between Percent Bias Reduction and Percent Standard Error Increase in Unadjusted and Adjusted Web Sample Estimates
Figure 7.5. Simulation Means of All Web Sample Estimates and Reference Sample Estimates and Population Values
Figure 7.6. Relationship between Percent Bias Reduction and Percent Standard Error Increase in Adjusted Web Sample Estimates
Figure 7.7. Standard Error of Adjusted Web Sample Estimates by Different Adjustment Method Combinations
Figure 7.8. Relationship between Standard Error and Percent Bias Reduction of Adjusted Web Sample Estimates
Figure 7.9. Relationship between 95% Confidence Interval Coverage and Percent Bias Reduction of Adjusted Web Sample Estimates

Chapter 1: Introduction

Survey methodology has a relatively short history as an academic field. It was not until the infamous debacle of the 1936 presidential election polling by the Literary Digest that the need for scientific data collection was recognized. Since then, the survey methodology field has evolved dynamically along with the cultural and technological changes in society. Among these developments, the most notable is the telephone interview (Groves and Kahn, 1979; Dillman, 1998; Dillman, 2002). When the idea of conducting surveys over the telephone was first introduced, researchers were not fully convinced of its utility, because the failed Literary Digest poll had used a telephone list and because the prevailing belief was that surveys should involve face-to-face interaction. Since the Health Survey Methods Conference in 1972, where telephone interviewing first received attention as a serious data collection mode (Dillman, 1998), there has been a great effort to build and improve telephone survey methodology (e.g., Groves and Kahn, 1979). Meanwhile, the innovative concept of balancing survey costs and errors has influenced researchers to design surveys within a fixed budget (e.g., Groves, 1989). Well-defined probability sampling procedures based on random digit dialing have also been developed for telephone surveys (e.g., Mitofsky, 1970; Waksberg, 1978; Lepkowski, 1988; Casady and Lepkowski, 1993). Practical considerations and societal changes have also boosted the legitimacy of telephone interviews. For example, increased telephone usage and lowered household contactability for face-to-face interviews, due to an increase in the female workforce and a decrease in household size, have made surveys by telephone more feasible and cost-effective. Now, telephone surveys are a standard data collection method in most developed countries.
The survey research field is experiencing another challenging breakthrough: Internet surveys. The origin of the Internet dates to 1962, when J.C.R. Licklider raised the "Galactic Network" concept, which depicted a set of globally interconnected computers through which everyone could quickly access data and programs from any site (Leiner et al., 2000). Development was initiated by the military during the Cold War (Slevin, 2000), when the Advanced Research Projects Agency (ARPA) was set up within the US Department of Defense to develop technologies for interlinking computer networks and facilitating computer-mediated communication. In 1969, ARPANET, the first packet switching network, consisting of four host computers at universities in the southwestern US, was launched; it is the origin from which the Internet has grown.

The Internet embodies a key underlying technical idea: open architecture networking (Leiner et al., 2000). Under this approach, the choice of individual network technology is not dictated by any particular network architecture, which enables the coexistence of multiple independent networks of rather arbitrary design. Widespread development of Local Area Networks (LANs) and personal computers in the 1980s sped up the usage of the Internet by the public. In 1992, CERN (the European Laboratory for Particle Physics) released the World Wide Web (WWW), graphics-based software. Around the same time, HyperText Markup Language (HTML) was invented at CERN. These two components later led to Web browsers, such as Netscape and Microsoft Internet Explorer (Gattiker, 2001). Now, utilization of the Internet is heavily dependent on graphics-based interaction, as more and more sites adopt this technology and graphical browsers are used to access the Internet.

According to Leiner et al. (2000), the Internet is a world-wide broadcasting capability, a mechanism for information dissemination, and a medium for collaboration and interaction between individuals and their computers regardless of geographic location. The Internet takes various forms: e-mail, newsgroups (Usenet), Multi-User Domains (MUDs), Internet Relay Chat (IRC), File Transfer Protocol (FTP), electronic mailing lists (listservs), and the WWW (Web, hereafter) are some examples. Compared to other applications, the Web is user friendly, as it does not require a high level of computing knowledge. The contents on the Web are displayed in browsers that enable an intuitive graphics-based interface between the contents and the Web users. Sorting, retrieving, and sharing information based on a web of hyperlinks and hypertext are not complicated. Thanks to hypertext and hyperlinks, Web users may move from one webpage to another without a glitch, deciding which information they wish to have transferred to their browser and which links they want to skip. Moreover, unlike conventional communication media, the Web carries information expressed in a multi-media format including text, sound, and still and moving graphics. Due to its prominence, the term "Web" will be used interchangeably with "Internet" throughout this study, although the Web is only one means of employing the Internet.

The popularity of personal computers and the convenience of the Web have made it the fastest growing communication medium in developed countries. It is not a radical idea any more to have a flower shop deliver a bouquet to parents in another country or to pay bills over the Web. Technology changes; so does society.
"Our survey methods are more a dependent variable of society than an independent variable," according to Dillman (2002). The ideal survey methodology is likely to reflect the society and its culture. Just as telephone surveys began to be adopted extensively a few decades ago, mirroring societal and technological trends, the survey methodology field is currently witnessing widespread growth in the use of Web surveys (Taylor and Terhanian, 2003). All these changes in survey modes occur because survey methods inevitably manifest societal trends.

Nevertheless, there are mixed views about Web surveys. While many researchers think that Web surveys have great potential as an addition to existing methods and as a route to measurement improvement (e.g., Taylor, 2000; Couper, 2001a, 2001b; Dillman, 2002), others express pessimistic conjectures about Web surveys (e.g., Mitofsky, 1999). The negative views seem due to the fact that there is no well-accepted Web survey methodology for selecting probability samples targeting the general population; Web surveys are new to the field, and the rapid increase in their use has far outpaced methodological development. No matter how strongly survey methodologists warn about the limitations of Web survey quality, it is unlikely that the field will give up on Web surveys. Thus, it is necessary to acknowledge the importance of Web surveys, instead of neglecting their potential by regarding them as a cheap and dirty method. It becomes the methodologists' responsibility to devise ways to improve Web survey statistical methods (e.g., sample selection and estimation) and measurement techniques (e.g., questionnaire design and interface usability).

Luckily, there have been a number of substantial attempts by social scientists on the design side of Web surveys, particularly questionnaire design and usability. However, findings from these studies do not cover the full picture of Web survey methodology, as they are limited to improving the quality of data collected from persons who do participate in the surveys. Less attention has been given to statistical inference based on Web surveys. A basic statistical question is whether the data collected from a set of Web survey respondents can be used to make inferences about a desired target population. Statistical properties of Web survey outcomes, however, deviate from those of traditional surveys. Survey organizations may hope that their Web surveys represent the general population of households or persons. But it is unrealistic to assume that Web surveys targeting the general population are based on randomization: the frame coverage is uncertain, which means that drawing a probability sample from the target population is impossible. Moreover, response rates in Web surveys are low. Therefore, it is highly likely that Web surveys inherently carry errors related to coverage, sampling, and nonresponse. There are post-survey statistical approaches to compensate for these errors in traditional surveys, such as face-to-face and telephone surveys. Their performance on Web survey errors is open to discussion, as the underlying mechanism of these errors may be unique to Web surveys. To explore this possibility, this study will focus on the statistical aspect of Web surveys, more specifically post-survey adjustment.
It will examine the existing survey adjustment methods and expand the possibilities by proposing and examining propensity score adjustment and calibration methods specifically devised for Web surveys.

The remainder of this study comprises the following chapters. The classification of current Web survey practice and the structure of Web survey errors, as related to cyber culture and Web usage, will be introduced in Chapter 2. Chapter 3 will state the purposes of this dissertation and summarize the work in the subsequent chapters. The extent to which traditional post-survey adjustment methods correct for coverage and nonresponse error will be evaluated in Chapter 4. The core of this study is Chapters 5, 6, and 7, where propensity score adjustment and calibration will be examined as alternatives to more traditional post-stratification adjustment. Chapter 5 will document propensity score adjustment as a bias reduction method in observational studies and will review the literature on propensity score adjustment. Chapter 6 will identify how this method, along with calibration adjustment, can improve estimation using Web survey data by relating it to the characteristics of the Web sample discussed in Chapter 2; it will also provide mathematical notation for both the propensity score adjustment and the calibration adjustment. Chapter 7 will consist of two case studies in which the proposed adjustment methods are applied to survey data and will appraise the magnitude of error reduction in simulations. Propensity score model building strategies and variance estimation issues will also be examined. The study will conclude with Chapter 8, which summarizes the implications and limitations of this research and makes suggestions for future research to advance this work.

Chapter 2: Web Survey Practice and Its Errors

Surveys can be conducted on the Web at any time, in any place, with many colors and multi-media features, at very little cost. The fact that, for an increasing number of people, the Internet is an ordinary tool of communication, a channel for information, and a place for various daily activities has attracted an enormous amount of attention from survey researchers. The growth of Web survey practice is rapid, considering that the possibility of conducting surveys on the Web was first discussed less than a decade ago. There is an apparent gap between the statistical and measurement features of Web survey practice and methodological research. Despite the fact that Web surveys have not been thoroughly studied and survey professionals express suspicions about their quality, the Internet seems to be somewhat overloaded with these dubious data collections. (The existence of websites claiming that Internet users can make money by taking surveys, e.g., http://www.surveys4money.com, could be evidence of this concern.) This, however, should not discourage survey methodologists from seeing the Web as a potential data collection tool. Understanding Web surveys from different disciplinary and methodological perspectives should improve the quality of Web-based surveys.

2.1 Types of Web Surveys

Web surveys are not the same as Internet surveys: Internet surveys include both Web and e-mail surveys, whereas Web surveys include only those presented via WWW browsers. Due to limitations with storage and software compatibility, e-mail surveys are less popular than Web surveys; thus, this research mainly focuses on Web surveys.
Web surveys can first be classified into three categories, as in Figure 2.1. This classification is based on the availability and the construction method of a sampling frame (Couper, 2001a; Manfreda, 2001; Couper, 2002; Couper and Tourangeau, 2002). When sampling frames are not available, the open invitation type of Web survey is conducted. Examples of this are entertainment polls, like QUICKVOTE on http://www.cnn.com, and unrestricted self-selection surveys. Such a survey is virtually open to anyone with Web access, and people who want to take the survey can respond as many times as they wish. Open invitation Web surveys are not suitable for scientific research, because researchers do not have any control over the participation mechanism.

Figure 2.1. Classification of Web Surveys (Source: Manfreda, 2001; Couper, 2001a; Couper, 2002)

The second type of Web survey constructs a list of participants during data collection, and this list may be used as a frame. Survey participants are recruited as they are intercepted at designated survey sites or encounter pop-up surveys or banner ads when they log onto certain websites for other purposes. Depending on the intercept implementation methods, these surveys may accommodate probability sampling. However, their response rates are typically very low (far less than 10%), making this type of Web survey unsuitable for scientific research.

The third category of Web surveys has a sampling frame prior to data collection, which allows individual invitation of sample units. Researchers may have full control over respondents' participation by restricting survey access. The quality of this Web survey method is considered better than that of the previous ones. This type is further dichotomized by the probabilistic nature of the sample. The first variant uses nonprobability samples drawn from volunteer panels or commercially available e-mail lists; one example is the method currently used by Harris Interactive. Panel members in volunteer Web surveys self-select to join the panel, and commercial e-mail lists include Internet users who registered for some other services on the Web. Such frames may contain duplicate listings, and there can be problems in identifying multiple listings on the sampling frame as well as in the sample, and thus in obtaining the probability of inclusion.

The second variant of Web surveys with sampling frames constructed prior to data collection uses probability sampling. Under this method, there are currently four different ways to conduct Web surveys: (1) Web surveys using a list of some unique population whose members all have Web access; (2) Web surveys recruiting Internet users via traditional survey modes with a probabilistic mechanism; (3) Web surveys providing Web access to a set of recruited panel members who were probabilistically sampled from the general population; and (4) a Web survey option in mixed-mode probability sample surveys. (Web options in mixed-mode surveys differ by the control method of participation assignment. While some mixed-mode surveys use random assignment, enabling researchers to know which units are answering on the Web prior to survey recipients' participation, others let respondents choose a preferred mode.) The probability of inclusion is obtainable in these Web surveys and may be used in estimation. Strictly speaking, design-based statistical inferences can be drawn only under these last four Web survey methods.

2.2 Cyber Culture and Web Surveys

One way of gaining fundamental knowledge about Web surveys is to understand cyber culture, because the relationship between survey methods and cultural phenomena is substantial, as discussed in Chapter 1.
This section will examine the culture in cyberspace in order to provide integrative views on the Web survey, its respondents, and its errors.

The Internet is a special medium, for it enables both reciprocal and non-reciprocal communication. On the one hand, the Internet forms some types of solidarity among its users by deconstructing physical and social boundaries (Reid, 1991) and connecting all users who are willing to participate. On the other hand, the concept of "community" does not appear to exist in the cyber world, because the culture in the cyber community is distinct from that in the everyday community.

Cyber culture has tended to be treated negatively, as it is viewed as having a destructive effect on both personal identity and social culture (Turkle, 1995). Turkle (1995) argues that "in the real-time communities of cyberspace, we are dwellers on the threshold between the real and the virtual, unsure of our footing, inventing ourselves as we go along." The cyber world connives at personal identities being de-centered, dispersed, and multiplied. This fluctuating identity may be best portrayed by one term: anonymity. Anonymity, indeed, is one of the highlights of identity formation on the Internet (Slevin, 2000; Burnett and Marshall, 2003). While scarce in real life, anonymity is omnipresent in cyberspace. The idea that the physical or lawful being of users is not always verifiable on the Internet seems to have led people to counterfeit their identities or appear under many different identities. Nonetheless, the reality is that our Web activities leave remnants that can be traced and identified. While anonymity or identity invention is an elusive idea, Internet users misperceive that others are not able to obtain their true identity unless they reveal it. "Anonymity continues to operate as the boundary that one traverses as a Web user, whether as a lurker in chatgroups or as a multiple personality in usegroups and chatgroups" (Burnett and Marshall, 2003).

The possibility of locating one's true identity in cyberspace does not stop Internet users from enjoying their anonymity. Ironically, this possibility triggers another issue: threats to real-life privacy. Internet users are aware that it is easy to obtain personal information with the development of the Internet and that it is possible for strangers to access and use their identity. Privacy has become a luxury item in the cyber world (Moore, 2002), and this has increased privacy concerns.

The Internet has been found by some authors to have a negative effect on interpersonal relationships (Kraut et al., 1998; Nie and Erbring, 2000): Internet usage weakens traditional relationships, lessens total social involvement, and increases loneliness and depression. These authors argue that the quality of Internet social relationships is poorer than that of face-to-face relationships and that the time spent creating only a weak tie in cyberspace takes away opportunities to form strong face-to-face ties with real human beings. Heavy Internet usage somehow makes its users lose touch with the social environment.
In sum, Internet society does not require as much coherence in interpersonal relationships as real society does. The "fluctuating identity" and "social incoherence" (Burnett and Marshall, 2003) of cyberspace may affect response behavior in Web surveys in three ways.

First, people may perceive a lower degree of social obligation when they are online. E-mail addresses, the common route for sampling and contacting survey recipients, may not convey as much importance as is needed for survey participation and completion. Moreover, recipients know that their individual identity is not easy to verify through e-mail addresses. This may provide a feeling of safety when they discard survey invitations, or even when they behave as if they were someone else and forge their responses accordingly. The weak interpersonal ties and less-structured culture of Internet society add further reasons for lowered social obligation. Social exchange theory, used to explain how to stimulate survey cooperation in other modes (e.g., Groves and Couper, 1998; Groves, 1989; Dillman, 2000), may not hold in Web surveys.

Second, heightened privacy concern on the Internet may make online behavior more vigilant, even when there is only a slight chance of exposing one's true identity. Two survey errors may arise from respondent behavior caused by privacy concerns. First, when an Internet user receives a survey invitation e-mail from an organization the user is not familiar with, the person is unlikely to pay attention to the invitation. Second, the user may want to provide desirable responses if the survey is conducted by a well-known organization that the user believes has the capability to track him or her down. In this case, the respondent may want to depict himself or herself in a socially acceptable way. This second error may be completely opposite to the Web survey pioneers' prediction that Web surveys, as a type of self-administered data collection, would obtain information free from self-presentation pressure.

Third, Web survey respondents' behavior may be affected by their Web usage behavior. Internet users are used to switching from one task to another by clicking and closing windows or moving to other websites whenever they encounter something other than what they expect or something they are not interested in. There are countless distracting features on the Web, from pop-up ads to instant messengers. This environment itself makes it difficult for Web users to focus their attention on one task. Accordingly, survey recipients may not open the invitation e-mail if it appears uninteresting. Even when recipients open the survey, there is a great chance they will depart from it at any time if they find the survey is not as interesting or urgent as they first thought; their return to the survey is not guaranteed. There is likely to be more than one stimulus on the recipient's computer monitor, although survey researchers wish that the survey questionnaire were the only feature. In this case, the level of cognitive capacity devoted solely to the Web survey may be low. Computer viruses may be another factor of the Internet environment. Since they are spread widely via the Internet, one recommendation for computer protection is to delete any suspicious e-mails. Imagine a Web survey fielded, unfortunately, during a virus epidemic: why would people keep the invitation e-mail in their mailbox?
2.3 Web Usage by Demographic Characteristics and Web Surveys

The demographic characteristics of Web users are another source of understanding Web survey respondents and may reveal information on their behaviors and the consequent survey errors. As in the previous section, we will examine who is on the Web and whom Web surveys are likely to attract.

The existing Web survey literature seems to take the possibility of conducting useful surveys on the Web for granted. This can be deceptive, because only a selected portion of the general population is privileged to have Internet access. The futurologist Toffler (1970; 1980; 1991), even before the Internet was introduced to the public, predicted that technological changes would endanger people by leaving them behind in the postindustrial economy if they did not heed and act on the changes. As predicted in his book Powershift (1991), an unconventional economic power paradigm is emerging: power is shifting from the people with more material resources to those with more information. The Internet is a critical medium for acquiring bountiful and opportune information in a short time. However, Internet usage is not evenly distributed with respect to socio-economic status and demographic characteristics, which leads to an unequal chance to obtain the power predicted by Toffler, especially for less-privileged people.

Internet access rates differ considerably among countries, implying that the target population that can be covered by Web surveys differs considerably as well. According to the 2003 International Telecommunication Union report (available at http://www.itu.int/ITU-D/ict/statistics/), there are only ten countries where more than half the population uses the Internet (Iceland: 67.5%, Republic of Korea: 60.3%, Sweden: 57.3%, US: 55.1%, New Zealand: 52.6%, Netherlands: 52.2%, Canada: 51.3%, Finland: 50.9%, Singapore: 50.4%, Norway: 50.3%). In some countries, such as Myanmar, Tajikistan, and the Democratic Republic of the Congo, fewer than 10 out of 10,000 people use the Internet. The divergent Internet usage levels across countries seem closely related to economic status and telecommunication infrastructure, which are, in turn, related to education. Suppose a survey were conducted via the Web in the U.S., including U.S. territories and outlying areas. Given that Web users may differ from nonusers and that people from each state, for instance, may be disproportionately represented, results from this survey may not generalize to any degree. Until there are substantial proportions of Internet users around the world, the possibility of conducting Web surveys free from physical and geographical boundaries may remain a daydream.

In the U.S., there is a broad range of information about Web usage by different demographic groups, and there is great concern about the digital divide, the difference between the online and offline populations. A Nation Online (2002) indicated uneven Internet usage by age, income level, educational attainment, employment status, race/ethnicity, household composition, urbanicity, and health status. Not surprisingly, young people are leading Internet usage, as 75% of youth between the ages of 5 and 17 use the Internet.
In addition, the following groups are less likely to use the Internet than their respective counterparts: people with lower income, without employment, with lower education, or with disabilities; people living in the central city, in non-family households, or in family households without children; and Blacks and Hispanics. Although there is evidence that the gaps in those characteristics between the online and offline populations are decreasing (US Department of Commerce, 2002), the uneven levels of Web usage with respect to these background characteristics are likely to remain. Moreover, there will remain certain groups of people who are unable to go online for financial, technical, or health reasons.

This digital divide may affect the quality of Web surveys. Unless the people on the Internet are the population of interest, Web surveys are likely to include people of higher socioeconomic status and more socially engaged and younger people at disproportionately higher rates than traditional surveys. Depending on the target population of a survey, this can result in unequal coverage, as Internet nonusers may be systematically under-represented. Internet users may also have distinctive survey response behaviors, for example, higher noncontact or nonresponse rates or lower compliance in completing the survey task. This will also cause different combinations and levels of survey errors than in traditional surveys.

2.4 Web Survey Errors

The best way to understand Web surveys is a systematic comparison between Web surveys and traditional surveys, such as telephone and face-to-face surveys, with respect to total survey error (Deming, 1944; Groves, 1989). Following the traditional approach illustrated in Groves (1989), this section will examine all components of total survey error in Web surveys: coverage error, sampling error, nonresponse error, and measurement error (also refer to Couper, 2002; Couper and Tourangeau, 2002).

2.4.1 Coverage Error

Coverage error arises when the survey frame does not cover the population of interest. Although Web surveys can be subject to either undercoverage or overcoverage, the former is the more serious problem in Web surveys. The number of Internet users in the US was estimated by A Nation Online (2002) at 143 million, and about two million additional Americans go online annually. The number of Internet users worldwide is likely to continue increasing. While these numbers and their growth are impressive, Internet users account for only 54% of the American population. Consequently, even though the Internet population is large and growing, a huge portion of the general population would be omitted from a Web survey. Although some may claim that large sample sizes protect their surveys from the systematic exclusion of a large segment of the population, this is fallacious, as sample size is not related to coverage error at all: coverage error is a function of the coverage rate and the difference between covered and omitted units.
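To make this dependence concrete, the standard decomposition of coverage bias for a population mean can be written as follows (the notation here is standard textbook shorthand, not taken from this dissertation): if $N$ is the target population size, $N_{nc}$ the number of units not covered by the frame, and $\bar{Y}_C$ and $\bar{Y}_{nc}$ the means of the covered and noncovered units, then the bias of an estimator whose expectation is $\bar{Y}_C$ is

$$\bar{Y}_C - \bar{Y} = \frac{N_{nc}}{N}\left(\bar{Y}_C - \bar{Y}_{nc}\right).$$

The sample size appears nowhere in this expression; only the noncoverage rate and the covered-versus-noncovered difference matter.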
A possible solution for this problem may be providing Internet access to the offline population. This idea is currently practiced by Knowledge Networks (Huggins and Eyerman, 2001) ? pre-recruited panel Web survey examined in Section 2.1. In order to construct a controlled panel, first eligible telephone numbers are called via random digit dialing, and eligible people who answer the phone are invited to join a Web survey panel. If the call recipients agree to be panel members, they receive a Web TV 4 , regardless of their Web usage status prior to the recruitment. 5 Overcoverage of Web surveys is related to the possibility of multiple Internet identities which Section 2.2 introduced as an attribute of the cyber culture. In effect, any Internet users encounter many chances to set up multiple e-mail addresses, whether they intend to or not. For instance, a college freshman has an e-mail address which he has used since high school and is using it to communicate with his high school friends and his family. His college automatically assigned him another e-mail address, and he mainly uses it for school-related matters. Imagine his part-time job involves some computing and he sets up his third e-mail address for better work delivery within the company. This student already has three e-mail addresses. It is a matter of time for him to get assigned additional e-mail addresses that he may or may not be aware of. This possibility implies existence of overcoverage in volunteer panel Web surveys and commercially available e- mail list-based Web surveys. There is a potential threat that Web survey volunteers may 4 In principle, this may solve coverage problems, but its operation has shown some limitations: there are areas where the Web TV service is not available. This may be viewed as nonresponse error. However, it is not clear whether people who do not respond to the RDD invitation or who decline to join the panel affect coverage properties systematically. 5 KN is now allowing panel members who already have a computer and an Internet access to use their own system. For these members, KN provides different monetary incentives. 27 join the panel multiple times with different identities in order to increase the odds of receiving incentives. For commercial e-mail lists, it is impossible to distinguish to whom each e-mail address belongs. One approach to identify the duplicate units and adjust for them in these frames is to ask a sample person whether he/she has other email addresses and, if so and if possible, what they are. The selection probability for each person could then be adjusted in the same way that a household selection probability is adjusted in a random digit dialing telephone survey where the household has more than one telephone line. 2.4.2 Sampling Error Sampling error occurs due to the fact that not every unit in the target population is in the survey. The concept is usually considered in the context of probability sampling. In Web survey practice, nonprobability sampling is dominant because of its convenience and inexpensiveness. Researchers should bear in mind that nonprobability sampling can give biased estimates, as in the Literary Digest incident, and requires that strong structural assumptions hold in order for inferences to be valid. There is an effort by Harris Interactive as previously introduced to compensate for the coverage and sampling errors by sophisticated weighting. 
This technique adopts propensity score adjustment originally proposed by Rubin and Rosenbaum (1983) for causal inferences using observational data. Propensity score adjustment balances out the covariate differences between the treatment and control groups whose assignment mechanism is not random. Harris Interactive collects reference survey data through RDD telephone surveys as if they come from a control group and Web survey data as a treatment group. Through the use of weights, the estimated distribution from the Web 28 survey is adjusted to match that of the reference survey on certain variables that are collected in both. Although Harris Interactive has been advocating the effectiveness of propensity score adjustment, there have not been well-documented technical procedures for this application. Moreover, the amount of evaluation on the adjustment performance is very limited (e.g., Terhanian et al., 2000a; Taylor et al., 2001; Schonlau et al., 2003; Varedian and Forsman, 2003), which leads to inconclusive implications. This method will be elaborated in Chapter 5 and 6 and examined in Chapter 7. 2.4.3 Nonresponse Error Nonresponse error arises when not all survey recipients respond. This error is a multiplicative function of two components: the response rate and the difference between respondents and nonrespondents. One substantial problem of Web survey nonresponse is that response rates are not always measurable. For volunteer panel Web surveys or open- invitation Web surveys, it is impossible to measure the number of potential respondents who are actually exposed to the survey invitation. Web surveys using commercial e-mail lists may potentially allow response rates to be measured, but confront difficulties identifying whether the e-mail addresses are still being used. Thus, the nonresponse rate among eligibles is entangled with the rate of ineligibility on the frame. Web surveys whose response rates are measurable have achieved relatively poor results. Response rates for the intercept or pop-up surveys do not exceed 10%; around 20 to 30% for volunteer panel Web surveys (e.g., Harris Interactive); and around 50% for surveys on panel members who are given Web access (e.g., Knowledge Networks). When the use of Web surveys started to increase, many researchers noted the problems associated with coverage and sampling errors. Interestingly, few were 29 concerned about the nonresponse in Web surveys. Some pioneers were even optimistic about the response rates by arguing that respondents could take surveys on the Internet at their convenience and this gives more chances to respond. In reality, response rates in Web surveys are low relative to other survey modes. After adjusting for the cumulative nature of Web panel recruitment and survey participation, the final response rates may dip far below the nominal response rates noted above. What are the possible causes of Web survey nonresponse? First of all, compared to traditional surveys, it is difficult in a Web survey to provide tangible financial incentives and is impossible to build rapport between the survey conductor and takers. This is because an interviewer who plays a role as a motivator and a mediator is eliminated. It is also related to the laxity of the Internet society ? Web survey recipients may not feel obligated to abide by the survey request. A second source of nonresponse error may be found in limited computer literacy among some groups. 
While it is true that browsing websites does not require a high level of computer literacy thanks to the adoption of Graphic User Interfaces, there are people, especially older and less educated people, who may still feel uncomfortable with using computers and the Internet. Although the Web survey design quality is most likely to influence the measurement error which will be examined shortly, the lack of computer literacy may not permit them to access or operate Web surveys. When considering the frequency of encountering badly designed Web questionnaires, the cognitive challenges that these people may perceive on top of the burden caused by low computer literacy, may elicit a high level of nonresponse. 30 The level of system accessibility may be another reason. Depending on the popularity and the age of computer platforms and/or Internet browsers, Web questionnaires may appear in various ways. Some survey recipients with an older platform or a less popular browser, for instance, may not even have a chance to view the questionnaire as implemented. Those with slower modems or processors may experience a lengthy delay in questionnaire loading and give up carrying out survey task. These recipients become nonrespondents or partial respondents, not because they avoid surveys, but because their system restricts them from accessing survey instruments. The most critical cause for nonresponse in Web surveys seems related to the cyber culture examined in Section 2.2. The guaranteed anonymity and relaxed social ties add more reasons for respondents to neglect the survey requests. Heightened concerns about the personal privacy may weaken the legitimacy of the survey organizations in the minds of potential respondents, while the authority of survey organizations has been found to have positive effect on the completion of other surveys (Presser et al., 1992; Groves and Couper, 1998). Quick and easy navigation from one location to another or one task to another and distracting features on the Web may produce higher levels of nonresponse and break-offs. 2.4.4 Measurement Error Unlike the previous three types of survey errors, measurement error exists within collected data. Among four survey error components, measurement is the area where Web surveys may have distinctive advantages over other data collection modes. Accordingly, it has been studied more rigorously than other error components. 31 What are the measurement advantages of conducting surveys on the Web? First, interviewers are eliminated, which can be a key source of response error and variance. Ideally, this nullifies interviewer effect on survey statistics and helps to minimize respondents? fear of exposing sensitive answers. This advantage, however, is common to all self-administered surveys. Second, Web surveys with a minimal addition in programming make it feasible to automate and customize the questionnaires: skip patterns, item branching, randomization on question and response-option order, answer range checks, and tailoring of question wording may be built into the questionnaire. Feedback or error messages may be pre- programmed so that the survey instrument could point the respondents in the right direction whenever mistakes occur. Note that the automation and customization are not unique only for Web surveys ? they are attainable in all computer assisted survey modes. The greatest advantage of using the Web is its richness of visual presentation. 
There is an unlimited range of colors and images one can choose for Web surveys, which would cause a substantial cost increase in other modes. Even multimedia features, such as video clips, which are not always possible to implement in other modes, can be freely employed in Web surveys, provided the respondents have the appropriate equipment. These unique characteristics of Web surveys may not only make survey instruments look more appealing but also reduce the cognitive and operational burden on respondents.

These advantageous attributes of Web surveys, unfortunately, may turn into disadvantages, because it is easy to overuse or misuse them. If colors, images, and multimedia features do not match the respondents' cognitive map, they may confuse respondents, who may draw inferences from those features that the survey designers did not intend. Question wording customization could backfire with sensitive topics, as personalized questions may trigger respondents' privacy concerns. With feedback, help menus, and instructions, Web surveys attempt to facilitate respondents' question comprehension and minimize questionnaire operation errors. However, it is uncertain whether respondents use these features and whether they find them informative and useful. The absence of interviewers may result in a greater chance of satisficing response behavior, as respondents may sense a lower degree of motivation. Unlike other surveys, Web surveys demand a higher degree of cognitive capability and computer knowledge. In addition to the cognitive processes devoted solely to survey tasks, respondents need to allocate their remaining cognitive capacity to managing the questionnaire design components and distracting Web features and to understanding the operation of the questionnaire. Unequal technological competence among respondents may cause a problem: novice and expert Internet users may encounter different burdens and, therefore, produce different measurement errors. If a Web survey targets a population of novice Internet users, the measurement error may be detrimental.

We have examined types of Web surveys and integrated errors in Web surveys with the cyber culture and webographics. To recapitulate, first, it is important not to lump all types of Web surveys into one. Burnett and Marshall (2003) documented that "unifying the Web into a simple medium is fraught with inconsistencies and exceptions to a degree that is unparalleled in past media. Researchers have been more successful at laying claim to the idea of 'television', where its intrinsic modality was evident." The same argument seems to hold for Web surveys. There are only a few variations of telephone surveys one can carry out, and the error mechanism for each of these telephone surveys is rather simple and predictable. The story changes completely for Web surveys: there are many different kinds of Web surveys; at least nine types were identified in this chapter based on the method used for sampling. These surveys are all idiosyncratic with respect to survey errors; they differ from one another in the most critical error components, the sources of errors, and the absolute and relative magnitude of each error. This may be clear in a comparison between open-invitation and pre-recruited Web user Web surveys. While the latter is capable of covering the target population and drawing probability samples, the former is unlikely to achieve either.
In addition, there is a dramatic difference in response rates between the two. The properties of measurement error, however, may be comparable. Therefore, it is necessary to understand and evaluate particular Web surveys one at a time, not Web surveys as one unity. Second, there is a need for systematic investigation of Web survey errors. Studies of Web survey error to date have made a laundry list of errors and are limited in providing a meaningful foundation for the mechanisms behind those errors. This chapter traced a number of sources of Web survey error to the cyber culture and the digital divide. It may be necessary to incorporate findings from other fields in order to broaden the understanding of the error mechanisms in Web surveys.

Chapter 3: Statement of Purpose and Work

The proposed research is intended to find innovative statistical approaches for adjusting errors caused by the unrepresentativeness of Web surveys. Based on the implications in Chapter 2, among the various types of Web surveys, this study will focus on one: volunteer panel Web surveys. The foremost problem is that, unlike in traditional surveys, the samples in this type of Web survey are not guaranteed to be randomly selected. These samples are comprised of units drawn, probabilistically or nonprobabilistically, from a set of nonrandom volunteers. Because of nonresponse, the responding units generally cannot be considered a probability sample even from the frame of volunteers. They are likely to differ systematically from the survey target population, reflecting the unequal availability of Web access and the impossibility of placing any control on the frame population. The nonrandomization in Web surveys inevitably increases biases in survey estimates. Bias reduction becomes crucial for making use of results from these Web surveys. As the biases are difficult to control in the survey preparation phase, post-survey adjustments may reduce bias more efficiently. There is one approach that has been discussed as a potential method of compensating for nonrandomness in causal studies: propensity score adjustment. Harris Interactive first introduced propensity score adjustment for their Web survey data, which are collected from volunteer panels (e.g., Taylor, 2000; Terhanian and Bremer, 2000). Propensity score adjustment uses covariates collected in surveys and provides an additional layer of weights in order to produce post-survey weights that ideally remedy selection bias in Web surveys. Harris Interactive claims that the results from their volunteer panel Web surveys are generalizable to the U.S. population, according to their report, which can be accessed from http://www.harrisinteractive.com/tech/HI_Methodology_Overview.pdf.

Although there have been a few studies examining the application of PSA for volunteer panel Web surveys (e.g., Schonlau et al., 2004; Danielssen, 2002; Varedian and Forsman, 2002; Taylor et al., 2001; Taylor, 2000; Terhanian et al., 2000), more in-depth evaluation is needed for a number of reasons. First, the resemblance between Web surveys and the situations for which propensity score adjustment was developed needs to be scrutinized before adopting it for Web survey data. Second, the technical procedure of propensity score adjustment is not well documented. This makes the adjustment method more a mystery than a well-proven scientific method. The mathematics behind the propensity score adjustment of Web survey data needs to be clearly presented.
Third, adjusted Web estimates in those studies have often been compared to estimates from other surveys, typically telephone surveys conducted in parallel with the Web surveys. Since both sets of estimates are subject to sampling, coverage, nonresponse, and measurement error, the implication of any observed differences is unclear. Fourth, existing studies have focused only on the bias properties of the estimates. The other component of survey error, variance, has not been examined, although propensity score adjustment is likely to increase variability. Weights, in general, add an extra component to the variability of the estimates and, thus, decrease precision. Therefore, it is important to examine both aspects of error in evaluating the performance of propensity score adjustment. Fifth, some of the existing studies favored Web surveys by comparing Web polling estimates to election outcomes. These findings may not be indicative of the quality of Web surveys on other subjects; the conclusions may be flawed if Web survey respondents are more likely to vote than others. That fact alone may make Web surveys look favorable, because, in this case, the likelihood of voting may determine the election outcomes. The last issue is that propensity score adjustment needs to be used in conjunction with another adjustment that compensates for coverage errors. As we will show in later chapters, coverage adjustments are needed because propensity score adjustment can correct only the imbalances between the Web sample and some reference sample, and the reference sample may itself cover the target population imperfectly. It is worthwhile to examine the performance of propensity score adjustment when it interacts with other adjustments.

This research attempts to overcome the shortcomings in the existing literature on propensity score adjustment described above. It will examine the validity of modifying propensity score adjustment for studies other than causal inference, exploit the adjustment as a candidate for improving Web survey data, present the mathematical procedure for its application, and evaluate its performance. The evaluation will be extensive, as it includes several study variables measuring different characteristics, the choice of covariates for building propensity score models, the inclusion of additional adjustments for coverage errors and their interaction with the propensity score adjustment, and the effect of adjustment on three aspects of error: mean square error, bias, and variance. In order to accomplish the stated purposes, this research will carry out the following activities in subsequent chapters:

Chapter 4. Review and apply traditional adjustment methods, which are currently used to correct for nonresponse and coverage errors in Web surveys. Evaluate the performance of these adjustments.

Chapter 5. Introduce propensity score adjustment and review the ways it can be applied: pair matching, subclassification, and covariance adjustment. Identify the pertinence of employing propensity score adjustment for correcting estimates from Web survey data.

Chapter 6. Present the mathematical procedure for deriving weights using propensity score adjustment to address the lack of randomness in Web survey data. Introduce calibration as an additional adjustment method for compensating for coverage problems in Web survey data.

Chapter 7. Apply the identified propensity score adjustment method and calibration adjustment in two case studies.
Simulations using the 2002 General Social Survey and the 2002 Behavioral Risk Factor Surveillance Survey will be used for the application. The effectiveness of the different types of adjustments will be discussed in relation to all error components.

Chapter 8. Conclude the research with its implications and limitations. Suggest directions that future research may take to address the limitations of this research.

Chapter 4: Application of Traditional Adjustments for Web Survey Data

4.1 Introduction

Possible sources of error in Web surveys were examined in Chapter 2. The good news is that it may be possible to control those errors, especially nonresponse and coverage errors, using traditional post-survey statistical adjustments. This is feasible because Web survey companies create a panel pool whose members provide a range of background information before taking actual surveys. How effectively this can be done depends on the population to which inferences are to be made. Pre-recruited probability panel Web surveys, invented by Knowledge Networks (KN) and described in Huggins and Eyerman (2001), use a distinctive survey protocol (see Figure 4.1 for an illustration). KN recruits a controlled panel via random digit dialing (RDD) and equips the entire panel with a Web-access device regardless of prior Web usage status. At the first Web survey, the panel members take a profile survey collecting a range of background information. The idea, therefore, is that for any given subsequent survey, the profile data are available for both respondents and nonrespondents among the panel members. In addition, reliable population estimates for many of the profile characteristics may be obtained from large-scale government surveys. The abundance of covariates may shed light on how different weighting approaches to Web surveys could improve data quality.

Ideally, the recruited Web panel described above represents the population of households or persons that have telephones, as the panel members have a known probability of selection into the panel and the samples drawn from the panel also have known probabilities. This protocol may diminish the unequal coverage and nonprobabilistic sampling problems that are inherent to other Web surveys. It may be viewed as the most scientific method among Web surveys. However, there are significant complications. As partly shown in Figure 4.1 and partly discussed above, potential respondents go through roughly four stages before any survey in which they participate: initial RDD panel recruitment, Web device installation, profile survey completion, and post-profile panel retention. All of these stages, as well as actual survey participation, are susceptible to some type of loss in the potential respondent pool. The coverage and nonresponse errors are intertwined in this protocol.

[Figure 4.1. Protocol of Pre-recruited Probability Panel Web Surveys: population → RDD sample → RDD respondent → panel → active panel → survey sample → survey respondent, with stage transitions labeled RDD, recruitment, attrition, sampling, invitation, and response.]

Traditional post-survey adjustments, such as post-stratification, are used in practice as a one-shot remedy for both errors. The application of these adjustments implicitly assumes that the error mechanism is ignorable in the sense of Little and Rubin (1987). Since the Web survey in this chapter employs a multi-step protocol not found in other surveys, it may not be reasonable to assume ignorability.
Therefore, traditional adjustments may not be effective enough to compensate for coverage and nonresponse errors in Web surveys of this type. Moreover, the fact that these two errors are corrected simultaneously makes it especially difficult to evaluate each error separately. One study (Vehovar and Manfreda, 1999) examined the effect of post-stratification for a Web survey, but its findings are somewhat limited. The sample was considered self-selected due to the ambiguity of the eligibility of the units in the frame, and the standard of comparison came from a telephone survey, which may not be a reliable source for adjustment as it is also subject to coverage and nonresponse errors.

This chapter attempts to evaluate the magnitude of nonresponse and coverage errors in a particular type of Web survey, one that forms and maintains a panel of respondents obtained through probability-based samples. Statistics are known for the Web survey respondents, the Web survey full sample, and the target population. This enables one to carry out a separate examination of the two errors. Section 4.2 will provide a detailed description of the data sources and the variables used in the analysis. Nonresponse properties will be evaluated in Section 4.3. The full sample, which includes both respondents and nonrespondents, will be assumed to provide the true values. Two adjustment approaches, ratio-raking and multiple imputation, will be applied, and the unadjusted and the two types of adjusted respondent estimates will be compared to the true values. Section 4.4 will examine the coverage error. Population estimates from a large government survey will be assumed to be true, and ratio-raking will be used to compensate for coverage error. The deviations of unadjusted and adjusted full sample estimates from the true values will be examined. The last section will summarize findings and raise considerations for future research.

4.2 Data Source

The analysis involves a two-stage adjustment and requires three types of data sets: one for the respondents, one for the full sample, and one for the population. The first two data sets will come from a Web survey and the last from the Current Population Survey (CPS).

4.2.1 Web Survey Data

The Web survey data come from the 2002 Survey Practicum class at the Joint Program in Survey Methodology (JPSM). Data collection was funded jointly by the Bureau of Labor Statistics (BLS) and JPSM. The data were collected through a Web panel survey conducted by KN from August 23, 2002 to November 4, 2002. KN employs the special protocol introduced in Section 4.1 for its Web surveys. Note that the profile data are available for both the Web survey respondents and its nonrespondents, as the KN Web surveys are conducted solely among the panel members. KN drew from its enrolled panel a sample of 2,501 households containing at least one parental figure with at least one child between the ages of 14 and 19. Because later comparisons will be made between the Web survey and the CPS data, households with 18 and 19 year olds are dropped from the analysis to make the two stages of error compensation comparable, since the closest teen age category identifiable in the CPS was 14 to 17. This decreases the full sample to 1,700. Among the sampled units, 978 households completed the Web survey, for a response rate of 57.4%. In order to qualify as a responding household, both the parental figure and the teen were expected to complete the survey.
This requirement might have played a negative role in the response rate. After incorporating nonresponse from the four pre-survey stages examined previously, as well as two additional layers particular to this Web survey due to the teen's involvement, the cumulative response rate becomes 5.5%. This final response rate is the product of the nominal response rate within the survey (57.4%) and the rates at the other stages of the overall survey operation: the panel recruitment rate (36%), Web TV connectability rate (67%), profile completion rate (98%), post-profile survey retention rate (47%), and parent's consent rate for the teen's participation (86%); that is, 0.574 × 0.36 × 0.67 × 0.98 × 0.47 × 0.86 ≈ 0.055.

Two data sets are created by combining the Web survey data and the profile data. The respondent data set $(n = 978)$ is constructed by applying the response status in the Web survey to the profile data. The KN full sample data set $(n = 1{,}700)$ consists of the entire profile data for the eligible sample units. The existence of profile data allows one to examine differences between survey respondents and nonrespondents and to examine various kinds of survey adjustments. The teen profiles are subject to a large amount of item missing data because parental consent was required for the profile survey. Thus, the target population for this analysis focuses only on parents living with at least one teen between 14 and 17 in the same household.

4.2.2 Current Population Survey Data

The population estimates come from the September 2001 CPS. Although the September 2002 CPS would be more appealing in temporal terms, since the Web survey was conducted around that time, this analysis uses the 2001 data because the 2002 data do not include computer and Internet usage, and the distributions of the covariates described in the following section are very close between the September 2001 and September 2002 CPS. This particular wave of the CPS contained the Computer and Internet Use Supplement, which collected information about the Internet and computer usage of the eligible members of the sampled households (for methodological documents about this CPS supplement, refer to http://www.bls.census.gov/cps/computer/2001/smethdocz.htm). When the September 2001 CPS sample is restricted to the scope of the target population defined above, the eligible sample size decreases from 143,300 to 11,290. The CPS target population and its samples include persons living in households that do not have telephones, whereas this type of Web survey starts from the telephone population. This is a source of noncomparability between the coverage of our data set and the CPS, even though only 3.5% of persons in the U.S. fall into the nontelephone category (an estimate based on the 2001 CPS data). However, Web survey organizations often claim that their surveys represent the full population, nontelephone as well as telephone households. To evaluate this claim, we have used estimates based on the full CPS for comparison.

4.2.3 Variables of Interest and Covariates

All variables used in the analysis are available from both data sources. There are four dependent variables whose means will be estimated: number of owned computers in the household (none, one or more); prior Web usage experience (no, yes); employment status (unemployed, employed); and household size (number of household members), denoted $y_1$, $y_2$, $y_3$, and $y_4$.
Estimates based on these variables will be adjusted with respect to the following covariates: age level (20-40, 41-45, 46-50, 51 or older); education level (less than high school, high school, some college, college or above); ethnicity (White non-Hispanic, Black non-Hispanic, other non-Hispanic, Hispanic); region (Northeast, Midwest, South, West); and gender (male, female), denoted $x_1, \ldots, x_5$ in the ratio-raking adjustment or $x_1, \ldots, x_9$ in multiple imputation. In multiple imputation, $x_1$, $x_2$, and $x_9$ are assigned to age, education, and gender, as the first two are treated as continuous and the last as dichotomous; ethnicity and region are polytomous variables with $k = 4$ categories, which require $k - 1 = 3$ binary indicator variables each, so $x_3, x_4, x_5$ are assigned to ethnicity and $x_6, x_7, x_8$ to region. These covariates are selected because they are currently used in KN's existing ratio-raking procedure. (KN's original adjustment includes one additional covariate, household income; because household income has many missing cases in the CPS, this item is excluded from the analysis.) The covariates will serve another function: all categories of all covariates will be the units of subgroup estimation. The reasons for estimating at the subgroup level are twofold. First, studies typically compare Web surveys and traditional surveys at the total population level. Post-survey adjustments may correct the errors in the total population estimates, but not necessarily in the subgroup estimates. The second reason reflects more realistic analytical interests: analyses are often done at the subgroup level to obtain more insightful conclusions than can be drawn at the population level alone. For these reasons, this chapter expands the scope of estimation to the subgroup level.

4.3 Nonresponse Error Adjustment

The nonresponse error examined in this section concerns nonresponse to this particular Web survey among the full sample units, not the cumulative nonresponse for the entire panel. In this section, the full sample will be treated as a simple random sample of the target population, and the weights will not be included in deriving estimates of means. The sample-level response rate, 57.5%, indicates the potential for the presence of nonresponse errors.

Table 4.1. Full Sample and Unadjusted Respondent Estimates of Percentages and Means

                              Full Sample        Unadjusted Respondents
                            Estimate    SE       Estimate    SE    Deviation (a)
  Computer Ownership (%)      79.6     0.98        81.4     1.25      1.8*
  Prior Web Experience (%)    72.0     1.09        71.2     1.45     -0.8
  Unemployment (%)             3.9     0.47         4.1     0.63      0.2
  Household Size               4.2     0.03         4.1     0.04     -0.1**

  *p<.05 **p<.01 ***p<.001
  (a) Deviation = $\hat{\bar{y}}_{\text{Unadjusted Respondent}} - \hat{\bar{y}}_{\text{Full Sample}}$

Table 4.1 compares the total population level estimates for the unadjusted respondents to those of the full sample and includes the initial deviations, $\hat{\bar{y}}_{\text{Unadjusted Respondent}} - \hat{\bar{y}}_{\text{Full Sample}}$. Contrary to the initial speculation, the deviations of the unadjusted statistics are surprisingly small. Since the estimates for the full sample and the respondents are not independent, the variances of the deviations are calculated as follows. Writing the full sample mean as a combination of the respondent and nonrespondent means,

$$\bar{y}_F = \frac{r}{n}\,\bar{y}_R + \frac{n-r}{n}\,\bar{y}_N,$$

where the subscripts $F$, $R$, and $N$ denote the full sample, the unadjusted respondents, and the unadjusted nonrespondents, it follows that

$$\operatorname{var}\left(\hat{\bar{y}}_R - \hat{\bar{y}}_F\right) = \operatorname{var}\!\left[\frac{n-r}{n}\left(\hat{\bar{y}}_R - \hat{\bar{y}}_N\right)\right] = \left(\frac{n-r}{n}\right)^2\left[\operatorname{var}\left(\hat{\bar{y}}_R\right) + \operatorname{var}\left(\hat{\bar{y}}_N\right)\right], \qquad (4.1)$$

where there are $n$ units in the full sample and $r$ respondents, and $\operatorname{cov}(\hat{\bar{y}}_R, \hat{\bar{y}}_N) = 0$ is assumed. This assumption is reasonable here because information on nonrespondents is available from the profile data set.
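As a small illustration of expression (4.1), the standard error of a deviation can be computed directly from the respondent and nonrespondent variances. The following sketch is illustrative only: the sample sizes come from Section 4.2.1, but the variance inputs are placeholders, not estimates from the actual data.

```python
import math

def deviation_se(n, r, var_resp, var_nonresp):
    """SE of (respondent mean - full sample mean) under expression (4.1).

    Assumes cov(ybar_R, ybar_N) = 0, which holds here because the
    respondent and nonrespondent means come from disjoint groups.
    """
    w = (n - r) / n                       # the factor (n - r)/n in (4.1)
    return math.sqrt(w ** 2 * (var_resp + var_nonresp))

# n = 1,700 full sample units and r = 978 respondents (Section 4.2.1);
# the two variances below are hypothetical placeholder values.
print(round(deviation_se(1700, 978, 1.25 ** 2, 1.60 ** 2), 2))
```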
The deviations for computer ownership and household size, although statistically significant, do not appear meaningful.

[Figure 4.2. Distributions of Covariates for Full Sample and Unadjusted Respondents. Chi-square tests for equality of distributions: age χ² = 5.685; education χ² = 2.466; ethnicity χ² = 9.369 (p < .05); region χ² = 1.689; gender χ² = 0.348.]

The distributions of the five covariates are shown in Figure 4.2. The two comparison groups are nearly identically distributed. Based on the chi-square test for equality of distributions, only ethnicity is distributed differently: there are more Whites but fewer Blacks and Hispanics among the respondents than in the full sample, though these gaps are not large. The near-perfect comparability of the unadjusted estimates examined in Table 4.1 and Figure 4.2 may suggest that the respondents represent the full sample, i.e., that the nonresponse occurs completely at random. One important implication of the identical covariate distributions is that statistical adjustments using these covariates will not correct any biases that may exist in variables not examined in this chapter, because the benchmark distributions are the same as the initial ones.

4.3.1 Sample-level Ratio-raking Adjustment

Ratio-raking adjustment is a popular modification of post-stratification that follows the iterative steps described in Deming and Stephan (1940). Unlike cell weighting, ratio-raking controls the marginal distributions of the covariates. This reduces the difficulties that arise with unknown benchmarks or zero observations in cross-classified cells. The marginal counts of the five covariates from the full sample are used as benchmarks. For this study, ratio-raking was performed using WesVar 4.0 (Westat, 2000). Post-survey weights that adjust for sample-level nonresponse are generated and used in the estimation.
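To make the iterative step concrete, here is a minimal sketch of ratio-raking (iterative proportional fitting) on two covariates. It illustrates the Deming and Stephan (1940) algorithm in general form, not the WesVar implementation used in this study; the function, the data, and the convergence tolerance are all illustrative assumptions.

```python
import numpy as np

def rake(weights, covariates, margins, max_iter=50, tol=1e-8):
    """Iteratively scale weights so each covariate's weighted category
    totals match the benchmark margins (Deming & Stephan, 1940)."""
    w = weights.astype(float).copy()
    for _ in range(max_iter):
        max_change = 0.0
        for var, bench in margins.items():
            codes = covariates[var]
            for category, target in bench.items():
                mask = codes == category
                current = w[mask].sum()
                if current > 0:
                    factor = target / current
                    w[mask] *= factor
                    max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:          # all margins matched
            break
    return w

# Illustrative use with made-up data: six sample units, two covariates.
covs = {"gender": np.array([0, 0, 1, 1, 1, 0]),
        "region": np.array([0, 1, 0, 1, 1, 0])}
margins = {"gender": {0: 50.0, 1: 50.0},   # benchmark weighted counts
           "region": {0: 40.0, 1: 60.0}}
w = rake(np.ones(6), covs, margins)
print(w.round(2), w.sum())
```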
4.3.2 Multiple Imputation

Multiple imputation was first suggested by Rubin (1978) for item nonresponse. Although this chapter does not examine item nonresponse, unit nonresponse in this Web survey may be regarded as item nonresponse in a sense: there is ample background information for both survey respondents and nonrespondents. Multiple imputation incorporates the frequentist concept of evaluating estimate variability into a Bayesian imputation approach. Values for the missing observations are imputed by specifying an explicit model that produces posterior predictive distributions of the missing data, conditional on the distribution of the observed data. The models for the three dichotomous variables, $y_1$, $y_2$, $y_3$, are specified as

$$y_i \sim \text{Bernoulli}(\pi_i), \qquad \operatorname{logit}(\pi_i) = \alpha_i + \sum_{j=1}^{9} \beta_{ij} x_j + \varepsilon_i,$$

where $\alpha_i, \beta_{ij} \sim \text{Normal}(0, 1)$ and the $\varepsilon_i$'s are random errors with a mean of zero, for $j = 1, \ldots, 9$ and $i = 1, 2, 3$. Note that the same covariates are adopted here as in the ratio-raking procedure above. Since the $y_i$'s are categorical, they are modeled as having Bernoulli distributions determined by the parameters $\pi_i$. The $\pi_i$'s are predicted by the covariates known for both respondents and nonrespondents. The model parameters, the $\alpha_i$'s and $\beta_{ij}$'s, have normal prior distributions with mean 0 and variance 1. Similarly, the continuous variable, $y_4$, is modeled as

$$y_4 \sim \text{Normal}(\mu_4, \tau), \qquad \mu_4 = \alpha_4 + \sum_{j=1}^{9} \beta_{4j} x_j + \varepsilon_4,$$

where $\mu_4$ is the prior mean of $y_4$, predicted as a linear function of the same series of covariates, with prior distributions $\tau \sim \text{Gamma}(0.5, 1)$ and $\alpha_4, \beta_{4j} \sim \text{Normal}(0, 1)$ for $j = 1, \ldots, 9$, and $\varepsilon_4$ a random error. Note that model fit and modification are not considered here, because the purpose of this chapter is to compare the sample-level ratio-raking adjustment and multiple imputation given the same auxiliary information.

WinBUGS 1.4 (Spiegelhalter et al., 1999) is used for the multiple imputation. The prior distributions of the model parameters are updated by the profile data, and missing values are predicted from the updated values of the model parameters. Each missing value for each nonrespondent is imputed using five different initial values, which result in five different predicted values. Each model stated above is run for 10,000 iterations using the Markov chain Monte Carlo method (details in Gelman et al., 1995, Ch. 1). In order to use samples that produce convergent statistics across the different initial values, the first 2,999 iterations were treated as burn-in. For each chain, imputed values for nonrespondents are combined with observed values from respondents. The estimation and inference follow the procedure in Rubin (1987, Ch. 3).

[Figure 4.3. 95% Confidence Intervals of Deviations of Respondent Estimates from Full Sample Estimates, for computer ownership (%), prior Web experience (%), unemployment rate (%), and household size, overall and by age, education, ethnicity, region, and gender. U: unadjusted respondents; R: ratio-raking adjusted respondents; M: multiply imputed respondents.]

Figure 4.3 displays the 95% confidence intervals for the deviations of the unadjusted (U), ratio-raking adjusted (R), and multiply imputed (M) estimates from the true values. Estimation of the standard errors follows expression (4.1). More specifically, $\operatorname{var}(\hat{\bar{y}}_N)$ and $\operatorname{var}(\hat{\bar{y}}_R)$ for the unadjusted $\hat{\bar{y}}_R$ are calculated with the variance formula for simple random samples. For the ratio-raking adjusted $\hat{\bar{y}}_R$, $\operatorname{var}(\hat{\bar{y}}_R)$ is obtained from WesVar 4.0. Variance estimation for the multiple-imputation adjusted $\hat{\bar{y}}_R$ uses the procedure described in Rubin (1987). If the intervals contain zero, the deviations are not statistically significant, leading to the conclusion that the nonresponse error is negligible. Figure 4.3 shows that most deviations are not significant at either the total population or the subgroup level. The deviation in household size appears to be statistically significant but is not especially meaningful. When examined by subgroup, estimates for the racial/ethnic groups are the most likely to diverge from the true values. It is interesting that the U, R, and M estimates are not very different from one another, especially given a sample nonresponse rate of 42.5%. In terms of deviation and variance, the performance of ratio-raking and that of multiple imputation are almost equivalent. Recall that the preliminary analysis showed that the unadjusted estimates for all variables match the full sample values well. Nonresponse adjustments on these variables might have been unnecessary after all.
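Once the five completed data sets are in hand, the estimation and inference step can be summarized in a few lines. The following is a minimal sketch of Rubin's (1987, Ch. 3) combining rules; the numerical values are illustrative only.

```python
import numpy as np

def combine_imputations(estimates, variances):
    """Combine point estimates and their within-imputation variances
    from m completed data sets using Rubin's (1987, Ch. 3) rules."""
    q = np.asarray(estimates, dtype=float)   # one estimate per imputation
    u = np.asarray(variances, dtype=float)   # its variance per imputation
    m = len(q)
    q_bar = q.mean()                         # combined point estimate
    w_bar = u.mean()                         # within-imputation variance
    b = q.var(ddof=1)                        # between-imputation variance
    t = w_bar + (1 + 1 / m) * b              # total variance
    return q_bar, t

# Illustrative values for one survey mean from five imputations:
est, var = combine_imputations([4.10, 4.12, 4.08, 4.11, 4.09],
                               [0.0016, 0.0015, 0.0016, 0.0017, 0.0016])
print(round(est, 3), round(var ** 0.5, 4))   # estimate and its SE
```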
4.4 Coverage Error Adjustment

The coverage error in this analysis is not due solely to problems with the frame coverage per se. It also includes the combined response status from the four pre-survey stages. Unlike traditional surveys, where the full samples represent the target populations through the sampling frames, this Web survey may not have a reliable sampling frame, because there are multiple chances to lose potentially eligible people systematically. In other words, frames built only on the active panel members may be biased to begin with. The population values used for comparison are calculated by applying the final weights provided in the CPS public use data and are assumed to be true values. The full sample Web survey estimates are calculated by applying the base design weights to the 1,700 cases in the full sample data set. Since the sample design variables are not provided in the CPS public use data and the CPS data analyzed for this study are truncated, direct calculation of the standard error of the CPS estimates is impossible. Instead, the following ad hoc formula is used for calculating the standard error of the biases:

$$se\left(\bar{y}_{CPS} - \bar{y}_{Sample}\right) = \sqrt{\operatorname{var}\left(\bar{y}_{CPS}\right) + \operatorname{var}\left(\bar{y}_{Sample}\right)} = \sqrt{k \operatorname{var}\left(\bar{y}_{Sample}\right) + \operatorname{var}\left(\bar{y}_{Sample}\right)} = \sqrt{1 + k}\; se\left(\bar{y}_{Sample}\right), \qquad (4.2)$$

where $se(\bar{y}_{Sample})$ is the standard error of the full sample estimate and $k$ is some constant based on the ratio of the Web survey sample size to the CPS sample size. It should be noted that (4.2) is a crude approach to deriving variance estimates, because it assumes that the variability of an estimate is purely a function of the sample sizes.

Table 4.2. Population and Unadjusted Full Sample Estimates

                               CPS             Unadjusted Full Sample (a)
                            Estimate    SE       Estimate    SE    Deviation (b)
  Computer Ownership (%)     80.85     0.57       77.45     1.29     -3.40**
  Prior Web Experience (%)   65.81     0.62       70.91     1.39      5.10***
  Unemployment (%)            2.59     0.19        4.11     0.59      1.52**
  Household Size              4.34     0.02        4.19     0.04     -0.15***

  *p<.05 **p<.01 ***p<.001
  (a) Design weighted full sample.
  (b) Deviation = $\hat{\bar{y}}_{\text{Unadjusted Full Sample}} - \hat{\bar{y}}_{\text{CPS}}$

Unlike the previous section, the comparison between the true values and the unadjusted full sample estimates suggests potential coverage problems, as shown in Table 4.2. The weighted full sample estimates, when not adjusted, stray significantly from the population values. This is most obvious for computer ownership and prior Web experience: people in the frame are less likely to own computers but more likely to have Web experience. Moreover, remarkable inconsistencies in the covariates can be found in Figure 4.4, especially for education and ethnicity. It becomes imperative to remedy these discrepancies.

[Figure 4.4. Distributions of Covariates for CPS and Unadjusted Full Sample. Chi-square tests: age χ² = 20.8 (p < .001); education χ² = 504.5 (p < .001); ethnicity χ² = 114.4 (p < .001); region χ² = 28.0 (p < .001); gender χ² = 1.62.]

The coverage properties are examined by replicating at the population level the same ratio-raking procedure used in the previous section. The final adjustment weights are computed by ratio-raking the Web survey base weights to the covariate marginal counts from the CPS. The base weights are provided by KN. Both the base weights and the ratio-raking weights are included simultaneously in the estimation, using WesVar 4.0.
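Returning to expression (4.2), the ad hoc standard error can be computed directly once a value of $k$ is chosen. The sketch below assumes, purely for illustration, that $k$ is taken as the ratio of the two sample sizes; since the text defines $k$ only loosely, this choice is an assumption, not the value actually used in the analysis.

```python
import math

def adhoc_se(se_sample, n_sample, n_cps):
    """Ad hoc SE of a deviation from the CPS value, expression (4.2).

    Assumes var(CPS) = k * var(sample) with k = n_sample / n_cps, i.e.,
    that variances scale inversely with sample size; this is the crude
    approximation the text warns about.
    """
    k = n_sample / n_cps
    return math.sqrt(1 + k) * se_sample

# Prior Web experience: se_sample = 1.39 (Table 4.2), n_sample = 1,700
# (Web full sample), n_cps = 11,290 (the restricted CPS file).
print(round(adhoc_se(1.39, 1700, 11290), 2))
```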
Imputation is not used for the evaluation of coverage error, because imputation is designed for data with item nonresponse: some information must be available about the units whose values are to be imputed. In this case, we have no information about units in the target population other than those in the full sample, so it is impossible to impute values for the nonsampled units in the target population.

The 95% confidence intervals of the deviations of the unadjusted (U) and ratio-raking adjusted (R) estimates from the population values are shown in Figure 4.5. If the ratio-raking procedure were effective in reducing the bias in the estimates due to coverage error, Figure 4.5 would show confidence intervals of the deviations that are more likely to contain zero for the R estimates than for the U estimates. Roughly speaking, the adjustment seems to make only a trivial improvement. The adjusted values are still closer to the unadjusted ones than to the population figures. Significant deviations still exist, and they become more conspicuous for the subgroup estimates. Discrepancies are most prevalent for the education and ethnicity subgroups. This coincides with the divergence found in Figure 4.4. Although this divergence is supposed to be corrected by ratio-raking, estimates for subgroups formed by these covariates remain distant from the true values.

[Figure 4.5. 95% Confidence Intervals of Deviations of Full Sample Estimates from CPS Comparison Estimates, for computer ownership (%), prior Web experience (%), unemployment rate (%), and household size, overall and by age, education, ethnicity, region, and gender. U: unadjusted full sample; R: ratio-raking adjusted full sample.]

Persons with less than a high school education report prior Web experience at a far higher rate in the Web survey than in the CPS; the percentage in the Web sample is about 20 percentage points higher. One explanation may be a misunderstanding by persons in the Web sample of what "Web" experience means. Another explanation may be that people with lower education in the Web sample, before they join the KN panel, tend to own fewer computers and are more likely to be unemployed, but have had experience with the Internet at a higher rate than their counterparts in the population. These people are likely to have more time, and thus more potential opportunities to access the Internet, but are less able to afford computers because they are unemployed. This may make them react more positively than persons with higher education to obtaining free access to the Web, inducing them to stay active on the panel to maintain that access.

The discrepancies in computer ownership and Web usage by ethnicity also warrant attention. The Web sample seems to include technology-savvy Blacks and Hispanics at a higher rate than the CPS does. Both the unadjusted and adjusted sample estimates of computer ownership for Blacks and Hispanics are 10 percentage points higher than the population values, and the same racial/ethnic groups in the Web sample have higher levels of Web experience than their counterparts in the population; the full sample overestimates their Web experience by well over 20 percentage points.
Interestingly, Whites in the Web sample are somewhat less technologically experienced than those in the population, as measured by computer ownership and Web experience. This suggests that the Web sampling frame coverage differs systematically from the population with respect to ethnicity. Ratio-raking does not seem to be a sufficient solution.

4.5 Discussion

This chapter is one of the first examinations of statistical adjustment approaches for Web surveys. The respondents in the particular survey studied seemed to represent the full sample well, although the completion rate was fairly low. Consequently, the sample-level nonresponse adjustment was not even necessary, at least for the variables examined in this chapter. This is similar to recent findings about nonresponse (e.g., Curtin et al., 2000; Keeter et al., 2000; Merkle and Edelman, 2002). Additionally, the covariate distributions for the respondents and the full sample were very close. This implies that adjustments based on these characteristics may not much improve the estimates based on respondents. However, it does not seem safe to conclude that the Web sample frame adequately covers the population. Estimates for the subgroups whose population and sample covariate distributions showed inconsistencies tended to deviate significantly from the population values. Traditional adjustments like raking had a limited effect in correcting this deviation. Thus, the results fail to support the assumption, inherent in the ratio-raking procedure, that the coverage mechanism is ignorable.

Three points should be made about the implications of this chapter. First, they apply only to this particular type of Web survey and this particular topic. Other Web survey protocols targeting the general population are considered less scientific, as they often rely on convenience or volunteer samples, and thus may have completely different error structures. Second, coverage and nonresponse errors are properties of a statistic, not of a survey. Other statistics may show different nonresponse and coverage properties. The statistics in this chapter were selected because they are available at the respondent, full sample, and population levels. Third, the target population of this chapter is very specific: parental figures with at least one teen household member. This population may have different nonresponse and coverage properties in this Web panel sample than other populations would. Findings in this chapter can serve as a window into those error mechanisms, but they cannot be generalized.

This chapter found that the coverage errors of this Web panel survey were more severe than the nonresponse errors conditional on the RDD survey response. However, the full sample already reflects multiple stages of nonresponse prior to the survey, which were captured under the coverage error examination in this chapter. Coverage errors from nonresponse or noncooperation in the procedures of recruiting and maintaining panel members may be more serious than those in the actual survey. Further investigations to statistically disentangle the coverage and nonresponse mechanisms at each stage would be informative. If consistent evidence against ignorability of the error mechanism is found, more innovative adjustment methods will be needed for sound inferences from Web survey data.

Chapter 5: Propensity Score Adjustment

5.1 Introduction

One of the most common ways of presenting scientific research results is group comparison.
Especially in medical research reporting, it is not unusual to encounter such comparisons. For example, a report may claim that a health survey found that people who consume a recommended amount of vegetables have a lower risk of cancer than people who do not. One notable fact about the comparison is that it tacitly implies a causal relationship. The report may seem reasonable prima facie, although the study design, an observational survey, does not necessarily provide grounds for such a finding. A closer examination may reveal that the claim relies on an assumption that sufficient vegetable consumption alone may decrease cancer risk, whereas control over other factors is not assured in the study. A fundamental problem with the comparison above is that the two groups, high and low vegetable consumers, may differ with respect to not only diet pattern but also other characteristics, such as age, gender, race, education, and health status. This occurs because the study uses observational data in which the assignment of the study subjects to the two groups being compared is not guaranteed to be random. Unless the study sufficiently controls for conditions other than the experimental factor under study, so that study subjects are balanced with respect to those other conditions, the difference in cancer prevalence between the groups may be no more than an artifact.

Randomization, although desirable, is impractical, unethical, or impossible in many cases. In a controlled lab experiment on the effect of vegetable consumption, randomization may be possible, but the generalization of such experimental findings may be problematic. The experiment may also be unethical, considering that the study outcome may have a direct link with cancer risk. Observational studies are the only alternative in this example, and it is impossible for the researcher to make one randomly assigned group of people eat more vegetables and the other eat less. The control is out of the researcher's hands, and those unrandomized conditions may lead to confounding the effect of interest with other uncontrolled effects. The researcher is confined to what is available and, in order to solve this problem, may use a statistical approach to control for the undesirable confounding effects.

In the context of Web surveys, the experimental treatment is translated into "being in a Web survey" or "having Web access." The selection of people under this condition is assumed to be nonrandom. The control treatment is the complement, but persons receiving the control are assumed to be randomly selected from the target population. By the same statistical approach used to remedy confounding as described above, the experimental group may be adjusted to resemble the control group, so that the randomness in the control group is borrowed for the Web survey group.

5.2 Treatment Effect in Observational Studies

5.2.1 Theoretical Treatment Effect

In this section, we summarize some of the considerations in estimating treatment effects, based on Rosenbaum and Rubin (1983). Let the theoretical underlying treatment effect in the superpopulation, $U$, be denoted $\tau = \mu_1 - \mu_0$. The outcome under the experimental condition is $\mu_1$, the mean of the $\mu_{1i}$, the outcomes of all individuals in $U$, where $i \in U$ and $U$ has $N$ units. The control group outcome is $\mu_0$, the mean of the $\mu_{0i}$. Theoretically, the treatment effect is obtainable for each unit $i$ in $U$ as $\tau_i = \mu_{1i} - \mu_{0i}$.
The overall treatment effect is calculated over all units in $U$ as

$$\tau = \frac{1}{N}\sum_{i \in U} \tau_i.$$

In the finite population approach, the treatment effect is realized as $t$, the mean of the individual treatment effects $t_i$, where unit $i$ belongs to the population $U$, $i = 1, \ldots, N$. Therefore,

$$t = \frac{1}{N}\sum_{i \in U} t_i = \frac{1}{N}\sum_{i \in U} \left(t_{1i} - t_{0i}\right) = t_1 - t_0.$$

Theoretically, the treatment effect $t$ is obtained when all units in the population are exposed to both the control and the experimental condition, so that the realized treatment effect for the $i$th unit alone, $t_i = t_{1i} - t_{0i}$, is computable. In reality, whether the study is experimental or observational, only a set of sampled units from the population is examined, and the study subjects are exposed to only one condition. We observe either $t_{1i}$ or $t_{0i}$ for the $i$th unit, but not both. Assume that the study units in an experiment come from two separate simple random samples, one under the experimental condition ($s_1$) with $n_1$ units, the other under the control condition ($s_0$) with $n_0$ units. From such a study, we obtain an estimate of the treatment effect:

$$\hat{t} = \hat{t}_1 - \hat{t}_0 = \frac{1}{n_1}\sum_{i \in s_1} t_{1i} - \frac{1}{n_0}\sum_{i \in s_0} t_{0i}.$$

Therefore, the computation of the treatment effect always involves some degree of speculation about the unobserved components and unexamined population units. Let $M$ be a mechanism under which the experimental and control treatments are repeatedly assigned to all units an infinite number of times. Under this mechanism, we may expect $E_M(t_1) = \mu_1$, $E_M(t_0) = \mu_0$, and $E_M(t_1 - t_0) = \tau$, where $E_M(\cdot)$ is the expected value over $M$. The mechanism $M$ is assumed to be satisfied as $N \to \infty$. What we need is to link our sample estimates, $\hat{t}_1$ and $\hat{t}_0$, to the finite population quantities, $t_1$ and $t_0$, which approximate the underlying superpopulation figures, $\mu_1$ and $\mu_0$, through $M$. This linkage may be guaranteed under randomization of the treatment assignment distribution, denoted $\delta$, such that $E_\delta(\hat{t}_1) = t_1$, $E_\delta(\hat{t}_0) = t_0$, and $E_\delta(\hat{t}_1 - \hat{t}_0) = t$. As long as the condition of the $i$th unit is not dependent on that of the $j$th unit in the same sample, implying that there is no interference between subjects, the average treatment effect becomes

$$E_M E_\delta\left(\hat{t}_1 - \hat{t}_0\right) = \tau, \qquad (5.1)$$

where $E_\delta(\cdot)$ is the expected value over the randomized assignment mechanism, $\delta$. The requirement for (5.1) is that we must be able to estimate $E_M E_\delta(\hat{t}_1)$ and $E_M E_\delta(\hat{t}_0)$ from the observed data, $s_1$ and $s_0$. Note that $\tau$ is the intended effect, not the actual effect. The actual effect may include an unintended component arising from imperfect or incomplete randomization, as study units may opt to drop out of the study, cross over between the assigned groups, or affect one another.

In order to estimate the average treatment effect from observed data, a stable unit treatment value assumption (SUTVA) must hold. Under SUTVA, $t_i = t_{1i}$ if $g_i = 1$ (treatment group), and $t_i = t_{0i}$ if $g_i = 0$ (control group), for all units. Thus, the outcome for the $i$th unit can be expressed as $t_i = t_{1i} g_i + t_{0i}(1 - g_i)$, where $g_i$ is 0 or 1. SUTVA implies that there is no interference among study subjects, meaning that the potential outcomes for each unit are not related to the treatment status of other units. In addition to SUTVA, independence between the outcome and the treatment assignment is needed. When two random variables, $x$ and $y$, are independent, we write $x \perp y$. If $(t_1, t_0) \perp g$, then $E_M E_\delta(\hat{t}\,|\,g = 1) = E_M E_\delta(\hat{t}_1\,|\,g = 1) = E_M E_\delta(\hat{t}_1)$ and $E_M E_\delta(\hat{t}\,|\,g = 0) = E_M E_\delta(\hat{t}_0\,|\,g = 0) = E_M E_\delta(\hat{t}_0)$.
Thus, the estimated average treatment effect is equal to $\tau$:

$$E_M E_\delta\left(\hat{t}\,|\,g = 1\right) - E_M E_\delta\left(\hat{t}\,|\,g = 0\right) = E_M E_\delta\left(\hat{t}_1\right) - E_M E_\delta\left(\hat{t}_0\right) = \tau. \qquad (5.2)$$

The unbiasedness of the treatment effect estimate in (5.2) is guaranteed only under randomization with large samples.

5.2.2 Inherent Problems of Treatment Effect Estimation in Observational Studies

The unbiasedness in (5.2) does not hold in observational studies, because factors affecting the group assignment, $g$, are beyond the researchers' control, as examined in Section 5.1. The resulting treatment effect estimates may inherently reflect discrepancies between the treatment and the control group with respect to some demographic characteristics, behaviors, and/or attitudes. These attributes may confound the true treatment effect, $\tau$. Let us return to the example in Section 5.1, the study of the effect of vegetable consumption on cancer risk. Suppose the researcher finds that high vegetable consumption decreases cancer risk, but also finds that there are more females in the high vegetable consumption group and that females show a lower level of cancer than males. The question becomes whether the differentiation in cancer risk is attributable to the amount of vegetables eaten or to gender. A sensible step toward resolving this dilemma is to compare the cancer risk between the groups within the same gender.

Generally speaking, lab experiments or cross-national surveys do not collect data on only one study variable. Often, the data are analyzed for underlying relationships among variables. This means that the collected data readily contain variables that are related to the study variables, namely covariates. When the covariate means differ in the two comparison groups, standard practice is to adjust for such differences when comparing the means of the outcome variables. Analogous to controlling for the gender effect in the example above, one can imagine adjusting the treatment effect using auxiliary information in most studies.

5.3 Bias Adjustment Using Auxiliary Information

5.3.1 Covariates for Bias Adjustment

If the study subjects differ systematically with respect to a set of covariates, $\mathbf{x}$, other than the assigned group characteristic, $g$, the realized outcome of the $i$th unit in the treatment group and that of the $j$th unit in the control group can be modeled as

$$t_{1i} = \mu_1 + u(\mathbf{x}_{1i}) + e_{1i}, \qquad t_{0j} = \mu_0 + u(\mathbf{x}_{0j}) + e_{0j}, \qquad (5.3)$$

where $u(\mathbf{x})$ is a function of $\mathbf{x}$, a matrix of auxiliary variables, and $e_{1i}$ and $e_{0j}$ are random residuals with zero means. This implies that $t_{1i}$, the outcome of the $i$th unit in the experimental treatment group, may deviate from $\mu_1$, the true study outcome of that group, by $u(\mathbf{x}_{1i})$, its own distinctive characteristics, and $e_{1i}$, some random effect. The same is true for the individual unit outcomes in the control group. The comparison of the outcomes should reflect the grouping characteristic only; otherwise, the imbalance in the distributions of $\mathbf{x}_1$ and $\mathbf{x}_0$ confounds the comparison. When this confounding effect of the covariates is not adjusted out, the expected treatment effect becomes biased:

$$E_M\left(\hat{t}_1\right) - E_M\left(\hat{t}_0\right) = \mu_1 - \mu_0 + \bar{u}_1 - \bar{u}_0 = \tau + \left(\bar{u}_1 - \bar{u}_0\right), \qquad (5.4)$$

where $\bar{u}_1 = \int u(\mathbf{x})\,\phi_1(\mathbf{x})\,d\mathbf{x}$ and $\bar{u}_0 = \int u(\mathbf{x})\,\phi_0(\mathbf{x})\,d\mathbf{x}$, and $\phi_1(\mathbf{x})$ and $\phi_0(\mathbf{x})$ are the frequency functions of the covariates in the comparison groups. The expected value in (5.4) is over repeated applications of the treatments to units.
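For a concrete sense of the bias term in (5.4), suppose, purely for illustration, that there is a single covariate with $u(x) = x$, and that its mean is 0.5 in the treatment group and 0 in the control group. Then

$$\bar{u}_1 - \bar{u}_0 = 0.5 - 0 = 0.5,$$

so the naive difference in group means overstates $\tau$ by 0.5 no matter how large the samples are.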
Note that the expected 65 effect in (5.4) assumes that there is no interaction between the treatment effect and the covariates. By comparing (5.1) and (5.4), it is clear that the treatment effect is biased by 10 uu? . This bias may be removed or reduced by balancing the covariates between the two groups. The problem of achieving the balance in estimating ? arises when x takes a high dimension. It is not practical to obtain equivalent distribution on many covariates, although theoretically desirable. An alternative is to summarize all covariates into one quantity and either balance or adjust based on this summary measure. Propensity score adjustment is the effective and intuitive method that serves this purpose, as it uses available covariate information and provides a scalar quantity for each unit, while requiring a minimal set of assumptions. 5.3.2 Balancing Score For treatment effect estimation, covariates may be balanced on a function, ( )b x . An appropriately constructed balancing score ( )b x has the property that the treatment assignment is conditionally independent of the covariates given ()b x . That is, the distribution of x conditional on ( )b x is the same for both treatment groups. It can be mathematically expressed as ( )|gb?xx , (5.5) where ()b x is called a balancing score as it balances out the distributional imbalance in covariates between the comparing groups. The finest balancing score is x , the covariates themselves, but this is not practical as discussed above. While many functions of x can 66 serve as balancing scores, the propensity score, ( ) ( ) { } efb=xx, is frequently used. The propensity score takes the coarsest form of the balancing score. We discuss these scores in the next section. 5.3.3 Propensity Score 5.3.3.1 Bias Reduction by Propensity Scores A propensity score is simply the probability of a unit being assigned to the treatment group ( 1g = ) given a set of covariates and is denoted as ( ) ( )Pr 1| iii eg==xx, (5.6) where ()()() {} ()1 11 1 Pr ,..., | ,..., 1 i i n gg nn i i gg e e ? = =? ? xx x x is assumed and () i e x is a scalar with a value between 0 and 1. Since the propensity score is a type of balancing score, the conditional independence holds as (5.5); ( )|ge?xx . Returning to the earlier model in (5.3), if ( ) ( )ue=xx and if the unit i from the treatment group and the unit j from the control group have the same propensity scores, the difference between these two units becomes confounder-free because () ( )() 10 1 1 1 0 0 0 10 1 0ij i i j j i j tt e e e e ee?? ? ?? ?=+ +?+ + =?+? ?? xx . (5.7) Omitting the subscripts, i and j , the expected value over model (5.3) is then () 10 1 0M Ett ? ???=?=, because ( ) ( ) 10 0 MM Ee Ee= = . More formally, following Rosenbaum and Rubin (1983, Sec. 2.2), when a treatment and control unit have the same 67 propensity score, ()e x , and the treatment assignment is strongly ignorable (see Section 5.3.3.2), ( )()( )( ) ()() ()() ()() 10 10 10 |,1 |,0 | | | . MM MM M Eteg Eteg Ete Ete Ette = ?= =? =? xx xx x (5.8) That is, the expected difference in observed responses for two units with the same ( )e x is equal to the average treatment effect at the propensity score, ()e x . When averaged over the distribution of the propensity score in the population, we have () ( )( ) () ( )( ) () ()() 10 10 10 |,1 |,0 | , MMee Me EEte g EEte g EEtte ?? ? = ?= =? =? = xx x xx x (5.9) since, by definition, the effect of the treatment is the average of the effects for the individuals in the population. 
As long as $e(\mathbf{x})$ contains all potential confounders, adjustment based on the propensity score will lead to an unbiased estimate of the treatment effect in expectation. In words, strong ignorability means that, given a score $e(\mathbf{x})$, the assignment of a unit to the treatment or control group ($g = 1$ or 0) and the outcomes for the unit ($t_{1i}$ or $t_{0i}$) are independent. If a group of units with the same propensity score were randomly divided between the treatment and control groups, (5.8) implies that we would obtain an unbiased estimate of the treatment effect for units that all share that propensity score.

As discussed earlier, treatment means "being in a Web survey" in the Web survey context. In Chapter 6, we will apply propensity score adjustment to create groups of units with approximately the same propensity of being in a Web survey within each group. The aim is to create groups such that $\mu_1$ for the Web sample persons equals $\mu_0$ for the non-Web sample within each propensity score group, thus allowing the Web sample to be used to make inferences for the target population.

5.3.3.2 Assumptions in Propensity Score Adjustment

When the propensity score is used to adjust for biases in observational studies, bias reduction is attainable as long as five assumptions hold. First, any propensity score should meet the strong ignorability assumption:

$$\left(t_1, t_0\right) \perp g \,|\, e(\mathbf{x}), \qquad (5.10)$$

with $0 < \Pr(g = 1 \,|\, e(\mathbf{x})) < 1$. One bias reduction alternative in econometrics, associated with Heckman's sample selection approach, defines treatment receipt through an observed selection variable $g^*$ and a threshold, with $g = 1$ when $g^*$ meets the threshold and $g = 0$ when $g^*$ falls below it. Suppose $g$ here defines eligibility for a certain job training program, $g^*$ is hours worked per week, and the threshold for eligibility is 10: if a person works more than 10 hours per week, he is automatically entitled to enroll in the program.

Another bias reduction method in econometrics incorporates instrumental variables and is known as the Rubin Causal Model. It was first outlined by Angrist, Imbens, and Rubin (1996) for situations where the treatment is randomly assigned to units, but study units comply with the assignment imperfectly, resulting in nonignorable receipt of the treatment. The initial assignment is used as an instrumental variable, whose influence on the treatment outcome is assumed to operate only through actual compliance. In other words, the instrumental variable is highly correlated with treatment receipt but not otherwise with the treatment outcome. An example of such a case is the military draft lottery discussed in the authors' article. Under the set of assumptions listed in Angrist, Imbens, and Rubin (1996), the treatment effect incorporating both treatment assignment and receipt identifies the average causal effect without selection bias.

Neither econometric method has been applied extensively, owing to their shortcomings relative to propensity score adjustment. More specifically, Heckman's approach uses a two-step procedure to construct a variable that controls for bias due to unobserved sources associated with treatment selection, and its sample selection models account for unobserved sources of bias only if the distributional assumptions are valid. The variable that controls for selection bias should be correlated with the selected treatment but not with the treatment outcomes (Crown, 2001). Instrumental variable estimation has been criticized for strong behavioral assumptions that may not hold in reality (Heckman, 1997). Another limitation is that this method derives the causal effect only for the compliers and, hence, ignores the other nonignorable components of treatment receipt.
As in Heckman's method, the instrumental variable method also requires variables that control for selection bias by being correlated with the treatment selection but uncorrelated with the treatment outcomes (Crown, 2001), and such variables are not easy to find. Moreover, the two econometric approaches rest on unrealistic distributional assumptions, are very sensitive to model specification details, and quickly become complex (see Obenchain and Melfi, 1998). These limitations lower the applicability of econometric selection methods; thus, they are excluded from further discussion.

Outside of econometrics, Cook and Goldman (1989) compared analyses based on the propensity score method with a multivariate confounder score method in nonrandomized epidemiological research. The authors found that the propensity score method is less affected by high correlation between the treatment (or exposure) level and the confounders than the multivariate confounder score.

5.4 Methods for Applying Propensity Score Adjustment

Three methods for applying propensity score adjustment can be identified in the literature. The first approach matches two units on the propensity score, one from the treatment group and one from the control group, to form a pair; the group comparison is done within a given propensity score, and the average treatment effect is calculated over all matched propensity scores. Subclassification is the second method: from a combined pool of subjects from both conditions, units are stratified on the propensity score so that $e(\mathbf{x})$ is approximately constant for all units within each stratum, and the expected difference between the two assignments at a given propensity score equals the average treatment effect. In the third method, propensity scores enter a covariance adjustment in a linear response model. The detailed operationalization of the three methods is discussed below (see Rosenbaum and Rubin, 1983, 1984, and D'Agostino, 1998, for a review).

5.4.1 Matching by Propensity Scores

Matching is a natural approach to bias reduction when the cost of experimentation is high and a large reservoir of units under the control condition is available. In fact, most methodological studies of propensity score application concentrate on matching, especially pair matching. This may be because propensity score adjustment originated in causal inference studies, where only a small portion of the population is exposed to the experimental condition, making the control group much larger than the treatment group. The basic idea in matching is to compare treated units only with control units whose covariates show similar distributions. Matching is first illustrated here in terms of a univariate covariate $x$, as in Rubin (1973). Suppose there is a random sample of size $n$, denoted $S_1$, from a treatment-group ($g = 1$) population $P_1$, and a larger random sample of size $m = kn$ with $k \ge 1$, denoted $S_0$, from a control-group ($g = 0$) population $P_0$. It is further assumed that $x$ is recorded for all subjects in $S_1$ and $S_0$. All subjects in $S_1$ are to be matched to counterparts selected from $S_0$: based on $x$, a subsample of size $n$, denoted $S_0^*$, is drawn from $S_0$ such that each unit in $S_0^*$ has a value of $x$ equivalent to that of a unit in $S_1$. The treatment effect is estimated from $S_1$ and $S_0^*$. If $k = 1$, purposeful matching is not attainable, as $S_0^*$
is in essence a random sample of $P_0$; in this case, the bias due to imbalanced $x$ is retained. If $k \to \infty$, a perfect match between $S_1$ and $S_0^*$ is highly feasible, and the bias may be reduced, if not removed.

Rubin (1973) documented three simple approaches for constructing $S_0^*$ in pair matching. They all assign to each $s_{1i} \in S_1$, $i = 1, \ldots, n$, the closest match, based on $x$, from the unmatched units $s_{0j} \in S_0$, $j = 1, \ldots, m$, with $m = kn$ and $k \ge 1$. The selection mechanism for $S_0^*$ is completely defined by how the order of the $s_{0j}$ is specified: 1) random ordering (units are randomly ordered); 2) low-high ordering (the not-yet-matched unit with the lowest $x$ score is matched next); and 3) high-low ordering (the not-yet-matched unit with the highest $x$ score is matched next). All three methods show similar bias reduction patterns: unless the ratio of the treatment group variance to the control group variance of the matching variable exceeds 1, all three ordering methods attain sizable bias reduction (Rubin, 1973).

So far, matching has been examined in terms of a univariate $x$. In practice, matching must handle multiple covariates, and exact matching on all of them is impossible. Instead, matching is carried out using the estimated propensity score $\hat{e}(\mathbf{x})$ from (5.6), in order to equalize all covariate distributions between the treatment and control groups. The matching methods and bias reduction patterns examined for a univariate $x$ should apply similarly when $\hat{e}(\mathbf{x})$ is used.

Rosenbaum and Rubin (1985) compared three multivariate matching methods using propensity scores. Each of these methods resembles the nearest neighbor hot-deck method used for imputation in sample surveys (Little and Rubin, 2002, p. 69). They differ by the level of importance given to the estimated propensity score relative to the other auxiliary variables in $\mathbf{x}$. Under the first method, nearest available matching, the first subject in a randomly ordered $S_1$ is matched with the subject in $S_0$ having the nearest $\hat{e}(\mathbf{x})$; both subjects are removed from the lists, and the same procedure continues for the remaining unmatched subjects in $S_1$. The other two methods rely on the Mahalanobis metric computed from all auxiliary variables together with the propensity score, using the distance function

$d(\mathbf{u}, \mathbf{v}) = (\mathbf{u} - \mathbf{v})' C_{S_0}^{-1} (\mathbf{u} - \mathbf{v})$,

where $\mathbf{u}$ and $\mathbf{v}$ are values of $\{\mathbf{x}', \hat{e}(\mathbf{x})\}'$ and $C_{S_0}$ is the sample covariance matrix of $\{\mathbf{x}', \hat{e}(\mathbf{x})\}'$ in $S_0$. The second method, nearest available Mahalanobis metric matching, matches units as in the first method but with respect to the Mahalanobis distance. The third approach is nearest available Mahalanobis metric matching within calipers: for a unit in the randomly ordered $S_1$, a subset of $S_0$ is formed from all available subjects whose $\hat{e}(\mathbf{x})$ lies within a specified constant of the unit's score (this range is the caliper), and the subject in that subset closest to the unit with respect to the Mahalanobis distance is selected. Rosenbaum and Rubin (1985) demonstrated that the third method is superior to the other two with respect to the balance in covariates and in propensity scores. This is a reasonable finding, since nearest available Mahalanobis metric matching within calipers uses all covariates and propensity scores and takes advantage of the first two matching methods.
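A minimal sketch of the third method, nearest available Mahalanobis metric matching within propensity score calipers, is given below. It is not the code used in this study; the function, data frame, and column names are invented for the example, and the covariates are assumed numeric.

    # Illustrative sketch of nearest available Mahalanobis matching within calipers.
    # 'treat' and 'control' are all-numeric data frames of covariates plus a
    # column 'ps' holding the estimated propensity score; 'caliper' bounds |ps diff|.
    match.within.calipers <- function(treat, control, caliper = 0.05) {
      C.inv   <- solve(cov(control))          # Mahalanobis metric from S0
      matched <- integer(nrow(treat))
      free    <- rep(TRUE, nrow(control))     # control units still available
      for (i in sample(nrow(treat))) {        # randomly ordered S1
        ok <- free & abs(control$ps - treat$ps[i]) <= caliper
        if (!any(ok)) { matched[i] <- NA; next }   # incomplete match
        d <- mahalanobis(as.matrix(control[ok, ]),
                         as.numeric(treat[i, ]), C.inv, inverted = TRUE)
        j <- which(ok)[which.min(d)]
        matched[i] <- j                       # closest available control unit
        free[j]    <- FALSE                   # remove it from the reservoir
      }
      matched                                 # S0 row index matched to each S1 unit
    }

An NA entry signals an incomplete match, which, as discussed next, is more harmful to bias than an inexact one.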
There is, however, an issue concerning the degree of closeness of a matched pair. Rosenbaum and Rubin (1985b) compared inexact matching (failure to match on the exact covariate score) with incomplete matching (failure to match all units in the treatment group). The study showed that incomplete matching has a higher likelihood of retaining severe bias in the treatment effect than inexact matching. The authors recommended using an appropriate nearest multivariate matching method to complete the matching, even if this leaves some residual bias due to inexact matching.

Pair matching of a Web sample to the nonsampled part of a population has limited relevance to finite population estimation. While matching Web respondents to nonresponding or nonsampled cases from a larger pool constructed by randomization might be feasible, the interest here is in estimating population means, totals, and other population quantities. No data other than covariates are available for the nonsampled units, and estimating the difference between Web sample and nonsample quantities is neither possible nor of interest.

5.4.2 Subclassification by Propensity Scores

All units in the treatment and control groups may be combined into one pool and partitioned into a number of subclasses based on the covariate distributions, such that each subclass has a restricted range of covariate values. This idea was first presented in Cochran (1968), with the underlying rationale that units within one subclass become comparable with respect to the covariates. The major advantage of subclassification is that the treatment effect can be adjusted by restructuring subclass weights based on the covariate distribution, without assumptions about response surface modeling. Propensity score adjustment by subclassification appears frequently in clinical trials (e.g., Rubin and Rosenbaum, 1984; Hoffer et al., 1985; Lavori and Keller, 1988; Cook and Goldman, 1989; Czajka et al., 1992; Stone et al., 1995; Lieberman et al., 1996; Rubin, 1997; Benjamin, 2001). Its popularity is not surprising, considering (1) that subclassification is easier to operationalize than matching, (2) that the control group need not be larger than the treatment group, and (3) that subclassification uses all study subjects, unlike matching, where unmatched control units are discarded.

Returning to the initial demonstration of (5.2), suppose that a univariate $x$ is available in the data. To use subclassification, all units from both conditions first need to be sorted by $x$. Let the boundaries of $x$ be $x_{c-1}$ and $x_c$ for the $c$th subclass, and let the sample means of the study outcome in the $c$th subclass for the two conditions be $\bar{t}_{1c}$ and $\bar{t}_{0c}$. The expected outcomes for the experimental and control groups are

$E_M(\bar{t}_{1c}) = \mu_1 + \bar{u}_{1c}$, $\quad E_M(\bar{t}_{0c}) = \mu_0 + \bar{u}_{0c}$,   (5.12)

where

$\bar{u}_{1c} = \int_{x_{c-1}}^{x_c} u(x)\,\phi_1(x)\,dx \Big/ \int_{x_{c-1}}^{x_c} \phi_1(x)\,dx$, $\quad \bar{u}_{0c} = \int_{x_{c-1}}^{x_c} u(x)\,\phi_0(x)\,dx \Big/ \int_{x_{c-1}}^{x_c} \phi_0(x)\,dx$,

$u(x)$ is a function of the covariate, $\phi_1(x)$ and $\phi_0(x)$ are the frequency functions of $x$ in the two groups, and $c = 1, \ldots, C$ indexes the subclasses. It is clear from (5.12) that the initial bias of the average treatment effect, $\bar{u}_1 - \bar{u}_0$, is the cumulative difference in the covariate means over the subclasses. All units within one subclass are comparable with respect to the covariates included in the propensity score model. By allocating appropriate weights, the overall treatment effect can be adjusted.
The treatment effect is the weighted mean of the differences between the experimental and control group units within subclasses. The weight for each subclass is derived, for example, from the proportion of the subclass in the experimental or control group. After adjusting for the distributional differences in $x$, the remaining bias is

$\sum_{c=1}^{C} w_c (\bar{u}_{1c} - \bar{u}_{0c})$,   (5.13)

where $w_c$ is the weight assigned to the $c$th subclass. The proportion of the initial bias removed by the adjustment is, therefore,

$100\left\{ 1 - \dfrac{\sum_{c=1}^{C} w_c (\bar{u}_{1c} - \bar{u}_{0c})}{\bar{u}_1 - \bar{u}_0} \right\}$.   (5.14)

Five implications may be drawn from (5.12) and (5.14): the bias reduction achieved by subclassification depends on (1) the function of the covariate, $u(x)$; (2) the shape of the frequency functions, $\phi_0(x)$ and $\phi_1(x)$; (3) the number of subclasses, $C$; (4) the division points, $x_c$; and (5) the choice of weights.

In practice, more than one covariate is likely to be used for reducing bias. Subclassification on multiple collateral variables is not easy to carry out, because the number of subclasses increases exponentially with the number of covariates and/or their categories, which may leave many subclasses with zero observations. Using propensity scores instead of multiple covariates then becomes a sensible choice, as they approximately represent all covariates included in the model. The same procedure used for subclassification on a univariate $x$ holds for subclassification on estimated propensity scores (see Rosenbaum and Rubin, 1984, for the full illustration). It is possible to create numerous subclasses so that the units in each subclass have almost identical propensity scores. However, Cochran (1968) found that five subclasses are often sufficient to remove over 90% of the bias and that more than five subclasses add little further reduction. The norm in the existing literature is to adopt five subclasses, specifically quintiles of the propensity scores (e.g., Rosenbaum and Rubin, 1984; Terhanian et al., 2000a), which is reasonable since one would want subclasses of approximately equal size. Each subclass needs at least one unit from both conditions to meet the assumption of assignment ignorability, and there should be enough observations from both conditions within each subclass to derive less volatile weights.

Propensity score adjustment by subclassification, as examined above, resembles post-stratification. According to Kish (1965, p. 92), the initial error in any sample survey estimate is

$\sum_{h=1}^{H} w_h \bar{y}_h - \sum_{h=1}^{H} W_h \bar{Y}_h$,

where there are $H$ strata in the population, $\bar{y}_h$ and $\bar{Y}_h$ are the sample estimate and the population quantity for the $h$th stratum, and $w_h$ and $W_h$ are the proportions of the $h$th stratum in the sample and the population. What post-stratification aims to achieve is a $w_h$ close to $W_h$, i.e., $w_h \approx W_h$, so that the error becomes

$\sum_{h=1}^{H} W_h (\bar{y}_h - \bar{Y}_h)$.   (5.15)

From (5.15), it is clear that the magnitude of the error depends not only on the choice of the weight $w_h$ but also on the difference between the sample estimate and the population quantity. When the sample selection is random, $(\bar{y}_h - \bar{Y}_h)$ is free from bias; when it is not, much bias is retained (Kish, 1965). That is, even if $w_h = W_h$, the error can be sizable, because $\bar{y}_h$ is not guaranteed to approximate $\bar{Y}_h$.
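To fix ideas before relating this to propensity score subclassification, the sketch below computes the subclassified estimate with quintile classes and size-based weights $w_c$, in the spirit of (5.13); it is an illustrative sketch only, and all object names are assumptions.

    # Sketch: subclassification estimate of a treatment effect on propensity
    # score quintiles. 'ps' holds estimated propensity scores, 'g' the group
    # indicator (1 = treatment), 't' the outcome; every subclass is assumed
    # to contain units from both conditions, as the text requires.
    subclass.effect <- function(t, g, ps, C = 5) {
      cls <- cut(ps, quantile(ps, seq(0, 1, length.out = C + 1)),
                 include.lowest = TRUE)
      w   <- table(cls) / length(ps)          # subclass weights w_c
      d   <- tapply(seq_along(t), cls, function(idx)
               mean(t[idx][g[idx] == 1]) - mean(t[idx][g[idx] == 0]))
      sum(w * d)                              # weighted mean over subclasses
    }

Other choices of $w_c$, such as the treatment-group proportions, plug into the same structure by replacing the line that computes w.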
Relating this to propensity score subclassification, we are trying to derive weights $w_c$ in (5.13) that minimize the quantity $\sum_{c=1}^{C} w_c (\bar{u}_{1c} - \bar{u}_{0c})$. However, if the covariate distribution entering the function $u(\mathbf{x})$ differs between the treatment and control groups within subclasses, we may not expect a substantial reduction in the overall bias. The only difference from post-stratification is that the weights are calculated by adopting explicit models; therefore, subclassification based on propensity scores may be regarded as model-based post-stratification. The former is more efficient than the latter, as it incorporates multi-dimensional covariates without convergence concerns and allows explicit modeling for the adjustment. In sample surveys, classes constructed from propensity scores may be included in post-survey adjustments such as calibration. The requirement in this case is that population or reference data must be available and, unlike the marginal or cell counts used in traditional adjustments, must contain all variables included in the propensity model. While this may serve as an alternative, its application may be limited by those requirements on the reference data.

5.4.3 Covariance/Regression Adjustment by Propensity Scores

The bias in the treatment effect may also be reduced by regression adjustment using propensity scores. Under this method, the expected values of the responses are modeled as

$E_M(\bar{t}_1) = \mu_1 + \beta_1 \bar{x}_1$, $\quad E_M(\bar{t}_0) = \mu_0 + \beta_0 \bar{x}_0$,   (5.16)

and the expected treatment effect is

$E(\bar{r}_1 - \bar{r}_0) = \mu_1 - \mu_0 + \beta_1 \bar{x}_1 - \beta_0 \bar{x}_0 = \tau + \beta(\bar{x}_1 - \bar{x}_0)$,   (5.17)

when the response surfaces in the two conditions are parallel ($\beta_1 = \beta_0 = \beta$). The bias in the treatment effect is $\beta(\bar{x}_1 - \bar{x}_0)$, which is removed when $\bar{x}_1 = \bar{x}_0$. When multiple covariates are used, propensity scores provide a convenient alternative: one needs only to fit the regression of the responses on the propensity scores in the treatment and control groups and use that regression for treatment effect estimation. If propensity scores are used in (5.16) instead of $x$, the expected treatment effect (5.17) becomes bias-free given a propensity score, since

$E\{\bar{r}_1 - \bar{r}_0 \mid e(\mathbf{x})\} = \tau + \beta\{e(\mathbf{x}) - e(\mathbf{x})\} = \tau$,   (5.18)

where, again, the response surfaces of the two groups are parallel, i.e., $\beta_1 = \beta_0 = \beta$.

What is the difference between removing bias using propensity scores in the regression and performing regression adjustment directly on the responses using all covariates? Rosenbaum and Rubin (1984) illustrated that the "point estimate of treatment effect from an analysis of covariance adjustment for $\mathbf{x}$ is equal to the estimate obtained from a univariate covariance adjustment for the sample linear discriminant based on $\mathbf{x}$, whenever the same sample covariance matrix is used for both the covariance adjustment and the discriminant analysis." D'Agostino (1998) noted that the propensity adjustment is more convenient: fitting complicated propensity score models is not as difficult as fitting complicated response surface models, because the goal of propensity score modeling is to obtain good estimates of the probability of receiving a certain treatment, not to obtain parsimony.
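The following sketch shows the covariance adjustment of (5.16)-(5.18) with the estimated propensity score as the single regressor. It assumes parallel response surfaces, and the variable names (x1, x2, g, t) are illustrative, not from this study's code.

    # Sketch: covariance (regression) adjustment on the estimated propensity
    # score. A common slope for the two groups is assumed, so the coefficient
    # of g estimates the treatment effect tau.
    ps.hat <- fitted(glm(g ~ x1 + x2, family = binomial))  # e-hat(x)
    fit    <- lm(t ~ g + ps.hat)                           # common slope on e-hat
    coef(fit)["g"]                                         # adjusted estimate of tau

Allowing separate slopes per group (an interaction g:ps.hat) is a direct check of the parallel-surfaces restriction discussed next.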
The covariance adjustment using propensity scores is not as widely applied in the literature as subclassification or matching, for two reasons. First, there is a restriction on the response surfaces: as in (5.17) and (5.18), the response surfaces under the two group assignments should be parallel, which may be difficult to verify. Second, there are many cases where a regression adjustment performs poorly and increases bias. When the linear discriminant of the response surfaces is not a monotone function of the propensity score (i.e., the covariance matrices in the experimental and control groups are unequal), the covariance adjustment may seriously increase the expected squared bias, because it implicitly adjusts for a poor approximation to the propensity score (Rubin, 1979). For nonlinear response surfaces, univariate covariance adjustment can either increase the bias or overcorrect it when the variances of $x$ in the two conditions differ (Rubin, 1973). Therefore, unless these requirements are well met and the linear discriminant is highly correlated with the propensity score, matching or subclassification may serve bias reduction better.

The theoretical underpinnings of propensity score adjustment and its application methods have been examined in this chapter. Propensity score adjustment may serve as a potential post-hoc adjustment method for bias reduction when the sample selection mechanism is not guaranteed to be random. In order to utilize propensity score adjustment legitimately, the five assumptions examined previously need to hold: strong ignorability, no contamination among study units, nonzero probability of being assigned to both experimental and control conditions, the observed covariates' representativeness of the unobserved covariates, and no effect of the assigned treatment on the covariates. It should be noted that propensity score adjustment achieves covariate balance only on average over repeated studies; this implies that not every individual study using propensity score adjustment necessarily achieves balance.

Chapter 6: Alternative Adjustments for Volunteer Panel Web Survey Data

The focus in this and later chapters is on the estimation of population means and totals. To do this, we will determine alternative sets of weights, $\{w_i\}_{i=1}^{n}$, that (1) adjust for imbalance in the distribution of covariates between the Web survey sample and a reference survey data set and (2) use auxiliary data to produce weights that are properly scaled for estimating totals in addition to means. The first purpose is served by adjustment subclasses formed on the basis of propensity scores, as described in Sections 6.1 and 6.2. Auxiliary or covariate data are then used to further adjust the weights in calibration estimation, discussed in Section 6.3. Both the propensity score and calibration adjustments are mainly intended to reduce biases caused by nonrandom sample selection and deficient coverage in Web surveys.

6.1 Problems in Volunteer Panel Web Surveys

Volunteer panel Web surveys are conducted among a set of people who have Web access and self-select to join the panel. The overall survey protocol is described in Section 2.1 and depicted in Figure 6.1. As the colors in that figure suggest, the people at each step are not guaranteed to resemble one another, so the relationships between steps are not necessarily known. The greatest threat to a Web survey is the uncertain and incomplete coverage of the frame, because one must have Web access and voluntarily join the panel in order to be eligible for survey participation. Unless the population of interest is the volunteer panel itself, the protocol in Figure 6.1 does not allow construction of frames with known coverage of the population of interest.
This coverage problem leads to the next one: it is impossible to draw samples from the full population with known probabilities and to assign selection weights to the sample units in the ways normally done in sample surveys. Moreover, the poor response rates of this type of Web survey leave additional room for survey errors.

Figure 6.1. Volunteer Panel Web Survey Protocol

Estimates from this type of Web survey may suffer from a combination of noncoverage, nonprobability sampling, and nonresponse. Chapter 4 examined whether some of these errors can be corrected by traditional adjustment methods; the findings indicated the limitations of the traditional methods and the need for more innovative adjustments. Integrating the causal inference view of Chapter 5, these problems may be summarized in one simple term: selection bias. The respondents' self-selection from one step to the next in Figure 6.1 is not guaranteed to be random, which biases the survey estimates. As in Chapter 5, propensity score adjustment may be adopted as a post-hoc remedy to diminish the bias; in this case, we model the propensity of being in the responding Web sample.

Ideally, propensity score adjustment to correct initial selection bias is not necessary in survey data analysis, as most surveys rely on randomized sample selection. Samples are assumed to represent the characteristics of the desired population, and, in theory, survey estimates are expected to be design unbiased or consistent estimates of the population quantities. On the other hand, propensity score adjustment is not novel in survey statistics, especially in post-survey adjustment: it has been used to derive adjustment weights for reducing biases in survey estimates arising from coverage problems (e.g., Duncan and Stasny, 2001), late response (e.g., Czajka et al., 1992) and nonresponse (e.g., Smith et al., 2000; Vartivarian and Little, 2003).

Focusing on volunteer panel Web surveys, this chapter proposes a two-stage adjustment method built around the survey protocol in Figure 6.1. The first-stage adjustment is examined in Section 6.2, which provides a detailed mathematical presentation of how to use propensity score adjustment for volunteer panel Web surveys. The adjustment requires a reference survey conducted parallel to the Web survey (Terhanian and Bremer, 2000). The reference survey is required to have more desirable coverage and sampling properties and higher response rates than the Web survey; for instance, it may be conducted using traditional survey modes, such as the random digit dialing telephone method in Harris Interactive's case. As shown in Figure 6.2, the reference survey data serve as a source of benchmarks for the first-stage adjustment. This benchmarking is carried out via propensity score adjustment, as it balances the covariate distribution between the Web and reference survey samples; the reference survey needs to collect only the covariate information required to compute propensity scores. Through this method, described in detail in Section 6.2, it is hoped to use the strength of the reference survey to reduce biases in the Web survey estimates. It should be noted, however, that employing the reference survey implicitly disregards the dissimilar measurement properties due to the mode difference between the Web and reference surveys.

Figure 6.2. Proposed Adjustment Procedure for Volunteer Panel Web Surveys
Section 6.3 will introduce calibration adjustment as the second-stage adjustment. The remaining disparities in covariates between the population and the propensity-score-adjusted Web sample are expected to be tuned by adding another layer of weights through calibration. Section 6.4 will summarize the combination of the propensity score and calibration adjustments and provide a theoretical illustration of how the bias properties are modified through the course of the adjustments.

6.2 Adjustment to the Reference Survey Sample: Propensity Score Adjustment

Among the three application methods of propensity score adjustment examined in Chapter 5, subclassification is the most applicable and practical for Web survey situations. Although pair matching is a possibility when comparing treatment and control groups, how to apply it to estimate finite population quantities is unclear, as discussed in Section 5.4.1; therefore, we do not regard pair matching as feasible. As noted earlier, a large reservoir of control units is needed for pair matching in the analysis of quasi-experimental designs. If a larger reference survey were conducted in a traditional mode parallel to a Web survey only to acquire covariate information, it would be more logical to discard the Web survey and collect information on all variables in the reference survey; however, a large reference survey like the Current Population Survey, conducted by an established survey organization with high coverage and response rates, can be quite useful. Regression adjustment using propensity scores is possible, but the restrictions associated with building response models make this approach less appealing, since the requirements on the response surfaces examined in Section 5.4.3 are difficult to achieve. Therefore, the subsequent discussion of propensity score adjustment focuses on subclassification.

Suppose that there are two samples: a volunteer panel Web survey sample ($s^W$) with $n^W$ units, each with a base weight $d_j^W$, $j = 1, \ldots, n^W$; and a reference survey sample ($s^R$) with $n^R$ units, each with a base weight $d_k^R$, $k = 1, \ldots, n^R$. Note that the Web sample's base weights will not be inverses of selection probabilities, since the volunteers are not obtained by probability sampling. First, the two samples are combined into one, $s = s^W \cup s^R$, with $n = n^W + n^R$ units. We calculate the propensity score from the combined sample, $s$. The propensity score of the $i$th unit is the likelihood of the unit participating in the volunteer panel Web survey rather than the reference survey,

$e(\mathbf{x}_i) = \Pr(i \in s^W \mid \mathbf{x}_i)$, $\quad i = 1, \ldots, n$.

The propensity score is estimated by a logistic regression as in (5.11), using covariates collected in both the Web and reference surveys ($\mathbf{x}_{obs}$); if all relevant covariates are included in both surveys, then $\mathbf{x}_{obs,i} = \mathbf{x}_i$ for each unit $i$. A critical assumption here is that the combined sample can legitimately be used to estimate the probability of being in the volunteer panel: given a set of covariate values, a person must have some nonzero probability of being in the Web survey or not, and that probability must be estimable from the combined sample, $s$. Based on the predicted propensity score, $\hat{e}(\mathbf{x})$, the distribution of the Web sample units is rearranged so that $s^W$ resembles $s^R$ in terms of the $\mathbf{x}_{obs}$ included in the propensity model.
Mechanically, this is done by first sorting the combined data ($s$) by the predicted propensity score of each unit and partitioning $s$ into $C$ subclasses, each with about the same number of units. Based on Cochran (1968), the conventional choice in practice is five subclasses based on quintile points. Ideally, all units in a given subclass will have about the same propensity score or, at least, the range of scores within each class will be fairly narrow, so that (5.8) and (5.9) apply approximately. In the $c$th subclass of the merged data, denoted $s_c$, there are $n_c = n_c^W + n_c^R$ units, where $n_c^W$ is the number of units from the Web survey data and $n_c^R$ the number from the reference survey. The total number of units in the merged data is unchanged, because $\sum_{c=1}^{C} n_c = \sum_{c=1}^{C} (n_c^W + n_c^R) = n$.

Second, we compute the following adjustment factor for all units in $s_c^W$, the $c$th subclass of the Web survey data:

$f_c = \dfrac{\sum_{k \in s_c^R} d_k^R \Big/ \sum_{k \in s^R} d_k^R}{\sum_{j \in s_c^W} d_j^W \Big/ \sum_{j \in s^W} d_j^W}$,   (6.1)

where $s_c^R$ and $s_c^W$ are the sets of reference sample and Web sample units in the $c$th subclass. Writing the estimated population counts as $\hat{N}_c^R = \sum_{k \in s_c^R} d_k^R$, $\hat{N}^R = \sum_{k \in s^R} d_k^R$, $\hat{N}_c^W = \sum_{j \in s_c^W} d_j^W$ and $\hat{N}^W = \sum_{j \in s^W} d_j^W$, (6.1) can be rewritten as

$f_c = \dfrac{\hat{N}_c^R / \hat{N}^R}{\hat{N}_c^W / \hat{N}^W}$.

The adjusted weight for unit $j$ in class $c$ of the Web sample is

$d_j^{W.PSA} = f_c\, d_j^W = \dfrac{\hat{N}_c^R / \hat{N}^R}{\hat{N}_c^W / \hat{N}^W}\, d_j^W$.   (6.2)

When the base weights are equal for all units or are not available, one may use the alternative adjustment factor

$f_c = \dfrac{n_c^R / n^R}{n_c^W / n^W}$.   (6.3)

The adjustment using (6.3) does not allow population totals to be estimated, since the weights are not appropriately scaled, unless the population sizes for both the reference survey and the Web survey are known. From (6.3) with equal base weights, we can see that $\sum_{j \in s_c^W} f_c = f_c\, n_c^W = n^W n_c^R / n^R$.

The weights from (6.1) or (6.3) make the distribution of the Web survey sample match that of the reference survey sample in terms of propensity scores. For example, the estimated number of units in class $c$ from the Web sample using the adjusted weights is

$\hat{N}_c^{W.PSA} = \sum_{j \in s_c^W} d_j^{W.PSA} = \dfrac{\hat{N}_c^R}{\hat{N}^R}\, \hat{N}^W$.

In words, the estimated number of units from the Web survey, $\hat{N}^W$, is distributed among the classes according to the distribution in the reference survey, $\hat{N}_c^R / \hat{N}^R$. The estimator of the mean of a study variable, $y$, for the Web survey sample ($s^W$) becomes

$\bar{y}^{W.PSA} = \dfrac{\sum_c \sum_{j \in s_c^W} d_j^{W.PSA} y_j}{\sum_c \sum_{j \in s_c^W} d_j^{W.PSA}}$.

Note that the reference sample units are not used in deriving $\bar{y}^{W.PSA}$ once the adjustment weights $d_j^{W.PSA}$ are assigned; therefore, the reference sample is required to contain only the covariate data, not necessarily the variables of interest. The algorithm for propensity score adjustment is implemented in the function psa.fcn in R (Venables et al., 2003), shown in Appendix 1.1 (part of the code comes from Obenchain, 1999).
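A compact sketch of the subclassification weighting in (6.1)-(6.2) is given below. It parallels what psa.fcn in Appendix 1.1 does, but it is only an illustrative reimplementation; the function and argument names are assumptions.

    # Sketch of the two-sample propensity score adjustment in (6.1)-(6.2).
    # y.w, x.w, d.w: study variable, covariate data frame, base weights (Web);
    # x.r, d.r: covariate data frame and base weights (reference). C = 5.
    psa.weights <- function(y.w, x.w, d.w, x.r, d.r, C = 5) {
      g  <- c(rep(1, nrow(x.w)), rep(0, nrow(x.r)))        # 1 = Web sample
      x  <- rbind(x.w, x.r)
      ps <- fitted(glm(g ~ ., data = data.frame(x), family = binomial))
      cls <- cut(ps, quantile(ps, seq(0, 1, length.out = C + 1)),
                 include.lowest = TRUE)                     # quintile subclasses
      cls.w <- cls[g == 1]; cls.r <- cls[g == 0]
      # f_c = (Nhat_c^R / Nhat^R) / (Nhat_c^W / Nhat^W), as in (6.1)
      f <- (tapply(d.r, cls.r, sum) / sum(d.r)) /
           (tapply(d.w, cls.w, sum) / sum(d.w))
      d.psa <- d.w * f[cls.w]                               # weights in (6.2)
      sum(d.psa * y.w) / sum(d.psa)                         # adjusted mean
    }

If a subclass contains no reference units, f has a missing entry and the weights in (6.2) cannot be derived; the same situation arises in the simulations of Chapter 7.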
The set of covariates typically includes demographic variables similar to those used in post-stratification. Harris Interactive includes both demographic and nondemographic variables in its propensity models (e.g., Terhanian et al., 2000; Taylor et al., 2001). The importance of covariates in propensity score adjustment should be understood in relation to the substantive study variable, $y$, and the group assignment variable, $g$ (Rosenbaum and Rubin, 1984). How important it is to include nondemographic variables in propensity score adjustment for Web surveys is unclear, for two reasons: (1) the inclusion of more variables automatically increases the predictive power of the model, and (2) the nondemographic (e.g., attitudinal) covariates can often be explained by demographic variables.

6.3 Adjustment to the Target Population: Calibration Adjustment

The second-stage adjustment makes the adjusted Web survey sample resemble the target population. More specifically, this section examines calibration using the generalized regression estimator (GREG) of Deville and Särndal (1992) and Deville et al. (1993) as the method of deriving the second-stage weights. Suppose that an initial set of weights, $\{w_j^0\}_{j \in s^W}$, is available and that the population total, $t_y$, of the variable of interest, $y$, is estimated as

$\hat{t}_y^W = \sum_{j \in s^W} w_j^0 y_j$.

By calibration with the GREG, we modify the initial weights $\{w_j^0\}_{j \in s^W}$ to find a new set of calibration weights, $\{\tilde{w}_j\}_{j \in s^W}$, that produces

$\tilde{t}_y^W = \sum_{j \in s^W} \tilde{w}_j y_j$,

while respecting

$\tilde{\mathbf{t}}_\mathbf{z} = \mathbf{t}_\mathbf{z}$,   (6.4)

where

$\mathbf{z}_j = (z_{j1}, z_{j2}, \ldots, z_{jp}, \ldots, z_{jP})'$   (6.5)

is the vector of values of $P$ auxiliary variables for unit $j$ in the Web survey;

$\mathbf{t}_\mathbf{z} = (t_{z_1}, t_{z_2}, \ldots, t_{z_p}, \ldots, t_{z_P})' = \sum_{i \in U} \mathbf{z}_i$

is the set of population marginal totals of all $P$ covariates over $U = \{1, 2, \ldots, N\}$; and

$\tilde{\mathbf{t}}_\mathbf{z} = \sum_{j \in s^W} \tilde{w}_j \mathbf{z}_j$

estimates $\mathbf{t}_\mathbf{z}$ using the calibration weights $\tilde{w}_j$. If the population total for the $p$th auxiliary variable is known and fixed, such as the number of males in the U.S., then the total for that variable is $t_{z_p} = T_{z_p} = \sum_{i=1}^{N} z_{ip}$. When the population totals are not readily available, such as the number of persons with some disability, $t_{z_p}$ may be replaced by an estimate from a larger independent survey in which the estimate is more reliable than in the survey at hand (Deville et al., 1993, p. 1015). The initial weights $w_j^0$, in our case, will be the $d_j^{W.PSA}$ ($j \in s^W$) from (6.2) or (6.3) when propensity score adjustment is applied first, or sample design weights, of which the simplest form may be $w_j^0 = N^W / n^W$, when no adjustment is applied beforehand.

The GREG algorithm minimizes a measure of distance between $w_j^0$ and $\tilde{w}_j$,

$\sum_{j \in s^W} G^*(\tilde{w}_j, w_j^0)$,   (6.6)

subject to the constraint (6.4), where $G^*$ is a distance function associated with generalized least squares (GLS), such that

$G^*(\tilde{w}_j, w_j^0) = \dfrac{(\tilde{w}_j - w_j^0)^2}{2 w_j^0}$.   (6.7)

We seek the $\{\tilde{w}_j\}$ that minimize (6.6) with (6.7), while respecting (6.4). This operation is equivalent to minimizing the quantity

$\sum_{j \in s^W} G^*(\tilde{w}_j, w_j^0) - \boldsymbol{\lambda}'(\tilde{\mathbf{t}}_\mathbf{z} - \mathbf{t}_\mathbf{z})$,

where $\boldsymbol{\lambda} = (\lambda_1, \lambda_2, \ldots, \lambda_p, \ldots, \lambda_P)'$ is a $P$-vector of Lagrange multipliers. The minimization leads to the calibration weights $\tilde{w}_j = w_j^0 F(\mathbf{z}_j'\boldsymbol{\lambda})$, where $F$ is the inverse function of $g^*(\tilde{w}_j, w_j^0) = dG^*(\tilde{w}_j, w_j^0)/d\tilde{w}_j$; for the GLS distance function in (6.7), $F(u) = g^{*-1}(u) = 1 + u$. In order to compute $\tilde{w}_j$, $\boldsymbol{\lambda}$ must be determined by solving the calibration equation

$\sum_{j \in s^W} \tilde{w}_j \mathbf{z}_j = \sum_{j \in s^W} w_j^0 F(\mathbf{z}_j'\boldsymbol{\lambda})\, \mathbf{z}_j = \mathbf{t}_\mathbf{z}$,   (6.8)

in which the vector $\boldsymbol{\lambda}$ is the only unknown. Following Deville and Särndal (1992), we rearrange (6.8) and define

$\boldsymbol{\phi}_s(\boldsymbol{\lambda}) = \sum_{j \in s^W} w_j^0 \{F(\mathbf{z}_j'\boldsymbol{\lambda}) - 1\}\, \mathbf{z}_j = \mathbf{t}_\mathbf{z} - \hat{\mathbf{t}}_\mathbf{z}^W$.   (6.9)

With an iterative procedure such as Newton's method, $\boldsymbol{\lambda}$ is solved for as follows. First, expand $\boldsymbol{\phi}_s(\boldsymbol{\lambda}_{t+1})$ around $\boldsymbol{\lambda}_t$, where $\boldsymbol{\lambda}_t$
is the value at the $t$th iteration and $\boldsymbol{\lambda}_{t+1}$ the value at the $(t+1)$st, such that

$\boldsymbol{\phi}_s(\boldsymbol{\lambda}_{t+1}) = \boldsymbol{\phi}_s(\boldsymbol{\lambda}_t) + \boldsymbol{\phi}_s'(\boldsymbol{\lambda}_t)(\boldsymbol{\lambda}_{t+1} - \boldsymbol{\lambda}_t)$,   (6.10)

where

$\boldsymbol{\phi}_s'(\boldsymbol{\lambda}_t) = \dfrac{d\boldsymbol{\phi}_s(\boldsymbol{\lambda})}{d\boldsymbol{\lambda}}\Big|_{\boldsymbol{\lambda} = \boldsymbol{\lambda}_t} = \sum_{j \in s^W} w_j^0 F'(\mathbf{z}_j'\boldsymbol{\lambda}_t)\, \mathbf{z}_j \mathbf{z}_j'$   (6.11)

is the $P \times P$ matrix of partial derivatives and $F'(\mathbf{z}_j'\boldsymbol{\lambda})$ is the derivative of $F$ with respect to its argument $\mathbf{z}_j'\boldsymbol{\lambda}$. Using (6.9), (6.10) and (6.11), we obtain

$\boldsymbol{\lambda}_{t+1} = \boldsymbol{\lambda}_t + \left\{\boldsymbol{\phi}_s'(\boldsymbol{\lambda}_t)\right\}^{-1}\left\{(\mathbf{t}_\mathbf{z} - \hat{\mathbf{t}}_\mathbf{z}^W) - \boldsymbol{\phi}_s(\boldsymbol{\lambda}_t)\right\}$,

as in equation (3.5) of Deville and Särndal (1992). For the unrestricted GLS distance function, a closed-form solution for the Lagrange multiplier is available:

$\boldsymbol{\lambda} = \left(\sum_{j \in s^W} w_j^0\, \mathbf{z}_j \mathbf{z}_j'\right)^{-1} (\mathbf{t}_\mathbf{z} - \hat{\mathbf{t}}_\mathbf{z}^W)$.
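For the unrestricted GLS case, the closed form above translates directly into a few lines of code. The sketch below is illustrative only, with assumed object names: Z is the $n^W \times P$ covariate matrix, w0 the initial weights, and tz the vector of population totals.

    # Sketch: closed-form GLS (linear) calibration weights, F(u) = 1 + u.
    A      <- t(Z) %*% (w0 * Z)             # sum_j w0_j z_j z_j'
    lambda <- solve(A, tz - colSums(w0 * Z))
    w.cal  <- w0 * (1 + drop(Z %*% lambda))
    colSums(w.cal * Z)                       # reproduces tz, as required by (6.4)

Because nothing bounds 1 + z'lambda here, some w.cal entries can be negative or very large, which motivates the restricted distance function that follows.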
W y t , will generally be biased under either of these models. 6.4.1 Stratification Model Suppose that there is an underlying structural model M that produces ( ) ?= M ic Ey , where ? c iU; and c U is subclass c in the universe, U . Under this model, the expected value of the population total is () 1 .? = = ? C M ycc c Et N 100 Because this model uses subclasses formed based on the quintiles of ()e x , we interpret c N to be the count that would be obtained if the propensity score adjustment were applied to the entire population. The Web survey estimate without adjustment is () 1 ? = ? = ?? W c C WW yjj c js tdyand its expectation over the model is () () 1 1 ? ? . ? ? = ? = = = ?? ? W c C WW M ycj c js C W cc c Et d N (6.17) The bias in (6.17) with respect to M is () () 1 ? ? .? = ?= ? ? C WW My y c c c c Et t N N (6.18) Suppose that there is a mechanism ? that describes how persons voluntarily become part of the Web sample. In particular, suppose that 1, if unit in Web sample 0, otherwise ? ? = ? ? i i , and that () ? ? ?= W ii E . The ? may be difficult or impossible to model, although the propensity score modeling is an attempt to do this. If ( ) ? ? = W cc EN N, the model bias (6.18) averages to zero over the voluntary mechanism: () () 1 ? ? =0. ?? ? = ?? ?= ? ?? ?? ? C WW My y c c c c EE t t E N N Only under both the model M and the volunteering mechanism ? , the unadjusted Web sample estimate ? W y t becomes unbiased. Note that it is quite likely that 0? = W i for some persons, because they would never volunteer to participate in a Web survey. Generally, 101 () () ? ? ? ? = ? c WWW cii iU EN d . If 1 ?= WW ii d , then ( ) ? ? = W cc EN N, but if 0 W i ? = for any persons, this cannot hold. As a result, we expect ? W y t to be biased. By applying propensity score adjustment weights, we obtain a new Web survey estimate, () .. 1 ? = ? = ?? W c C WPSA WPSA yjj c js tdy, (6.19) where .WPSA j d is from (6.2). The M-expected value of this estimate is () () . 1 1 ?? ? ?? ? ? , ? ? ? = ? = = = ?? ? W c RR C WPSA Wc M yjc WW c js c W C R cc R c NN Et d NN N N N and its model bias is () . 1 ? ? ? . ? ? = ?? ?= ? ?? ?? ? WC WPSA R My y c c c R c N Et t N N N If the weights in the reference sample and the Web sample are scaled so that ?? = WR NN, and if the application of the ? distribution, appropriate to the reference sample, produces () ? ? = R cc EN N, . ? WPSA y t will be an unbiased estimator of the population total in the sense that () . ? 0 ? ?= WPSA My y EE t t . Therefore, the role of the reference survey sample is as important as the propensity score model that attempts to describe ? . Another approach to analyzing the propensity score adjustment estimator is to consider the correction factor, c f , described earlier to be a response propensity adjustment factor. That is, ( ) ( ) 11 .. ? ? ? == WPSA W WPSA jcjj fd d is the estimated propensity of 102 being in the Web sample. Since all ? W c js get the same weight adjustment factor, c f , we could use () 1 ? = ? W c cj jsc ee n x as an alternative to 1 c f , although this option is not pursued in this study. If . 1 WPSA j d can be interpreted as an inverse inclusion probability, then (6.19) becomes analogous to a Horvitz-Thompson estimator and is unbiased with respect to the volunteering mechanism because ( ) . ? ?? WPSA jj E . Consequently, the propensity score adjustment estimator is M-? unbiased, if ? R c N is a ? - unbiased estimator and is unbiased with respect to the volunteering mechanism, and if . 
1 WPSA j d is an inclusion probability. 6.4.2 Regression Model A more elaborate model would be one that accounts for covariates which are good predictors of y. To that end, suppose that there are covariates that affect the study variable in the following way: ( ) ?= M ii Ey ? z , (6.20) where ? c iU and {} 12 , ,..., ,..., ? = iii ipiP zz z zz similarly defined as in (6.5). Here, the model bias of an unadjusted Web sample estimate for the population total is 103 () () () () () () 11 11 ? ? ? , W c c W c c CC WW M yy Mjj Mi cciU js CC W jj i ccU js W W Et t Edy Ey d ==? ? ==? ? ?= ? ?? ?? ?? ??=? ?? ?? ?? ??=? ?=? ?? ?? ?? ?? zz zz ? z ? z ? t ? t ? tt (6.21) where 1=? = ?? c C i ciU z tz and () 1 ? W c C WW jj c js d = ? = ??z tz. If volunteering, i.e., the ? mechanism, satisfies () ? 0 ? ?= W E zz tt , ? W y t becomes M-? unbiased. However, as noted in the previous section, this assumption is unrealistic. Consequently, we can expect the unadjusted estimator, ? W y t , to be biased. The combination of calibration adjustment using GREG without the weight constraints and propensity score adjustment produces the following estimator from (6.14): ( ) .. . ? ? ? ? =+ ?  WCal WPSA WPSA yy ws tt zz Btt . (6.22) Based on the model (6.20), its model expectation is ()( ) ( ) ( ) .. . ? ? ? ? =+ ?  W Cal W PSA W PSA My My M ws Et Et E zz Btt . (6.23) The expectation of this regression coefficient is ( ) ( ) ( ) () 1 1 ? , ? ? ? = ? = = WWW W Mws s ssMs WWWW ssss EEBAZ Y AZWZ? ? (6.24) i.e., ? ws B is M-unbiased of ? . Using (6.24), (6.23) becomes 104 ()() ( ) ( ) () () () .. . .. .. ? ? ? ? ?? . W c W Cal W PSA W PSA My My M ws WPSA WPSA jj js WPSA WPSA My Et Et E d Et ? ? =+ ? ?? ???=+? ?= = ? zz zz zzz z Btt ? z ? tt ? t ? t ? t ? t  (6.25) Important facts from above are (1) that (6.25) holds even if . ? WPSA z t has a ? -bias, because they cancel out each other in the M expectation but (2) that z t does have to be correct. If z t contains estimates from some other survey, the model bias will have the form () * ,, ? ? sub sub subzz ? tt, where the subscript ?sub? denotes the part of the z -vector whose totals come from that survey, and * ,subz t is the vector of covariate estimates from that survey. If ()* * ,, 0 ? ?= sub sub E zz tt , where, in this case, * ? E is the expectation over the selection mechanism for the other survey, then .  WCal y t is M- * ? unbiased. Therefore, the calibration adjustment will produce M-unbiased estimates (or possibly, M- * ? unbiased estimates), when the model (6.20) holds. In a case where the propensity score adjustment successfully adjusts for the probability of being in the Web survey sample, under the assumption of () . . 1 ? ??== WPSA WPSA jjj Ed , we obtain ( ) . ? ? = WPSA yy Et t, () . ? ? = WPSA E zz tt, and () ( ) 1 ? ? ? ?? = ws N N N N E B B ZZ ZY, which is the finite population version of the regression slope (6.23). These three expectations lead to ( ) . ?   WCal yy E tt, when the population total z t is used in deriving .  WCal y t . Therefore, (1) if the propensity score adjustment correctly 105 accounts for the volunteering mechanism ? , .  WCal y t is ? -unbiased, and (2) if the model M is correct, .  WCal y t is M-unbiased. If unbiased estimates from another survey are used, then the calibration estimator will be M- * ? unbiased. Suppose that the propensity score adjustment does not fully adjust for ? -bias. Then () . ? ? =+ WPSA yyy Et t b and ( ) . ? ? 
= + WPSA E zzz ttb, where y b is the bias which can take a positive or negative direction and ( ) 1 ,... ,...,= pP zz z bb b z b whose components can also be positive or negative. The ? expectation of the calibration adjusted estimate is then () ()() ( ) ( ) . ? ? . ?? ? ? ++ ? + ? =+?   WCal yyy ws yy ws Et t b E tbE zzz z Bt tb Bb (6.26) Expression (6.26) is not equal to y t , unless ( ) ? ? ? = yws bE z Bb, which is not true in general. When propensity score adjustment is not correct, .  WCal y t will not generally be ? -unbiased, meaning that the estimate is not unbiased with respect to the volunteering mechanism. However, it can be model unbiased as long as i y follows a linear model M in (6.20) which we specify correctly. 106 Chapter 7: Application of the Alternative Adjustments for Volunteer Panel Web Surveys 7.1 Introduction This chapter will document the performance of the proposed adjustment in Chapter 6 for volunteer panel Web surveys. The role of the adjustment is to decrease the bias occurring from the possibly nonrandom mechanism in the selection of panel Web survey respondents. In order to examine the degree of bias reduction, it is necessary to apply the adjustment for more than one sample realization. A logical approach for this purpose is to adopt simulation studies that use pseudo-populations whose population values are known. This chapter will employ two survey data sets: the 2002 General Social Survey (GSS) and the 2003 Michigan Behavioral Risk Factor Surveillance System (BRFSS). Each of these will be used as a pseudo-population data set. The reasons for using these data are two-fold. First, one interesting feature of both surveys is that they contain an Internet supplement which provides information about whether respondents have Internet access or not. Since the volunteer panels in Web surveys are required to have their own Web access, information on Web access ownership becomes essential for constructing a pool of units potentially eligible to be included the Web surveys. The full sample of GSS and BRFSS themselves are capable of serving as populations as well as potential pools of reference survey sample units. Second, unlike existing research where the focus of adjustment is placed on polling and election outcomes, having two data sets will expand the scope of the examination to a wide range of different substantive study variables. 107 More specifically, GSS provides attitudinal information toward general social issues, and BRFSS gives factual information about health-related behaviors. Two case studies comprise this chapter, where one study utilizes GSS data and the other utilizes BRFSS data. While both case studies will examine the performance of propensity score adjustment and calibration adjustment as bias reduction techniques, the emphasis of each study will differ. Section 7.2 will present the first study using GSS data, where the focus will be on the effectiveness of adjustment. Propensity-score- adjusted Web survey sample estimates will be compared to the reference survey sample estimates, and calibration-adjusted estimates will be compared to the population values. The second case study is presented in Section 7.3. 
It will use BRFSS data and expand the examination to multiple dimensions: the impact of covariate selection both in propensity score adjustment and calibration adjustment, the effectiveness of combining calibration adjustment with propensity score adjustment, and the calculation of variance estimates when multiple adjustment weights are applied. 7.2 Case Study 1: Application of Propensity Score Adjustment and Calibration Adjustment to 2002 General Social Survey Data 7.2.1 Construction of Pseudo-population and Sample Selection for Simulation In order to assess the performance of bias reduction as described above, three different data sets are required: a population, a reference survey and a Web survey data set. Samples mimicking the respondents in the Harris Interactive volunteer panel Web survey will be drawn based on subclass proportions from a real Harris Interactive Web 108 survey data set (obtained via a personal communication with Matthias Schonlau, see Schonlau et al., 2004). The cells are formed by four demographic variables: age, gender, education and race. These proportions of Harris Interactive data are displayed in Table 7.1 along with the same cross-classified cell proportions of all respondents and respondents who use the Internet in the 2002 General Social Survey (GSS) data. Table 7.1. Distribution of Age, Gender, Education and Race of GSS Full Sample, GSS Web User and Harris Interactive Survey Respondents High School or Less Some College or Above White Nonwhite White Nonwhite A. GSS Full Sample (n=2,746) a ? 40 yrs Female 9.76% 6.61% 5.51% 1.79% Male 9.65% 4.18% 4.41% 1.37% 41 yrs + Female 16.75% 4.75% 8.39% 1.44% Male 13.14% 3.15% 7.75% 1.37% Sum 100.00% B. GSS Web Users (n=1,692) b ? 40 yrs Female 11.68% 6.08% 7.97% 2.62% Male 10.52% 3.22% 6.69% 2.01% 41 yrs + Female 11.50% 2.31% 11.01% 1.64% Male 9.49% 1.46% 10.16% 1.64% Sum 100.00% C. Harris Interactive Respondents (n=8,195) ? 40 yrs Female 2.03% 1.64% 13.28% 13.37% Male 0.85% 0.61% 7.58% 9.09% 41 yrs + Female 2.45% 0.48% 15.58% 4.58% Male 1.70% 0.24% 20.82% 5.71% Sum 100.00% a. This sample size reflects the exclusion of 19 cases where some of the four covariates is missing. b. This is the subset of Web users from the original 2002 GSS sample. The 2002 GSS is a part of on-going biennial survey conducted by National Opinion Research Center with core funding from the National Science Foundation. The data were gathered in order to measure contemporary American society targeting 109 noninstitutionalized adults 18 years old and older. A representative national sample was drawn using multi-stage area probability sampling. Respondents were surveyed in a 90- minute in-person interview. The reported response rate for the 2002 GSS is 70%. 11 The protocol for the Harris Interactive Web surveys was discussed in Section 2.1 and 6.1. From Table 7.1, we can examine the distributions of the 2002 GSS sample, its Web user subgroup and the Harris Interactive respondents. There is a noteworthy gap not only between the GSS sample and the two Web samples but, surprisingly, also between the two Web samples. The GSS full sample includes fewer young people and those with higher education than the two Web samples. The most notable disparity between the Harris Interactive data and the two GSS data is in the educational attainment level. While less than a half of the GSS and its Web users have some college or higher education, the same group of people makes up 90% of the Harris Interactive respondent data. 
Also, Harris Interactive respondents include more minorities, especially educated minorities, than the GSS samples. If a sample distributed like the Harris Interactive respondents is to provide unbiased estimates for the general population or even the population with Web access, some major weighting adjustment will be required. The creation of the full pseudo-population starts from the GSS data set (U) which contains 2,746 cases with complete information on four stratifying variables in Table 7.1 and the Web usage variable. 12 The propensity score adjustment is feasible when all cases in the merged data have information on covariates included in the propensity score models. Otherwise, propensity scores for the units where some of the covariates are 11 Information about the GSS is available at http://webapp.icpsr.umich.edu/GSS/ and http://norc.org/projects/gensoc3.asp. 12 19 units where the information on these five variables is missing are excluded from the original GSS data with 2,765 units. 110 missing cannot be computed, which hinders the adjustment. For this problem, missing values on the 14 covariates described in Table 7.2 that are used to build the propensity score models are imputed within the cell defined in Table 7.1 using hot-deck method. A larger population will facilitate testing of methods by simulation. By bootstrapping U with simple random sampling with replacement, the full pseudo-population ( F P ) is created with a size of 20,000 persons. As discussed earlier the 2002 GSS collected information about e-mail 13 and Internet usage. 14 Based on this information, people who are classified as Web users from F P are retained for the pseudo-Web population ( W P ), which results in the size of 12,306. 15 This pseudo-Web population will allow us to draw different types of Web samples, especially the one resembling Harris Interactive Web survey respondents, since Web usage is the prerequisite for the panel members in those surveys. Using the two pseudo-populations, a reference sample and two types of Web sample are drawn in each simulation. The reference survey sample ( R s ) is drawn from F P by simple random sampling for the size of 200 R n = using ref.sam function created in R (see Appendix 1.3). Since the 2002 GSS was conducted in the face-to-face mode, these reference samples will serve as face-to-face reference samples with known probabilities of selection. Two types of Web samples are drawn from W P by Poisson sampling with selection probabilities equal to cell proportions in Table 7.1.B and 7.1.C. For example, 13 Question wording: ?About how many minutes or hours per week do you spend sending and answering electronic mail or e-mail?? 14 Question wording: ?Other than e-mail, do you ever use the Internet or World Wide Web?? 15 The proportion of the Web users in the full pseudo-population is the same as that in the original GSS data at 61%. 111 for the first Web sample, White female with high school education or less who were 40 years old or less were selected with a probability of 0.1168. Thus, the two samples were allocated according to the covariate distributions from Table 7.1.B and 7.1.C, where each cell serves as a stratum. The first Web sample, .WST s , is assumed to resemble the pseudo- Web population (Table 7.1.B), and the second, .WHI s , the Harris Interactive respondents (Table 7.1.C). Both Web samples are drawn using pois.sam in Appendix 1.4 for the desired size of .. 800 WST WHI nn==. 
16 This procedure of selecting the three samples ( R s , .WST s and .WHI s ) is repeated 2,000 times. 7.2.2 Propensity Score Adjustment This study examines two variables: (1) blks y : the proportion of people indicating warm feelings towards Blacks; and (2) vote y : the proportion of people who voted in the 2000 presidential election. The estimates of blks y and vote y from the simulated Web samples, .WST s and .WHI s , are corrected by applying propensity score adjustment described in Section 6.2. There are 14 covariates used for adjusting blks y and 13 for vote y , where nine of each set of all covariates are demographic and the remainder are nondemographic characteristics. 17 As shown in Table 7.2, the significance of these covariates on blks y and vote y differs greatly. Some of the variables are continuous, while others are categorical with different numbers of categories. 16 The actual Web sample sizes vary around 800, as Poisson sampling is used. 17 The demographic/nondemographic nature of a given covariate is tentatively determined based on whether the variable is typically used in post-stratification or not. 112 Table 7.2. P-values of the Auxiliary Variables in Logit Models Predicting blks y (Warm Feelings towards Blacks) and vote y (Voting Participation in 2000 Presidential Election) a p-value Covariate Description Type blks y vote y Demographic age Age in years Continuous <.0001 <.0001 educ Education in years Continuous <.0001 <.0001 newsize Size of the residential area Continuous .2006 .1804 hhldsize Household size Continuous .8318 .3496 income Family income Continuous .4548 .0002 race Race 4 categories <.0001 .0002 gender Gender 2 categories <.0001 .1568 married Marital Status 2 categories .0616 .0280 region Region of the residential area 4 categories .0391 .2017 Nondemographic class Self-rated social class Continuous .1435 <.0001 work Employment status 2 categories .6502 .1680 party Political party affiliation 3 categories .2174 <.0001 religion Having a religion 2 categories .1197 .8480 ethnofit Opinion towards ethnic minorities Continuous <.0001 - a. These analyses were done using the original GSS sample (n=2,746) Based on the significance level (p-value) and the characteristics of the covariates (demographic or nondemographic) listed in Table 7.2, propensity score models are developed. The first model which serves as the base propensity model, D1, includes all demographic variables as main effects in a logistic regression as in (5.11) 18 , such that 12 3 4 5 67 8 9 Pr( 1) : ln 1Pr( 1) , g D1 age educ newsize hhldsize income g race gender married region ?? ? ? ? ? ?? ? ? ??= =+ + + + + ?? ?= ?? ++ + + where 1g = for Web sample units and 0g = for reference sample units. The subsequent models are shown in Table 7.3, and their detailed specifications in R ? are shown in 18 This study focuses on the relationship between the substantive study variables and the covariates than on the relationship between the treatment variables and the covariates in constructing propensity score models. 113 Appendix 2. The respective effectiveness of different models will be compared in the following section. This will allow us to detect the importance of including highly predictive and/or nondemographic covariates in the propensity model. Table 7.3. Propensity Score Models and Their Covariates by Variable a. Covariate Propensity Score Models D1 D2 D3 All (1) Significant (2) Nonsignificant (3) Demographic (D) blks y vote y blks y vote y blks y vote y age ? ? ? ? educ ? ? ? ? newsize ? ? ? ? 
Table 7.3. Propensity Score Models and Their Covariates by Variable^a

Demographic models (D):
  D1 (all):            y_blks and y_vote: age, educ, newsize, hhldsize, income, race, gender, married, region
  D2 (significant):    y_blks: age, educ, race, gender, region
                       y_vote: age, educ, income, race, married
  D3 (nonsignificant): y_blks: newsize, hhldsize, income, married
                       y_vote: newsize, hhldsize, gender, region
Nondemographic models (N):
  N1 (all):            y_blks: class, work, party, religion, ethnofit
                       y_vote: class, work, party, religion
  N2 (significant):    y_blks: ethnofit
                       y_vote: class, party
  N3 (nonsignificant): y_blks: class, work, party, religion
                       y_vote: work, religion
Demographic and nondemographic models (A):
  A1 = D1 + N1; A2 = D2 + N2; A3 = D3 + N3, for each study variable
a. Covariate membership follows the p-values in Table 7.2 at the .05 cut-point; ethnofit is not used for y_vote.
Note: Propensity model 4, not shown in the table, is the combination of D1 and N2.

The general steps for each simulation are:
(1) combine the reference sample (s_R) and the Web sample (s_W.ST or s_W.HI);
(2) estimate the propensity e(x_i) of being in the Web sample rather than the reference sample for the i-th person in the combined sample;
(3) divide the combined sample into five groups based on quintiles of the propensity scores; and
(4) compute the weight d_j^W.PSA defined in (6.2) for each person j in the Web sample.
(A minimal code sketch of these steps is given at the end of this subsection.)

7.2.3 Results

Propensity Score Adjustment

Reference and Web samples are drawn using ref.sam and pois.sam. The adjustment and estimation described above are carried out by the psa.fcn function. The propensity score adjustment function includes the adjustment weight in (6.2), since the reference sample units have an equal, known probability of selection while the Web sample units' selection probabilities are unknown. The simulation is repeated 2,000 times using the psa.sim function in Appendix 1.5, which includes all of the functions introduced previously.

Since the estimation benchmarks in this adjustment stage (propensity score adjustment) are the reference sample (s_R) estimates, population values are not included in the discussion. For convenience, however, we will refer to the difference between the average of a Web sample estimate and the mean of the reference sample estimates as a "bias."

Table 7.4. Simulation Means of Estimates by Different Samples before Adjustment

                                                       s_R      s_W.ST   s_W.HI
y_blks: proportion with warm feelings
        towards Blacks (M=2,000)                      0.612    0.636    0.675
y_vote: proportion of voters in the
        2000 election (M=1,971)^a                     0.650    0.715    0.817
a. In simulations for y_vote, 29 simulations were not completed because some subclasses defined by the propensity scores contained zero reference sample cases, making it impossible to derive the weights in (6.2).

Table 7.4 shows the respective unadjusted means of y_blks and y_vote from the three samples, s_R, s_W.ST, and s_W.HI, over all simulations. They are calculated as

  ȳ = (1/M) Σ_{m=1}^{M} y_m,   (7.1)

where M is the total number of simulated samples and y_m is an estimate from the m-th simulation. The Web estimates deviate from the reference sample estimates, indicating that people in the Web samples are more likely to express warm feelings towards Blacks and more likely to have participated in the election than people in the reference sample. This result seems plausible when considering the cell proportions in Table 7.1, which were used to create s_W.ST and s_W.HI: there is likely to be a higher proportion of people with higher levels of education and of minorities in the Web samples than in s_R. The biases are even larger between s_W.HI and s_R. For voting behavior, the estimate from s_W.HI is off from the reference sample estimate by 16.7 percentage points. Therefore, it becomes necessary to decrease the bias.
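The following sketch implements steps (1)-(4) above. It is an illustrative stand-in for psa.fcn, not the Appendix 1.5 code; the weight factor below follows the spirit of (6.2), whose exact form is given in Chapter 6, and all names are assumptions.

    psa.weights.sketch <- function(ref, web, prop.formula) {
      # step (1): merge the two samples (assumed to share a covariate set)
      comb <- rbind(transform(ref, g = 0), transform(web, g = 1))
      # step (2): estimate the propensity of being a Web sample unit
      e.hat <- fitted(glm(prop.formula, family = binomial, data = comb))
      # step (3): five subclasses from the quintiles of the propensity scores
      cls <- cut(e.hat, breaks = quantile(e.hat, 0:5 / 5), include.lowest = TRUE)
      # step (4): PSA factor f.c for class c, the ratio of the reference-sample
      # share to the Web-sample share of that class; the weight in (6.2)
      # multiplies this factor by the base design weight
      n.ref <- tapply(comb$g == 0, cls, sum)
      n.web <- tapply(comb$g == 1, cls, sum)
      f.c <- (n.ref / sum(n.ref)) / (n.web / sum(n.web))
      f.c[cls[comb$g == 1]]  # one PSA factor per Web sample unit
    }

Note that if a quintile class happens to contain no reference sample units, n.ref is zero there and the factor is undefined; this is precisely the failure noted for 29 of the y_vote simulations in Table 7.4.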
7.2.3.1 Performance of Propensity Score Adjustment

Correction for the deviations of the Web sample estimates is carried out by applying propensity score adjustment. First, the base propensity model (D1), which includes all demographic covariates, was applied. Table 7.5 compares unadjusted and D1-adjusted estimates in relation to the reference sample estimates. For example, the D1 propensity score adjusted mean (y.D1) for y_blks is 0.623 based on the s_W.ST samples, which is closer to the reference sample mean (y.R: 0.612) than the unadjusted mean (y.U: 0.636). By incorporating the adjustment weights, the Web estimates move closer to the reference sample values than the unadjusted estimates.

Throughout this section, the performance of propensity score adjustment is evaluated with respect to three criteria: bias and its reduction, root mean square deviation and its reduction, and standard error.

7.2.3.1.A Bias and Percent Bias Reduction

As discussed above, the "bias" measure of the Web survey estimates compared to the reference survey estimates takes the following form:

  bias(ȳ^W) = (1/M) Σ_{m=1}^{M} y_m^W - (1/M) Σ_{m=1}^{M} y_m^R,

where y_m^R and y_m^W are the reference and Web estimates from the m-th simulation, m = 1, ..., M. Additionally, the percent bias reduction (p.bias) is calculated using an adapted form of (5.14) as

  p.bias(ȳ^W.PSA) = 100 × [ bias(ȳ^W.U) - bias(ȳ^W.PSA) ] / bias(ȳ^W.U),   (7.2)

where ȳ^W.U is the simulation mean of the unadjusted Web estimate and ȳ^W.PSA is the simulation mean under propensity score adjustment (PSA is replaced by model names hereafter). The unadjusted estimates are expected to have larger bias than the adjusted ones. The larger the p.bias, the more effective the propensity score adjustment in reducing bias. A negative p.bias indicates that the adjustment actually makes the estimates worse.

7.2.3.1.B Root Mean Square Deviation and Percent Root Mean Square Deviation Reduction

The second evaluation criterion is the root mean square deviation (rmsd), which summarizes the deviation of the Web estimates from the reference estimates over all simulations. This statistic is calculated as

  rmsd(ȳ^W) = √[ (1/M) Σ_{m=1}^{M} (y_m^W - y_m^R)² ].

From this statistic, we may compare the rmsd's of the Web sample estimates derived from adjustments using different propensity models. Estimates with smaller rmsd may be considered less deviated from the reference estimates than others. Just like (7.2), the percent deviation reduction (p.rmsd) is computed in order to express the size of the rmsd of the adjusted estimates relative to the unadjusted estimates:

  p.rmsd(ȳ^W.PSA) = 100 × [ rmsd(ȳ^W.U) - rmsd(ȳ^W.PSA) ] / rmsd(ȳ^W.U).

This gives the reduction in deviation of the Web survey estimates achieved by propensity score adjustment.

7.2.3.1.C Standard Error

While applying adjustment in the estimation may reduce biases in the estimates, the variability introduced by the weights may increase the variability of the estimates. It is important to understand the trade-off between bias reduction and variance increase. The variability of the estimates is calculated as the standard error (se) of the simulation mean:

  se(ȳ^W) = √[ (1/M) Σ_{m=1}^{M} (y_m^W - ȳ^W)² ],

where y_m^W is the Web sample estimate from the m-th simulation and ȳ^W is the average of the y_m^W defined in (7.1). This statistic allows us to examine the magnitude of the added variability in the estimates due to the propensity score adjustment.
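All three evaluation statistics are simple functions of the vectors of simulated estimates. A minimal sketch (vector names y.web, y.ref, and so on are assumptions, not the dissertation's code):

    bias.fcn <- function(y.web, y.ref) mean(y.web) - mean(y.ref)
    rmsd.fcn <- function(y.web, y.ref) sqrt(mean((y.web - y.ref)^2))
    se.fcn   <- function(y.web)        sqrt(mean((y.web - mean(y.web))^2))
    # percent reduction, the common form shared by (7.2) and p.rmsd:
    p.reduce <- function(stat.unadj, stat.adj) 100 * (stat.unadj - stat.adj) / stat.unadj
    # e.g. percent bias reduction as in (7.2):
    # p.reduce(bias.fcn(y.unadj, y.ref), bias.fcn(y.adj, y.ref))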
Table 7.5. Reference Sample and Unadjusted and Propensity Score Adjusted Web Sample Estimates for y_blks and y_vote

s_W.ST
         estimate   bias    p.bias   rmsd    p.rmsd   se
y_blks (M=2,000)
  y.R    0.612                                        0.0339
  y.U    0.636      0.024            0.045            0.0160
  y.D1   0.623      0.012   52.4%    0.040   9.6%     0.0221
y_vote (M=1,971)
  y.R    0.650                                        0.034
  y.U    0.715      0.065            0.075            0.015
  y.D1   0.709      0.059   9.7%     0.069   8.3%     0.022

s_W.HI
         estimate   bias    p.bias   rmsd    p.rmsd   se
y_blks (M=2,000)
  y.R    0.612                                        0.034
  y.U    0.675      0.064            0.074            0.016
  y.D1   0.638      0.026   58.6%    0.052   29.4%    0.032
y_vote (M=1,971)
  y.R    0.650                                        0.034
  y.U    0.817      0.167            0.171            0.013
  y.D1   0.724      0.074   55.7%    0.086   50.0%    0.031
Note: y.R: reference sample estimate. y.U: unadjusted Web sample estimate. y.D1: Web sample estimate after propensity score adjustment using model D1.

Table 7.5 exhibits the simulation estimates of y_blks and y_vote and their evaluation statistics when no adjustment and D1 adjustment are applied, for both s_W.ST and s_W.HI (see Appendix 3 for the same information for the adjusted estimates under all propensity models). When D1 adjustment is applied, the biases and deviations of the Web estimates from the reference sample estimates decrease dramatically. The greatest advantage of propensity score adjustment is in the samples mimicking Harris Interactive respondents: the bias reduction is larger in s_W.HI than in s_W.ST for both study variables. This echoes the statement in Cochran et al. (1954, p. 246) that "adjustment will only be seriously helpful when the sampling procedure is not random ...". The reductions in the bias of estimates from s_W.HI are 58.6% and 55.7%, and the corresponding p.rmsd's are also large, at 29.4% and 50%. Nonetheless, the adjusted estimates have larger standard errors, showing that the reduction in bias and deviation comes at the cost of increased variability. The trade-off between the decrease in deviation and the increase in standard error will be discussed in detail shortly.

7.2.3.2 Effect of Covariates in Propensity Score Models

The choice of covariates can be an important factor in the performance of propensity score adjustment. The assessment of the role of covariates is carried out exclusively using s_W.HI for several different sets of covariates. First, different propensity models are developed according to the significance of the covariates in predicting y_blks and y_vote. Using a cut-point of p = .05, the covariates in Table 7.2 are classified as highly predictive (p < .05) or weakly predictive (p ≥ .05). As a result, there are three models based only on demographic variables, as shown in Table 7.3: all demographic covariates (the base propensity model, D1); highly predictive covariates only (D2); and weakly predictive covariates only (D3).

Figure 7.1. Relationship between the Distributions of the Different Web Sample Estimates and the Reference Sample Estimates for y_blks (Warm Feelings towards Blacks)

The unadjusted (y.U) and the adjusted Web estimates using D1, D2, and D3 (y.D1, y.D2, and y.D3, respectively) are plotted against the reference sample estimate (y.R) for y_blks in Figure 7.1 and for y_vote in Figure 7.2 for all simulated samples (see Appendix 4 for scatter plots of the estimates under all propensity models for both study variables in both s_W.ST and s_W.HI). Underneath each scatter plot is displayed the corresponding rmsd for each adjustment. A diagonal y = x reference line is drawn in each panel of Figures 7.1 and 7.2. If the propensity score adjusted Web sample estimates were always equal to the reference sample estimates, then all points would fall on the reference line.
Therefore, in the scatter plots, the closer the cluster of points lies to the reference line, the smaller the disparity of the Web estimates. The scatter plot with the points closest to the identity line indicates the best adjustment method in terms of deviation, while widely dispersed clusters are evidence of increased variability.

Figure 7.2. Relationship between the Distributions of the Different Web Sample Estimates and the Reference Sample Estimates for y_vote (Voting Participation)

Figures 7.1 and 7.2 convey the same messages. Among the three adjustments, D1 and D2 outperform D3. When the propensity score model is composed of only highly predictive covariates (D2), the level of adjustment is comparable to the base model that includes all variables (D1). The propensity score adjustment based on weakly predictive covariates (D3) does not improve the point estimates to any degree. The figures also illustrate the increased variability of the estimates when propensity score adjustment weights are used: once the weights are incorporated, the scatter plots in panels 2 and 3 show higher variability. In particular, estimates from the better performing models show widely scattered distributions. In the case of propensity model D3 for y_blks, the adjustment increases variability without decreasing the deviation to any degree, which ultimately worsens the quality of the estimates in an absolute sense.

Next, we examine the importance of including nondemographic (or attitudinal) variables in the propensity score model by comparing four different models: all demographic covariates (D1), all nondemographic covariates (N1), all covariates (A1 = D1 + N1), and all demographic plus significant nondemographic covariates (Model 4). Again, Table 7.3 shows the variables included in each model. The distributions of the adjusted estimates using these models are displayed in Figure 7.3 along with those of the reference sample estimates (y.R) and the unadjusted estimates (y.U) (see Appendix 5 for box plots of the estimates under all propensity models for both study variables in both s_W.ST and s_W.HI).

Figure 7.3. Distributions of the Web Estimates by Different Propensity Score Adjustments

For both study variables, the reference sample estimates (y.R) are more widely distributed than the unadjusted Web sample estimates (y.U). This is not surprising, since the Web samples are four times the size of the reference samples. However, the distributions of y.U do not contain the simulation means of y.R; for y_vote, the distributions of y.U and y.R are almost non-overlapping. Among the four adjustment models, the ones including demographic variables (D1, A1 and 4) produce less biased Web estimates than the one with nondemographics only (N1). The marginal effect of including nondemographic variables in addition to demographic ones can be seen by comparing the box plots for A1 with those for D1. Figure 7.3 shows that this effect is minimal, since the performance of A1 and D1 is comparable. Although the distributions of the adjusted estimates differ noticeably, none of the methods successfully removes the deviation. As in Figures 7.1 and 7.2, the variance of the Web sample estimates increases when adjustment weights are applied and when the adjustments are more effective in reducing deviation. The increase in variance is primarily due to including the demographic covariates in the propensity models. This may be translated into the significance of these covariates in predicting the propensity score,

  e(x_i) = Pr(i ∈ s_W | x_i), i = 1, ..., n.
However, it should be noted that the variability of the estimates from the effective models can be as large as that of the reference sample estimates, meaning that the precision obtained from the larger sample size of Web surveys may be completely lost.

7.2.3.3 Discussion

This section illustrates the application of propensity score adjustment alone to volunteer panel Web surveys. The adjustment decreases, but does not eliminate, the difference between the benchmark sample estimates and the Web sample estimates, and this reduction comes at the cost of increased variance. The relationship between the covariates and the study variables is found to be important in forming propensity models, since propensity models with weakly predictive covariates do not decrease the deviation but do add to the variability. It seems a reasonable practice to include all available covariates from the given data set, as Rubin and Thomas (1996) suggest. The assertion that including nondemographic variables in the propensity models is useful is not verified here, as the value of including nondemographic variables appears limited in comparison to demographic ones. This may be due to the nature of the two study variables, warm feelings towards Blacks and voting behavior, which are highly correlated with demographic variables such as race and education.

The Web sample estimates are compared to reference sample estimates, which may themselves be contaminated by sampling and nonresponse error. It therefore seems a logical approach to combine the propensity score adjustment weights with additional weights that project the adjusted Web samples to the general population. For example, calibration adjustment using the general regression estimation proposed in the previous chapter may be an alternative. The combination of the two weights may reduce selection bias in Web surveys to a greater degree. This will be examined in the following section.

7.2.4 Calibration Adjustment

In this section, we apply calibration adjustment as described in Section 6.3, using the propensity score adjusted weights as the starting point. More specifically, the weight in (6.16) is applied as in (6.15) in order to correct for the remaining discrepancies between the propensity score adjusted Web sample estimates and the population values. This procedure takes the Web sample covariates, already balanced to the probabilistically drawn reference sample, and further balances them to the target population. While the propensity score adjustment operates at the reference sample level to correct for the nonprobability nature of Web samples, the calibration adjustment operates at the population level to address noncoverage and nonresponse problems in survey samples (refer to Figure 6.2).

Two different sets of covariates are used in calibrating the adjustment weights for each of y_blks and y_vote. For y_blks, the first calibration (Calibration 1) uses age, educ, race, gender, region, and ethnofit as listed in Table 7.2, and the second (Calibration 2) uses all of these but ethnofit. For y_vote, Calibration 1 includes age, educ, race, gender, region, and party, whereas Calibration 2 excludes party. The adoption of different sets of covariates is meant to assess a particular advantage of calibration: the ability to include estimated population totals, in addition to known population values, as benchmarks in the adjustment. The first models will show the marginal effect of incorporating rather unconventional variables (ethnofit and party) in the adjustment.
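The dissertation's own implementation of this step, cal.fcn, is described next. As a rough illustration, the same computation can be sketched with the survey package's calibrate() function; the object names, the benchmark vector pop.totals, and the trimming bounds below are assumptions, not the actual cal.fcn code.

    library(survey)
    # 'web' is a propensity-score-adjusted Web sample with weights w.psa
    des <- svydesign(ids = ~1, weights = ~w.psa, data = web)
    # Calibration 1 for y_blks: benchmark totals for age, educ, race, gender,
    # region and ethnofit; 'pop.totals' is a named vector of population totals
    # matching the columns of the model matrix
    des.cal <- calibrate(des, ~age + educ + race + gender + region + ethnofit,
                         population = pop.totals, calfun = "linear",
                         bounds = c(0.3, 3))  # illustrative trimming bounds
    svymean(~y.blks, des.cal)

The bounds argument mirrors the "linear distance function with trimmed upper and lower bounds" described below: weight adjustment ratios outside the stated range are truncated.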
The R code for the calibration, using the linear distance function with trimmed upper and lower bounds, is cal.fcn, and the simulation is run over 2,000 iterations using cal.sim in Appendix 1.6.

7.2.5 Results of Calibration Adjustment

Adjustments are focused on s_W.HI from this section on. For brevity, four propensity models (A1, A2, A3, and 4) are combined with the two calibration adjustments (Calibration 1 and 2), resulting in 15 combinations of adjustment (= 5 (4 propensity score models + no propensity score adjustment) × 3 (2 calibrations + no calibration)). As a notational convention, we denote an estimator of the mean by y.(propensity score adjustment type).(calibration type). The unadjusted estimator is denoted by y.U and the reference sample estimator by y.R. For instance, the Web estimate using the A1 propensity score model and no calibration is denoted y.A1.n. As in the simulation in Section 7.2.1, each s_R was selected by simple random sampling without replacement with n_R = 200, and s_W.HI was a Poisson sample of size n_W.HI = 800. Table 7.6 presents the population values and the summary statistics for estimates from the reference samples and from the unadjusted and adjusted Web samples.

7.2.5.1 Performance of Calibration Adjustment

The benchmarks of calibration adjustment are the population values. Therefore, we need different evaluation criteria than when propensity score adjustment alone is used.

7.2.5.1.A Root Mean Square Error and Percent Root Mean Square Error Reduction

Since we have a fixed known value from the population, the first evaluation criterion is the root mean square error (rmse), calculated as

  rmse(ȳ) = √[ (1/M) Σ_{m=1}^{M} (y_m - Ȳ)² ],   (7.3)

where y_m is the sample estimate from the m-th simulation as in (7.1) and Ȳ is the full pseudo-population mean. The magnitude of the rmse reduction achieved by an adjustment, compared to no adjustment, can be compared across different adjustment methods and sets of covariates through the percent root mean square error reduction (p.rmse):

  p.rmse(ȳ^W.A) = 100 × [ rmse(ȳ^W.U) - rmse(ȳ^W.A) ] / rmse(ȳ^W.U),   (7.4)

where ȳ^W.A and ȳ^W.U are the adjusted and unadjusted Web survey estimates, respectively. The larger the p.rmse, the smaller the error in ȳ^W.A. A negative p.rmse indicates that the error is increased by the adjustment.

7.2.5.1.B Bias and Percent Bias Reduction

The error component that the propensity score and calibration adjustments attempt to decrease is the bias, the difference between the expected value of the sample estimate and the population value:

  bias(ȳ) = E(ȳ) - Ȳ.   (7.5)

From one realization of samples, the bias cannot be estimated from (7.5), because the expected sample estimate, E(ȳ), is not available. However, simulation makes (7.1) approximate the expected Web sample estimate. As usual, the square of the rmse can be decomposed into two components, variance (var) and squared bias,

  [rmse(ȳ)]² = var(ȳ) + [bias(ȳ)]²,

which produces

  bias(ȳ) = √[ rmse(ȳ)² - var(ȳ) ].   (7.6)

The standardized measure of bias reduction achieved by the adjustment is the percent bias reduction (p.bias), computed like (7.2) as

  p.bias(ȳ^W.A) = 100 × [ bias(ȳ^W.U) - bias(ȳ^W.A) ] / bias(ȳ^W.U).   (7.7)

Just like p.rmse, a larger p.bias indicates that the adjustment accomplishes bias reduction to a greater degree, and a negative p.bias indicates that the adjustment increases the bias.
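Under the definitions in (7.3)-(7.7), the rmse and the bias recovered through the decomposition (7.6) can be computed directly from the vector of simulated estimates. A minimal sketch (vector names assumed):

    # 'y.sim' is a vector of M simulated estimates; 'Y.pop' the population mean
    rmse.fcn <- function(y.sim, Y.pop) sqrt(mean((y.sim - Y.pop)^2))
    bias.decomp <- function(y.sim, Y.pop) {
      r2 <- mean((y.sim - Y.pop)^2)           # squared rmse
      v  <- mean((y.sim - mean(y.sim))^2)     # variance over simulations
      sqrt(r2 - v) * sign(mean(y.sim) - Y.pop)  # |bias| from (7.6), with sign
    }
    # check against Table 7.6, y.U for y_blks: sqrt(0.064^2 - 0.016^2) = 0.062,
    # which matches the tabled bias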
7.2.5.1.C Standard Error and Percent Standard Error Increase

The variability of the estimates is measured by the standard error (se), calculated as

  se(ȳ) = √var(ȳ).   (7.8)

The adjusted estimates are expected to have larger standard errors than the unadjusted ones, as the weights used in the adjustment are likely to introduce extra variability into the estimates. The impact of adjustment on the variability can be measured with the percent standard error increase (p.se):

  p.se(ȳ^W.A) = 100 × [ se(ȳ^W.A) - se(ȳ^W.U) ] / se(ȳ^W.U).   (7.9)

Since estimates with smaller variability are considered better, estimates with smaller p.se are preferred. A negative p.se indicates that the variance is decreased by the adjustment.

From Table 7.6, it is observed that calibration adjustment applied to the propensity score adjusted estimates improves the accuracy of the Web estimates: the bias reduction is larger than when propensity score adjustment alone is used. The effectiveness of adding calibration is most striking when the propensity score adjustment alone is not successful; for A3, for example, the improvement from adding calibration is clearly apparent. Of the two calibration methods, Calibration 1, which includes a substantive variable, shows better bias reduction than Calibration 2.

Table 7.6. Comparison of Population Values, Reference Sample Estimates and Web Sample Estimates for y_blks and y_vote

          estimate   rmse    bias     se      p.rmse   p.bias   p.se
y_blks
y.pop     0.614      -       -        -       -        -        -
y.R       0.612      0.034   -0.002   0.034   -        -        -
y.U       0.675      0.064    0.062   0.016    0.0%     0.0%     0.0%
y.A1.n    0.629      0.036    0.016   0.032   44.2%    74.7%   103.5%
y.A1.1    0.621      0.033    0.008   0.032   47.5%    87.1%   107.0%
y.A1.2    0.625      0.035    0.012   0.033   45.3%    81.0%   109.1%
y.A2.n    0.642      0.043    0.029   0.032   33.1%    53.7%   101.4%
y.A2.1    0.632      0.036    0.018   0.032   42.8%    70.5%   101.2%
y.A2.2    0.636      0.039    0.022   0.032   39.2%    64.2%   102.8%
y.A3.n    0.669      0.059    0.055   0.021    6.9%    10.3%    35.7%
y.A3.1    0.638      0.037    0.024   0.028   41.7%    60.6%    79.2%
y.A3.2    0.647      0.043    0.033   0.028   32.0%    45.9%    76.2%
y.4.n     0.635      0.038    0.021   0.032   39.7%    65.6%   104.0%
y.4.1     0.626      0.035    0.012   0.032   45.5%    79.9%   106.7%
y.4.2     0.630      0.037    0.016   0.033   42.7%    73.7%   108.6%
y_vote
y.pop     0.648      -       -        -       -        -        -
y.R       0.650      0.034    0.002   0.034   -        -        -
y.U       0.817      0.169    0.169   0.013    0.0%     0.0%     0.0%
y.A1.n    0.718      0.078    0.070   0.032   54.3%    58.3%   151.2%
y.A1.1    0.713      0.072    0.066   0.030   57.6%    61.2%   130.8%
y.A1.2    0.715      0.074    0.067   0.031   56.6%    60.6%   142.3%
y.A2.n    0.716      0.075    0.068   0.032   55.7%    59.8%   148.9%
y.A2.1    0.711      0.069    0.063   0.030   59.0%    62.8%   130.0%
y.A2.2    0.712      0.071    0.064   0.031   58.1%    62.1%   140.0%
y.A3.n    0.818      0.171    0.170   0.014   -0.7%    -0.7%    10.2%
y.A3.1    0.755      0.109    0.107   0.022   35.7%    36.9%    74.7%
y.A3.2    0.766      0.120    0.118   0.022   29.0%    30.0%    74.9%
y.4.n     0.718      0.077    0.070   0.032   54.6%    58.7%   150.4%
y.4.1     0.712      0.071    0.064   0.030   58.2%    61.9%   129.6%
y.4.2     0.714      0.073    0.066   0.031   57.2%    61.2%   141.5%
Note: The figure for the best estimate (excluding y.R and y.U) is highlighted in bold/italic in each column.

However, calibration does tend to increase the standard errors compared to the unadjusted estimates. Figure 7.4 plots p.bias against p.se for all Web sample estimates and depicts the trade-off between the two: a surprisingly clear positive relationship. The fitted linear regression explains a large share of the variability. This again confirms the earlier finding that the increased accuracy from the adjustment comes at the cost of increased variability.
Figure 7.4. Relationship between Percent Bias Reduction and Percent Standard Error Increase in Unadjusted and Adjusted Web Sample Estimates (Panel A, y_blks: fitted line y = 0.7258x - 0.0583, R² = 0.825; Panel B, y_vote: fitted line y = 0.4132x + 0.0325, R² = 0.9346)

7.2.5.2 Discussion

The combination of propensity score adjustment and calibration adjustment seems to serve the aim of adjustment better than propensity score adjustment alone. Three things can be improved in the subsequent case study. First, the degrees of bias decrease and variability increase due to the adjustments are only described informally in this section; a statistical test is needed to verify the extent to which these conclusions hold. Second, the significance of the covariates in this section is examined only in relation to the substantive study variables, y, not to the treatment variable, g. Covariate and model selection may be modified by incorporating both y and g, which will allow us to examine the role of covariates more extensively. Lastly, the subclassification based on the propensity scores for voting behavior was not completed in 29 out of 2,000 simulations because some subclasses contained zero reference sample units. Suppose that the reference sample data were originally collected for the general population and that only a subset (such as military veterans) were to be used because the Web survey target population is a subgroup of the general population. In that case, one may have a small number of reference sample cases for that particular Web survey. The reference survey should therefore be large enough that the reference samples for any Web survey target population have a sufficient number of observations for forming the quintile subclassification.

7.3 Case Study 2: Application of Propensity Score Adjustment and Calibration Adjustment to 2003 Michigan Behavioral Risk Factor Surveillance Survey Data

7.3.1 Construction of Pseudo-population and Sample Selection for Simulation

A more elaborate examination of the adjustment is carried out in this case study with data from the 2003 Michigan Behavioral Risk Factor Surveillance System (BRFSS). The BRFSS is a collaborative project of the Centers for Disease Control and Prevention and the U.S. states, Washington, D.C., and territories, designed to measure behavioral risk factors in the adult population (18 years of age or older) living in households (CDC, 1998). The 2003 Michigan BRFSS data consist of four quarterly data sets collected by the Institute for Public Policy and Social Research at Michigan State University. The respondents were selected by the random digit dialing method with disproportionate allocation for strata defined by geographic area, phone bank density, and the probability of each phone number being listed (Michigan Department of Community Health, 2003). The original 2003 Michigan BRFSS data contain 3,551 units. Among them, the 3,410 cases without item nonresponse on Web access ownership and the four stratifying variables (age, gender, race and education) are retained for the study to form the pseudo-population data.
Table 7.7. Distribution of Age, Gender, Education and Race of BRFSS Full Sample, BRFSS Web Users and Harris Interactive Survey Respondents

                          High School or Less     Some College or Above
                          White      NonWhite     White      NonWhite
A. BRFSS Full Sample (n=3,410)
  ≤ 40 yrs    Female       5.01%      1.35%       10.32%      2.35%
              Male         3.58%      1.09%        6.48%      1.23%
  41 yrs +    Female      16.57%      2.49%       20.29%      2.23%
              Male        10.29%      1.17%       14.13%      1.41%
  Sum: 100%
B. BRFSS Web Users (n=1,250)
  ≤ 40 yrs    Female       5.28%      0.83%       13.98%      2.22%
              Male         3.29%      0.83%        8.70%      1.44%
  41 yrs +    Female      10.56%      0.97%       23.29%      2.08%
              Male         7.18%      0.65%       17.18%      1.53%
  Sum: 100%
C. Harris Interactive Respondents (n=8,195)
  ≤ 40 yrs    Female       2.03%      1.64%       13.28%     13.37%
              Male         0.85%      0.61%        7.58%      9.09%
  41 yrs +    Female       2.45%      0.48%       15.58%      4.58%
              Male         1.70%      0.24%       20.82%      5.71%
  Sum: 100%

Table 7.7 compares the distribution of the four stratifying variables among all respondents in the BRFSS, Web users in the BRFSS, and Harris Interactive survey respondents (the same respondents as in Case Study 1). The results of the comparison echo what has been observed previously. Harris Interactive survey respondents over-represent Nonwhites and more educated and younger people compared to BRFSS respondents, and this tendency remains even when the Web access owners in the BRFSS are compared to the Harris Interactive survey respondents. When interpreting Table 7.7, one should bear in mind that the target population of the BRFSS is Michigan residents, while the Harris Interactive survey targets the general U.S. population. Even so, since the purpose of this study is not to document the discrepancy between the two data sets but to investigate whether this discrepancy can be reduced by the proposed statistical adjustment, this difference should not degrade the value of the study. The table shows that there are still considerable gaps in the distributions of education and race between the two sets of BRFSS respondents and the HI respondents.

As in Case Study 1, a BRFSS pseudo-population of size 20,000 is created by bootstrapping the 3,410 BRFSS respondents with replacement. Within this pseudo-population, 12,674 people indicated that they have Web access at home (question wording: "Do you have access to the Internet at home?"), and these are treated as the Web pseudo-population. This yields 63.4% Web access owners in the pseudo-population, very close to the 63.3% (= 2,160/3,410) of Web owners in the original BRFSS data.

The aims of Case Study 2 differ slightly from those of the previous case study. First, the emphasis is placed on propensity score model development in a more practical situation. Second, variance estimation methods are examined for estimates calculated with propensity score adjustment weights and/or calibration weights. Third, the effectiveness of different propensity score models and calibration methods is assessed with significance tests. Therefore, Case Study 2 uses slightly different simulation functions than the first case study.

Using the full pseudo-population data, a reference sample of size 500 is drawn with ref.sam. For the Web samples, instead of drawing different types of Web samples, only samples resembling Harris Interactive survey respondents, i.e. s_W.HI, are examined. Web samples of size 1,500 are drawn from the Web pseudo-population with an allocation proportional to the stratum distribution of the Harris Interactive respondents in Table 7.7, using pois.sam. The sample sizes here are larger than those used in the first case study in order to avoid situations where weighting based on propensity score adjustment becomes impossible due to zero observations in the subclasses in (6.1). These samples are drawn in 3,200 simulations. (The number of simulations is increased in this case study in order to compute the variance and the confidence interval more reliably.)
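Pulling the pieces together, the per-simulation cycle just described (draw s_R and s_W.HI, adjust, estimate) has roughly the following shape. This reuses the illustrative sketch functions introduced earlier and is only a stand-in for the actual psa.sim/cal.sim drivers of Appendices 1.5-1.7; all names are assumptions.

    M <- 3200
    est <- matrix(NA, M, 3, dimnames = list(NULL, c("HBP", "SMOKE", "ACT")))
    for (m in 1:M) {
      ref <- ref.sam.sketch(PF.brfss, n.R = 500)    # reference sample of size 500
      web <- pois.sam.sketch(PW.brfss, cell.prob)   # Web sample resembling HI respondents
      w   <- psa.weights.sketch(ref, web, g ~ age + educ + gender + race)  # e.g. Model 1
      est[m, ] <- c(weighted.mean(web$y1, w),       # HBP
                    weighted.mean(web$y2, w),       # SMOKE
                    weighted.mean(web$y3, w))       # ACT
    }
    colMeans(est)  # simulation means as in (7.1)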
7.3.2 Adjustments

7.3.2.1 Propensity Score Adjustment

Case Study 1 used different PSA models for different variables. However, this type of modeling is unlikely to be exercised in a real setting, because it would require a different set of weights for each study variable whenever more than one variable is estimated. In practice, one propensity model is likely to be applied to derive weights for all study variables. In order to implement propensity score adjustment, the following modeling method is used in this study. First, one reference sample of size 500 and one Web sample of size 1,500 are drawn as described previously. Then these two samples are merged into one data set of 2,000 cases. The base propensity score model is constructed from this merged data set based on the relationship between g and x, not between y and x as in Case Study 1. Five different logistic models are used for propensity score adjustment, as in (7.10). For the base model (Model 2), the vector x_i for person i includes all 30 covariates listed in Table 7.8, such that

  ln[ Pr(g_i = 1) / (1 - Pr(g_i = 1)) ] = α + B′x_i,   (7.10)

where i = 1, ..., n, B and x_i are 30 × 1 vectors, and n is the total number of cases in the merged data set. Models 1, 3, 4, and 5 use subsets of the covariates in Table 7.8. Model 3 retains the marginally significant covariates with p ≤ 0.2 in Model 2 and thus tests the role of significant covariates in predicting g in propensity score models. In order to detect the marginal effect of the stratifying variables used at the sampling stage, Models 1, 4 and 5 are constructed: Model 1 includes the stratifying variables only; Model 4 excludes the variables in Model 1 from Model 2 (i.e., Model 4 uses all variables except the stratifiers); and Model 5 excludes the variables in Model 1 from Model 3 (i.e., Model 5 includes the covariates significant at the 20% level but excludes the stratifiers). Details of these models are shown in Table 7.9.
Table 7.8. List of Covariates Used for Propensity Modeling

Covariate   Type          Description
ghealth     Continuous    General health (1: excellent, 2: very good, 3: good, 4: fair, 5: poor)
coverage    2 categories  Having health care coverage (1: yes, 2: no)
doctor      2 categories  Having a personal doctor/health care provider (1: yes, 2: no)
cprevent    2 categories  Cost prevented a doctor's visit in the past 12 months (1: yes, 2: no)
phyact      2 categories  Participated in any physical activities other than regular job during the past month (1: yes, 2: no)
diabete     2 categories  Ever told by a doctor to have diabetes (1: yes, 2: no)
cholest     2 categories  Ever checked blood cholesterol (1: yes, 2: no)
losewgt     2 categories  Trying to lose weight (1: yes, 2: no)
wgtadv      2 categories  Weight advice given by a health professional in the past 12 months (1: yes, 2: no)
asthma      2 categories  Ever told by a doctor to have asthma (1: yes, 2: no)
flushot     2 categories  Had a flu shot in the past 12 months (1: yes, 2: no)
pneumon     2 categories  Ever had a pneumonia shot (1: yes, 2: no)
sunburn     2 categories  Had a sunburn in the past 12 months (1: yes, 2: no)
age         Continuous    Age in years
educ        Continuous    Education
income      Continuous    Household income
weight      Continuous    Current weight
numphone    Continuous    Number of residential phone lines
gender      2 categories  Gender (1: male, 2: female)
jointsym    2 categories  Had any symptoms of pain, aching, or stiffness around a joint in the past 30 days (1: yes, 2: no)
limitact    2 categories  Limited in any activities because of physical, mental or emotional problems (1: yes, 2: no)
modact      2 categories  Moderate activities for at least 10 minutes in a usual week when not working (1: yes, 2: no)
army        2 categories  Ever served on active duty in the United States Armed Forces (1: yes, 2: no)
cellphon    2 categories  Have a cell or mobile phone (1: yes, 2: no)
alcohol     Continuous    Amount of alcohol consumption
hhsize      Continuous    Household size
work        2 categories  Work full time (1: yes, 2: no)
marry       2 categories  Marital status (1: married, 2: other)
race        2 categories  Race (1: White, 2: other)
veggie      Continuous    Amount of vegetable consumption

Propensity score Models 1 through 5 are applied to derive five sets of weights, which are used for the three study variables: y1, whether respondents have high blood pressure (HBP); y2, whether respondents have smoked 100 or more cigarettes (SMOKE); and y3, whether respondents do vigorous physical activities (ACT). (Question wordings: y1 (HBP): "Have you ever been told by a doctor, nurse or other health professional that you have high blood pressure?" y2 (SMOKE): "Have you smoked at least 100 cigarettes in your entire life?" y3 (ACT): "Now thinking about the vigorous physical activities you [fill in] in a usual week, do you do vigorous activities for at least 10 minutes at a time, such as running, aerobics, heavy yard work, or anything else that causes a large increase in breathing or heart rate?")

In order to examine the relationship between the covariates in the propensity models and the study variables, the same sets of covariates in Models 1 through 5 are fitted to predict y1, y2, y3, and g (whether the unit is from the Web survey sample or the reference survey sample) in the original BRFSS data (n=3,410). The p-value of each covariate in every model is shown in Table 7.9. The predictive performance of each model is assessed using the Akaike Information Criterion (AIC), computed as

  AIC = -2 log L̂ + 2K,

where L̂ is the likelihood statistic and K is the number of parameters in the model. The smaller the AIC, the better fitting the propensity score model; the AIC penalizes model complexity by increasing with the number of parameters.
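The model-building rule just described (keep the covariates significant at the 20% level, then compare fits by AIC) can be sketched as follows. This is illustrative rather than the dissertation's code; merged and covars are assumed names for the merged data set and the vector of 30 covariate names, and with categorical covariates one would test whole terms (e.g. via drop1()) rather than single coefficients.

    # Model 2: all 30 covariates
    m2 <- glm(g ~ ., family = binomial, data = merged[, c("g", covars)])
    # Model 3: covariates marginally significant at p <= 0.2 in Model 2
    p.vals <- summary(m2)$coefficients[-1, "Pr(>|z|)"]
    m3 <- glm(reformulate(names(p.vals)[p.vals <= 0.2], response = "g"),
              family = binomial, data = merged)
    c(AIC(m2), AIC(m3))  # the smaller AIC indicates the better-fitting model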
Not surprisingly, propensity score Models 2 and 3 fit better across the four dependent variables than the other models, as Model 2 contains more covariates and Model 3 contains only significant ones. Model 4, which includes all covariates except the stratifying variables, is also competitive based on the AIC.

Table 7.9. Propensity Score Models and P-values of Covariates for Different Dependent Variables

            Dependent Variable
Covariate   g: WEB    y1: HBP   y2: SMOKE  y3: ACT
MODEL 1
age         0.3218    <0.0001   0.1042     <0.0001
educ        0.0000    0.0065    <0.0001    <0.0001
gender      0.0726    0.6029    <0.0001    <0.0001
race        0.0000    0.0001    0.0160     <0.0001
AIC         2008.8    3757.6    4552.4     4278.3
MODEL 2
ghealth     0.1806    <0.0001   0.0004     <0.0001
coverage    0.4073    0.4904    0.0164     0.1199
doctor      0.1045    0.1435    0.5436     0.5270
cprevent    0.0221    0.3360    0.0013     0.3121
phyact      0.3604    0.3266    0.4718     <0.0001
diabete     0.0480    0.0001    0.1825     0.5117
cholest     0.4914    <0.0001   0.9139     0.0063
losewgt     0.0350    0.1837    0.5822     0.0296
wgtadv      0.2986    0.0008    0.8317     0.1817
asthma      0.4106    0.2167    0.9845     0.2333
flushot     0.8168    0.0021    0.0449     0.9012
pneumon     0.3888    0.8610    0.0660     0.1647
sunburn     0.1466    0.2629    0.0117     0.0215
age         0.6221    <0.0001   0.8366     <0.0001
educ        <0.0001   0.6441    <0.0001    0.2376
income      0.0097    0.1335    0.9379     0.1250
weight      0.5240    <0.0001   0.0557     0.1128
numphone    0.4489    0.5027    0.7071     0.4085
gender      0.1632    0.2615    0.0738     <0.0001
jointsym    0.8323    0.3379    0.0012     0.3371
limitact    0.0342    0.3663    0.0508     0.0048
modact      0.4473    0.8883    0.2546     <0.0001
army        0.1326    0.6561    <0.0001    0.6325
cellphon    0.0039    0.0786    0.2519     0.0972
alcohol     0.7995    0.2469    <0.0001    0.0096
hhsize      0.7981    0.0235    0.1913     0.0405
work        0.6905    0.1298    0.7322     0.0339
marry       0.2720    0.3171    0.4568     0.0220
race        <0.0001   0.1219    0.0083     0.1930
veggie      0.7797    0.1457    0.0448     <0.0001
AIC         2004.1    2729.8    3486.2     3139.2
MODEL 3
ghealth     0.0880    <0.0001   <0.0001    <0.0001
doctor      0.1940    <0.0001   0.8831     0.4741
cprevent    0.0116    0.2333    <0.0001    0.0052
diabete     0.1087    <0.0001   0.2517     0.0471
losewgt     0.0559    <0.0001   0.3762     0.0006
sunburn     0.3119    <0.0001   0.0085     <0.0001
educ        <0.0001   0.9828    <0.0001    0.0297
income      0.0006    0.0002    0.8408     0.0004
gender      0.1173    0.5012    0.0090     <0.0001
limitact    0.1297    0.0002    0.0219     <0.0001
army        0.1959    <0.0001   <0.0001    0.0009
cellphon    0.0016    0.4797    0.6566     0.0424
race        <0.0001   0.3687    0.0045     0.4833
AIC         1981.1    3312.6    3914.9     3777.2
MODEL 4
ghealth     0.0152    <0.0001   <0.0001    <0.0001
coverage    0.1835    0.8748    0.0059     0.0229
doctor      0.3021    0.0526    0.4018     0.7992
cprevent    0.0501    0.8298    0.0031     0.3112
phyact      0.0174    0.4261    0.1576     <0.0001
diabete     0.6122    <0.0001   0.1343     0.3520
cholest     0.8262    <0.0001   0.4369     0.2597
losewgt     0.0540    0.1638    0.3007     0.5248
wgtadv      0.4037    0.0022    0.8598     0.0730
asthma      0.5499    0.0532    0.9058     0.5355
flushot     0.9521    <0.0001   0.0514     0.2514
pneumon     0.2047    0.1082    0.0147     0.0193
sunburn     0.9247    0.0016    0.0009     <0.0001
income      <0.0001   0.0259    0.0783     0.0662
weight      0.6802    <0.0001   0.2382     0.1460
numphone    0.6040    0.2946    0.8867     0.2045
jointsym    0.2648    0.0508    0.0016     0.8521
limitact    0.0031    0.6016    0.1116     0.0079
modact      0.7416    0.4107    0.2555     <0.0001
army        0.0289    0.6521    <0.0001    0.3924
cellphon    0.0001    0.2652    0.0446     0.0875
alcohol     0.1280    0.2824    <0.0001    0.0003
hhsize      0.1190    <0.0001   0.2480     0.0009
work        0.2855    <0.0001   0.7928     <0.0001
marry       0.0879    0.0311    0.9053     0.0014
race        0.3698    0.5276    0.0020     <0.0001
AIC         2147.1    2786.0    3546.2     3198.9
Table 7.9 (continued)

            Dependent Variable
Covariate   g: WEB    y1: HBP   y2: SMOKE  y3: ACT
MODEL 5
ghealth     0.0005    <0.0001   <0.0001    <0.0001
doctor      0.4131    <0.0001   0.7442     0.0725
cprevent    0.0154    0.2023    <0.0001    0.0140
diabete     0.7052    <0.0001   0.1861     0.0319
losewgt     0.0466    <0.0001   0.1201     0.0082
sunburn     0.8364    <0.0001   0.0004     <0.0001
income      <0.0001   0.0002    0.0173     <0.0001
limitact    0.0122    0.0001    0.0513     <0.0001
army        0.0088    <0.0001   <0.0001    0.5847
cellphon    0.0001    0.5391    0.1275     0.0840
AIC         2136.4    3307.9    3897.2     3819.1

7.3.2.2 Calibration Adjustment

Two different sets of calibration variables are employed to test the effect of including population estimates of substantive variables. The first calibration adjustment projects the weighted Web sample onto the pseudo-population with respect to age, gender, educ and race (Calibration 1). This resembles generalized ratio-raking using known population figures. Calibration 2 expands the first by adding a key health variable, ghealth: "Would you say that in general your health is excellent, very good, good, fair or poor?" Although ghealth is a rather untraditional variable to include in calibration, our three study variables are all highly health-related. Therefore, the inclusion of ghealth in the calibration adjustment is expected to improve the adjustment. The propensity score adjustment weights are calculated in psa.fcn. These propensity score adjusted weights and the base weights are then modified by calibrating the sample covariate estimates to the pseudo-population benchmarks using newcal.fcn (see Appendix 1.7 for the R code). Applying the calibration weights, Web sample estimates of y1, y2, and y3 are calculated in each simulation. The whole simulation is run similarly to cal.sim in Appendix 1.6.

7.3.3 Results of Adjustments

7.3.3.1 Comparison of Adjusted Estimates

For each study variable, there are population values, reference sample estimates and Web sample estimates. Since reference samples do not require propensity score adjustment, they have three types of estimates reflecting calibration adjustment status: No Calibration, Calibration 1, and Calibration 2. For the Web samples, there are 18 combinations of adjustments: (No propensity score adjustment, propensity score Models 1, 2, 3, 4, and 5) × (No Calibration, Calibration 1, and Calibration 2). The type of adjustment is denoted after the name of the variable. For instance, the unadjusted reference sample estimate of HBP is denoted y1.R.n; the Web sample estimate of SMOKE using no propensity score adjustment but Calibration 1 is y2.n.1; and the estimate of ACT using propensity score Model 5 and Calibration 2 is y3.5.2. Table 7.10 presents the simulation means of the Web sample estimates under all adjustments, together with the reference sample estimates and the population values. The distributions of the reference sample and Web sample estimates over all simulations are shown in Figure 7.5 using box plots. While the reference sample estimates are distributed around the population values, the Web sample estimates are not necessarily so. The unadjusted estimates (y1.n.n, y2.n.n and y3.n.n) are the most biased of all the alternatives. In fact, none of the ranges of unadjusted estimates from the 3,200 simulations contains the true values.
On average, when no adjustment is applied, people in the Web samples are less likely to have high blood pressure, less likely to have smoked 100 cigarettes, and more likely to do vigorous physical activities than the population. Once adjustment is incorporated, the discrepancies between the Web estimates and the population figures tend to decrease. Some of the adjusted Web sample estimates are almost unbiased: estimates such as y1.1.n, y1.1.2, y2.n.2, y2.1.1, y2.2.1, y2.2.2, y2.3.1, y2.3.2, y2.4.1, y2.4.2, y2.5.1, y2.5.2, y3.5.1, and y3.5.2 show nearly symmetric distributions around the population values. The most striking bias reduction can be observed for SMOKE: when any combination of propensity score and calibration adjustment is applied, the means of the Web sample estimates of the proportion of people who smoked 100 or more cigarettes fall almost exactly on the population value. However, the introduction of adjustments causes the estimates to be more variable, as evidenced by larger interquartile ranges. The reduction in variance from the large sample sizes of Web surveys is offset by the bias corrections.

Table 7.10. Population Values, Reference Sample Estimates and Web Sample Estimates for HBP, SMOKE and ACT

Adjustment
Combination   y1: HBP   y2: SMOKE   y3: ACT
Pop           0.3201    0.5276      0.4349
R.n           0.3197    0.5278      0.4352
R.1           0.3197    0.5279      0.4351
R.2           0.3197    0.5278      0.4352
n.n           0.2766    0.4722      0.5117
n.1           0.2864    0.5194      0.4758
n.2           0.3022    0.5330      0.4653
1.n           0.3205    0.4985      0.4738
1.1           0.3114    0.5295      0.4660
1.2           0.3197    0.5356      0.4583
2.n           0.3042    0.4998      0.4604
2.1           0.3029    0.5319      0.4525
2.2           0.3050    0.5333      0.4502
3.n           0.2882    0.5012      0.4612
3.1           0.2934    0.5330
3.2           0.2945    0.5337      0.4513
4.n           0.2898    0.4820      0.4575
4.1           0.3079    0.5320      0.4221
4.2           0.2996    0.5260      0.4251
5.n           0.2955    0.4831      0.4733
5.1           0.3102    0.5325      0.4402
5.2           0.3020    0.5269      0.4419

7.3.3.2 Performance of Adjustments on Error Reduction

The mechanisms of error reduction are examined in greater depth in this section. Table 7.11 (panels A-C) summarizes the error properties of all Web sample estimates, calculated as in (7.3), (7.4), (7.6), (7.7), (7.8) and (7.9). It also includes standardized error properties: the standardized rmse (s.rmse), standardized bias (s.bias), and standardized se (s.se), defined as rmse, bias, and se divided by the simulation mean in (7.1). These standardized errors and the percentage figures allow unit-free comparisons of the magnitude of error reduction across all variables.

As discussed in the previous section, there is a notable reduction in bias from using propensity score and calibration adjustment. The reduction in bias averages 68.2%, ranging from 17.7% for y.4.n of SMOKE to 99.2% for y.1.n of HBP. When both propensity score and calibration adjustment are employed, the average bias reduction is 78.8%. Propensity score adjustment or calibration adjustment alone does not remove as much bias as the two combined.

Figure 7.5. Simulation Means of All Web Sample Estimates and Reference Sample Estimates and Population Values

Although we utilize the adjustments to reduce bias, it is worthwhile to examine how they change the overall error structure. This is done by comparing rmse, s.rmse and p.rmse among the different adjustment methods, presented in Tables 7.11.A, 7.11.B and 7.11.C. The magnitude of the rmse reduction is smaller than that of the bias reduction.
While the adjustment on SMOKE decreases the bias by as much as 98.8%, the reduction in rmse is only half of that, although still substantial compared to no adjustment.

Table 7.11.A. Error Properties of Reference Sample and Web Sample Estimates for Proportion of People with High Blood Pressure (y1: HBP)

Adj.   rmse    s.rmse  p.rmse   bias     s.bias   p.bias   se      s.se    p.se
R.n    0.0204                   -0.0004                    0.0204
R.1    0.0192                   -0.0004                    0.0192
R.2    0.0188                   -0.0004                    0.0188
n.n    0.0448  0.1618  -        -0.0435  -0.1574  -        0.0104  0.0376  -
n.1    0.0404  0.1412  9.7%     -0.0337  -0.1178  22.5%    0.0223  0.0778  114.4%
n.2    0.0286  0.0946  36.1%    -0.0179  -0.0591  59.0%    0.0223  0.0739  114.8%
1.n    0.0247  0.0770  44.9%     0.0004   0.0012  99.2%    0.0247  0.0770  137.3%
1.1    0.0259  0.0832  42.1%    -0.0087  -0.0279  80.1%    0.0244  0.0784  134.8%
1.2    0.0232  0.0727  48.1%    -0.0004  -0.0012  99.1%    0.0232  0.0727  123.6%
2.n    0.0267  0.0878  40.3%    -0.0159  -0.0523  63.5%    0.0215  0.0705  106.4%
2.1    0.0302  0.0998  32.4%    -0.0172  -0.0570  60.4%    0.0248  0.0820  139.0%
2.2    0.0284  0.0930  36.7%    -0.0151  -0.0496  65.2%    0.0240  0.0786  130.6%
3.n    0.0380  0.1319  15.1%    -0.0319  -0.1107  26.7%    0.0206  0.0716   98.6%
3.1    0.0362  0.1233  19.2%    -0.0267  -0.0909  38.7%    0.0244  0.0833  135.1%
3.2    0.0347  0.1180  22.4%    -0.0256  -0.0869  41.2%    0.0235  0.0798  126.1%
4.n    0.0339  0.1169  24.3%    -0.0303  -0.1045  30.5%    0.0152  0.0524   46.1%
4.1    0.0294  0.0955  34.3%    -0.0122  -0.0395  72.0%    0.0268  0.0870  157.7%
4.2    0.0326  0.1087  27.2%    -0.0205  -0.0685  52.9%    0.0253  0.0845  143.5%
5.n    0.0283  0.0956  36.8%    -0.0246  -0.0831  43.6%    0.0140  0.0473   34.6%
5.1    0.0284  0.0915  36.6%    -0.0099  -0.0320  77.2%    0.0266  0.0858  155.9%
5.2    0.0310  0.1025  30.8%    -0.0181  -0.0600  58.4%    0.0251  0.0832  141.7%
Note: The figure for the best estimate (excluding y.R and y.U) is highlighted in bold/italic in each column.

Table 7.11.B. Error Properties of Reference Sample and Web Sample Estimates for Proportion of People Who Smoked 100 Cigarettes or More (y2: SMOKE)

Adj.   rmse    s.rmse  p.rmse   bias     s.bias   p.bias   se      s.se    p.se
R.n    0.0222                    0.0003                    0.0222
R.1    0.0218                    0.0003                    0.0218
R.2    0.0217                    0.0003                    0.0217
n.n    0.0566  0.1199  -        -0.0554  -0.1172  -        0.0119  0.0251  -
n.1    0.0265  0.0510  53.2%    -0.0082  -0.0158  85.2%    0.0252  0.0485  112.4%
n.2    0.0256  0.0479  54.9%     0.0055   0.0103  90.1%    0.0250  0.0468  110.4%
1.n    0.0368  0.0738  35.0%    -0.0290  -0.0582  47.6%    0.0226  0.0454   90.7%
1.1    0.0267  0.0504  52.9%     0.0020   0.0037  96.4%    0.0266  0.0502  124.1%
1.2    0.0272  0.0509  51.9%     0.0081   0.0151  85.4%    0.0260  0.0486  119.3%
2.n    0.0356  0.0712  37.2%    -0.0278  -0.0556  49.8%    0.0222  0.0444   87.0%
2.1    0.0278  0.0523  50.9%     0.0044   0.0082  92.1%    0.0275  0.0517  131.6%
2.2    0.0279  0.0522  50.8%     0.0058   0.0108  89.6%    0.0273  0.0511  129.7%
3.n    0.0341  0.0681  39.7%    -0.0264  -0.0526  52.4%    0.0217  0.0433   82.8%
3.1    0.0273  0.0512  51.8%     0.0054   0.0102  90.2%    0.0267  0.0502  125.3%
3.2    0.0270  0.0506  52.3%     0.0061   0.0115  88.9%    0.0263  0.0493  121.9%
4.n    0.0481  0.0997  15.1%    -0.0455  -0.0945  17.7%    0.0154  0.0320   29.9%
4.1    0.0291  0.0547  48.6%     0.0045   0.0084  91.9%    0.0288  0.0541  142.6%
4.2    0.0285  0.0541  49.7%    -0.0015  -0.0029  97.2%    0.0284  0.0541  139.7%
5.n    0.0468  0.0969  17.3%    -0.0444  -0.0919  19.8%    0.0148  0.0305   24.4%
5.1    0.0289  0.0543  48.9%     0.0049   0.0093  91.1%    0.0285  0.0535  140.2%
5.2    0.0280  0.0531  50.6%    -0.0007  -0.0013  98.8%    0.0279  0.0530  135.6%
Note: The figure for the best estimate (excluding y.R and y.U) is highlighted in bold/italic in each column.
Table 7.11.C. Error Properties of Reference Sample and Web Sample Estimates for Proportion of People Who Do Vigorous Physical Activities (y3: ACT)

Adj.   rmse    s.rmse  p.rmse   bias     s.bias   p.bias   se      s.se    p.se
R.n    0.0220                    0.0003                    0.0220
R.1    0.0211                    0.0002                    0.0211
R.2    0.0207                    0.0003                    0.0207
n.n    0.0777  0.1518  -         0.0768   0.1501  -        0.0116  0.0227  -
n.1    0.0478  0.1005  38.5%     0.0409   0.0859  46.8%    0.0248  0.0522  113.5%
n.2    0.0394  0.0846  49.3%     0.0304   0.0654  60.4%    0.0250  0.0537  114.8%
1.n    0.0443  0.0935  43.0%     0.0389   0.0821  49.3%    0.0211  0.0446   81.9%
1.1    0.0407  0.0872  47.7%     0.0311   0.0668  59.5%    0.0261  0.0561  124.8%
1.2    0.0348  0.0759  55.2%     0.0234   0.0510  69.6%    0.0258  0.0562  121.6%
2.n    0.0340  0.0738  56.3%     0.0255   0.0554  66.8%    0.0225  0.0488   93.1%
2.1    0.0326  0.0721  58.0%     0.0176   0.0389  77.1%    0.0275  0.0607  136.3%
2.2    0.0310  0.0689  60.1%     0.0153   0.0341  80.0%    0.0270  0.0599  131.9%
3.n    0.0337  0.0731  56.6%     0.0263   0.0570  65.8%    0.0211  0.0458   81.7%
3.1    0.0315  0.0697  59.4%     0.0175   0.0386  77.2%    0.0263  0.0580  125.9%
3.2    0.0306  0.0678  60.6%     0.0164   0.0363  78.7%    0.0259  0.0573  122.3%
4.n    0.0278  0.0608  64.2%     0.0226   0.0494  70.6%    0.0163  0.0356   39.9%
4.1    0.0314  0.0744  59.6%    -0.0128  -0.0304  83.3%    0.0286  0.0678  146.3%
4.2    0.0301  0.0708  61.2%    -0.0098  -0.0229  87.3%    0.0285  0.0670  145.1%
5.n    0.0411  0.0868  47.1%     0.0384   0.0811  50.0%    0.0147  0.0310   26.3%
5.1    0.0285  0.0649  63.3%     0.0053   0.0120  93.1%    0.0281  0.0637  141.3%
5.2    0.0286  0.0647  63.2%     0.0070   0.0158  90.9%    0.0277  0.0627  138.4%
Note: The figure for the best estimate (excluding y.R and y.U) is highlighted in bold/italic in each column.

The smaller degree of rmse reduction relative to bias reduction occurs because the adjustment weights add variability to the estimates as they attempt to decrease the discrepancies between the Web sample covariate distributions and the desired population distributions. The base weight, when no adjustment is made, is the same for every unit in the Web sample. As adjustments are made, the weights diverge from the base weight, and the divergence becomes even larger when the adjustments correct for large discrepancies. Recall Table 7.7, which showed sizable discrepancies in some of the covariates between the population and the volunteer panel Web survey respondents. It is therefore not surprising to see the variation in weights after applying the adjustments, shown in Table 7.12. The base weight starts at 13.34. Once an adjustment is applied, the upper and lower boundaries diverge radically from the base weight. The ratio of the largest to the smallest weight from the same adjustment ranges from 1 to 89.7.

Table 7.12. Distribution of Weights for All Adjustments over All Simulations

Adjustment
Combination   Lower Bound   Upper Bound   Ratio (Upper/Lower)
n.n           13.34          13.34          1.00
n.1            6.68         144.46         21.62
n.2            6.67         178.23         26.72
1.n            6.13          63.47         10.35
1.1            3.91         133.78         34.21
1.2            3.35         160.60         47.94
2.n            5.41          62.76         11.60
2.1            3.15         147.26         46.77
2.2            2.78         159.62         57.49
3.n            5.85          62.24         10.64
3.1            3.48         135.55         38.93
3.2            3.04         147.12         48.38
4.n            6.29          34.53          5.49
4.1            3.15         260.52         82.72
4.2            3.14         282.13         89.73
5.n            7.24          31.56          4.36
5.1            3.63         250.35         69.05
5.2            3.62         273.28         75.53

Figure 7.6 shows the relationship between the decrease in bias and the increase in variability when adjustment is applied to the Web survey estimates. As was true for the simulation in Section 7.2.4, the figure shows that the bias reduction is generally achieved at the cost of a standard error increase.
The correlation between the two is .61 (not shown in the figure). The linear regression also indicates a fairly strong relationship between the two statistics.

Figure 7.6. Relationship between Percent Bias Reduction and Percent Standard Error Increase in Adjusted Web Sample Estimates (fitted line y = 0.4025x + 0.2269, R² = 0.3732)

We examine the effectiveness of propensity score adjustment and calibration adjustment using analyses of variance (ANOVA). ANOVA models are used to predict p.rmse, p.bias, and p.se as functions of two main effects (PSTATUS: propensity score adjustment status, i.e., whether or not propensity score adjustment is used; and CSTATUS: calibration adjustment status, i.e., whether or not calibration adjustment is used) and their interaction. All three ANOVA models are significant, with F = 11.89, 20.66 and 65.34 (df = 3/50; p < 0.0001). Both propensity score adjustment status and calibration adjustment status have significant effects on p.rmse, p.bias and p.se, with p < 0.0001. Their interactions are also significant in explaining the variances of all three error properties, with p-values of 0.0045, 0.0317 and 0.0021, respectively.

Table 7.13. Least Square Means of Percent Root Mean Square Error Reduction, Percent Bias Reduction and Percent Standard Error Increase by Propensity Score Adjustment Status, Calibration Adjustment Status and Their Interactions

Effect                                  p.rmse          p.bias          p.se
Propensity Score Adjustment (PSTATUS)
  PSA (P1)                              42.9% (P2)      64.5% (P2)      102.5% (P2)
  No PSA (P2)                           20.1% (P1)      30.3% (P1)       56.7% (P1)
Calibration Adjustment (CSTATUS)
  CAL (C1)                              43.9% (C2)      69.7% (C2)      123.9% (C2)
  No CAL (C2)                           19.1% (C1)      25.1% (C1)       35.4% (C1)
Interaction (PSTATUS*CSTATUS)
  P1*C1 (1)                             47.5% (4)       78.8% (2,4)     134.4% (2,4)
  P1*C2 (2)                             38.2% (4)       50.2% (1,4)      70.7% (1,3,4)
  P2*C1 (3)                             40.3% (4)       60.7% (4)       113.4% (2,4)
  P2*C2 (4)                              0.0% (1,2,3)    0.0% (1,2,3)     0.0% (1,2,3)
Note: The levels in parentheses (superscripts in the original) indicate statistically different means at p = 0.05.

Next, the least-squares means (LS Means) shown in Table 7.13 are computed for p.rmse, p.bias and p.se for each effect in the ANOVA above. LS Means are predicted population margins; they estimate the marginal means over a balanced population (SAS Institute, 1999). For example, the ANOVA model for p.rmse is

  μ + α_i + β_j + (αβ)_{ij},

where α_i is the effect for level i of PSTATUS, β_j is the effect for level j of CSTATUS, and (αβ)_{ij} is the interaction. The LS mean of p.rmse for the P1*C1 combination is 47.5%; that is, using both propensity score adjustment and calibration is predicted to reduce the rmse by 47.5% (averaged over the five propensity score adjustment models and two calibration methods). Pairwise differences are calculated using Tukey-Kramer adjusted comparisons. The results in Table 7.13 reveal that all three error statistics become larger when either or both of the adjustments are applied. Among the four possible combinations of the two adjustment statuses, using both adjustments is superior to any other choice: its bias reduction (p.bias) is as large as 78.8%. Although the standard error becomes more than twice as large (a 134.4% increase), the root mean square error is 47.5% smaller (p.rmse) than that of the unadjusted estimate. Overall, one can say that the adjustment reduces the error in the estimates.

7.3.4 Performance of Different Propensity Score Models and Calibration Models

How each propensity score model and calibration method affects the three error properties is examined in this section.
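The ANOVA and LS means above were computed in SAS. A rough R analogue can be sketched as follows; perf is an assumed data frame with one row per adjustment combination and columns p.rmse, p.bias, p.se, PSTATUS, and CSTATUS (the latter two as factors).

    fit <- aov(p.rmse ~ PSTATUS * CSTATUS, data = perf)
    summary(fit)                       # F tests for the main effects and interaction
    model.tables(fit, type = "means")  # cell means; for a balanced layout these
                                       # coincide with the SAS LS means
    TukeyHSD(fit)                      # Tukey-Kramer style pairwise differences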
As in the previous ANOVA, p.rmse, p.bias and p.se are fitted as functions of the different types of propensity score models (PMODEL) and calibration adjustment methods (CMODEL). Note that the focus of the examination here is on the different models rather than on adjustment application status. There are six different methods under PMODEL (no propensity score adjustment and propensity score Models 1 through 5) and three under CMODEL (no calibration, Calibration 1, and Calibration 2). All ANOVA models are significant in explaining the variance in the percent error statistics (see Table 7.14). In the case of p.se, the model accounts for 97% of the variance in the percent standard error increase. The six types of propensity score modeling and the three calibration types have significantly different effects on the errors. However, their interaction is significant in explaining only p.se.

Table 7.14. Results of Analysis of Variance on Percent Root Mean Square Error Reduction, Percent Bias Reduction and Percent Standard Error Increase by Propensity Score Adjustment Models, Calibration Adjustment Models and Their Interactions

                              p.rmse                p.bias                p.se
Effect             df     SS        F Value     SS        F Value     SS        F Value
Model              17     0.7084    1.79*       2.5017    3.71**      9.3600    69.43**
Error              36     0.8363                1.4288                0.2855
Total              53     1.5447                3.9304                9.6455
                          (R2 = 0.4586)         (R2 = 0.6365)         (R2 = 0.9704)

                   df     Type 3 SS F Value     Type 3 SS F Value     Type 3 SS F Value
Propensity Score
Model (PMODEL)     5      0.2513    2.16*       0.7210    3.63**      1.2060    30.41**
Calibration Model
(CMODEL)           2      0.2603    5.60**      1.3899    17.51**     6.2374    393.25**
Interaction
(PMODEL*CMODEL)    10     0.1969    0.85        0.3908    0.98        1.9166    24.17**

Note: * p < 0.1, ** p < 0.05

Table 7.15 provides more detailed information on the performance of the different propensity score models and calibration methods. It displays the least square means of p.rmse, p.bias and p.se by the PMODEL and CMODEL effects included in the ANOVA of the previous table. Contrary to expectations, the table does not convey a clear-cut message about the superiority of particular propensity score models or calibration methods. In general, propensity score Models 1 and 2 are preferable: although their statistical significance does not always hold, the direction is obvious. Propensity score Model 1 includes the four stratifying variables (age, educ, gender, race); Model 2 includes all 30 covariates in Table 7.9. The importance of Model 1 is logical, because the Web samples are drawn based on stratification on those variables but using the extremely imbalanced distributions of the Harris Interactive respondents. Model 2 ought to perform well, since it uses the full matrix of covariates in the adjustment. There is no clear association between the AIC of the propensity models and their error properties; the role of AIC in model building discussed earlier cannot be verified from this result.

Table 7.15. Least Square Mean of Percent Root Mean Square Error Reduction, Percent Bias Reduction and Percent Standard Error Increase by Propensity Score Adjustment Models and Calibration Adjustment Models (a)

                                       LS Mean
Effect                    p.rmse           p.bias                 p.se
Propensity Score Model (PMODEL)
  No Adjustment (P0)      26.8% P1,P2      40.4% P1,P2,P4,P5      75.6% P1,P2,P3,P4,P5
  Model 1 (P1)            46.7% P0         76.2% P0               117.6% P0,P5
  Model 2 (P2)            47.0% P0         71.6% P0               120.6% P0,P5
  Model 3 (P3)            41.9%            62.2%                  113.3% P0
  Model 4 (P4)            42.7%            67.0% P0               110.1% P0
  Model 5 (P5)            43.8%            69.2% P0               104.3% P0
Calibration Models (CMODEL)
  No Adjustment (C0)      31.8% C1,C2      41.8% C1,C2            58.9% C1,C2
  Calibration 1 (C1)      44.8% C0         74.2% C0               133.4% C0
  Calibration 2 (C2)      47.8% C0         77.4% C0               128.4% C0
a. LS Means by interactions are excluded from the table, since there is little difference across the 18 different combinations.
Note: Superscripts indicate statistically different means at p = 0.1.

Of the two calibration methods, the second one, which uses estimated population general health status as well as known population demographic characteristics, seems to benefit the error structure more than using only known values. One notable finding from the table above is that Calibration 2 shows a larger decrease in bias and a smaller increase in standard error than Calibration 1, although the differences are not statistically significant. This implies that a good calibration method may achieve bias reduction at a smaller cost in variability.

7.3.5 Variance Estimation

7.3.5.1 Variance Estimation for Propensity Score Adjustment

There is no clear approach for deriving variance estimates when propensity score adjustment weights are applied. One method that commercial statistical software, such as SAS, may use is the following estimator:

$v_{naive}\left(\bar{y}_{W.PSA}\right) = \left(1 - f_W\right)\dfrac{n_W}{n_W - 1}\sum_{c=1}^{C}\sum_{j \in s_c}\left(\dfrac{d_j^{W.PSA} y_j}{\hat{N}} - \dfrac{1}{n_W}\sum_{c=1}^{C}\sum_{j \in s_c}\dfrac{d_j^{W.PSA} y_j}{\hat{N}}\right)^2, \quad (7.11)$

where $d_j^{W.PSA}$ is the weight derived from propensity score adjustment in (6.2), $s_c$ is the set of Web sample units in propensity score subclass $c$ $(c = 1, \ldots, C)$, $f_W = n_W/N$ is the sampling fraction, and $\hat{N} = \sum_{c=1}^{C}\sum_{j \in s_c} d_j^{W.PSA}$. However, this is a naive approach, since the estimator does not account for the complexity of the multiple weights in $d_j^{W.PSA} = f_c d_j^W$, where $f_c$ is the PSA factor and $d_j^W$ is the base design weight for unit $j$. If the weights also reflected a nonresponse adjustment, which has not been incorporated in this study, they would be even more complicated. Thus, naively applying (7.11) may give poor results.

Table 7.16 shows estimated and empirical standard errors for the estimators with propensity score adjustment but without calibration adjustment. It allows a comparison between the se estimates from (7.11) (v.naive) and the simulation se from (7.8) (v.sim) through the ratio of the two. The naive estimator tends to overestimate the actual variability, although the degree of overestimation is not extreme. The tendency is worst for y2, where the naive se estimates are at least 12% larger than the actual se. This echoes the finding in Valliant (2004), which showed that the naive estimator understates efficiency when calculating variances of estimates adjusted by multiple weights.

Table 7.16. Estimated Standard Error and Simulation Standard Error of Propensity Score Adjusted Web Sample Estimates

                        y1: HBP                       y2: SMOKE                     y3: ACT
Propensity                       Ratio                         Ratio                         Ratio
Score Model   v.naive  v.sim    (naive/sim)  v.naive  v.sim   (naive/sim)  v.naive  v.sim   (naive/sim)
Model 1       0.0231   0.0247   93.8%        0.0264   0.0226  116.8%       0.0239   0.0211  113.1%
Model 2       0.0219   0.0215   101.9%       0.0262   0.0222  118.2%       0.0232   0.0225  103.5%
Model 3       0.0207   0.0206   100.1%       0.0262   0.0217  120.7%       0.0230   0.0211  109.0%
Model 4       0.0151   0.0152   99.3%        0.0176   0.0154  114.0%       0.0154   0.0163  94.7%
Model 5       0.0146   0.0140   104.6%       0.0166   0.0148  112.5%       0.0150   0.0147  102.2%

7.3.5.2 Variance Estimation for Calibration Adjustment

Two variance estimation approaches are examined for the case when the calibration adjustment is added to the propensity score adjustment. The first follows the naive approach in (7.11), with $d_j^{W.PSA}$ replaced by the calibration weight $w_j$ from Section 6.3:

$v_{naive}\left(\bar{y}_W^A\right) = \left(1 - f_W\right)\dfrac{n_W}{n_W - 1}\sum_{j \in s_W}\left(\dfrac{w_j y_j}{\hat{N}} - \dfrac{1}{n_W}\sum_{j \in s_W}\dfrac{w_j y_j}{\hat{N}}\right)^2. \quad (7.12)$
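In R, this naive estimator amounts to treating the final adjusted weight as a single weight. A minimal sketch, with hypothetical vectors w (final weights) and y (responses) for the n_W Web respondents and the population size N; compare the v.naive computation in newcal.fcn, Appendix 1.7:

# Naive variance (7.12): the multiply adjusted weight w is treated as
# if it were a simple single weight (names are hypothetical).
v.naive <- function(y, w, N) {
    n <- length(y)
    z <- w * y / sum(w)   # contributions normalized by N-hat = sum(w)
    (1 - n/N) * (n/(n - 1)) * sum((z - mean(z))^2)
}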
As mentioned above, this squared residual method is the same as what commercially available software packages typically use for variance estimation. The second method, which originates from Deville and Särndal (1992), uses the following variance estimator, modified from the asymptotic variance estimator of the GREG for the population total $t_y$:

$v_{ds}\left(\hat{t}_y\right) = \sum_{i \in s}\sum_{j \in s}\dfrac{\Delta_{ij}}{\pi_{ij}}\left(w_i e_i\right)\left(w_j e_j\right), \quad (7.13)$

where $i$ and $j$ denote units in the sample; $\Delta_{ij} = \pi_{ij} - \pi_i \pi_j$; $\pi_i$ and $\pi_j$ are the inclusion probabilities of units $i$ and $j$; $\pi_{ij}$ is the joint inclusion probability of the two units; and $e_i$ is the sample-based residual defined as $e_i = y_i - \mathbf{z}_i'\hat{\mathbf{B}}_{ws}$, where $\hat{\mathbf{B}}_{ws}$ is the regression slope estimate computed as in (6.15) and (6.16). Since we use Poisson sampling to draw the Web samples, $\pi_{ij} = \pi_i\pi_j$ for $i \neq j$, and the variance estimator (7.13) simplifies to

$v_{ds}\left(\hat{t}_y\right) = \sum_{i \in s_W}\left(1 - \pi_i\right)\left(w_i e_i\right)^2. \quad (7.14)$

As we are estimating the population mean and the samples are treated as if drawn with replacement, (7.14) is changed to obtain

$v_{ds}\left(\bar{y}_W^A\right) = \left(1 - f_W\right)\dfrac{n_W}{n_W - 1}\sum_{j \in s_W}\left(\dfrac{w_j e_j}{\hat{N}}\right)^2. \quad (7.15)$

Figure 7.7. Standard Error of Adjusted Web Sample Estimates by Different Adjustment Method Combinations (panels: y1: HBP, y2: SMOKE, y3: ACT; series: v.ds, v.naive, v.sim by adjustment combination 1.1 through 5.2)

Standard error estimates using (7.12) and (7.15) are computed in the simulations as v.naive and v.ds in newcal.fcn, shown in Appendix 1.7. The resulting statistics are compared to the simulation standard error for the estimators using both propensity score and calibration adjustment in Figure 7.7. As shown in Valliant (2004) and in the previous discussion, the naive approach overestimates the variability of the survey estimates, presenting them as if they were far less efficient than they are. The estimator suggested by Deville and Särndal (1992) appears to estimate the actual variance reasonably well. Although it tends to underestimate the variability, the degree of underestimation is much smaller than the degree of overestimation in the naive approach.

The standard errors of the adjusted Web sample estimates are plotted against the respective bias reduction in the estimated mean in Figure 7.8. Over the range of bias reductions shown, v.naive is always a substantial overestimate, while v.ds is somewhat too small.

Figure 7.8. Relationship between Standard Error and Percent Bias Reduction of Adjusted Web Sample Estimates (panels: y1: HBP, y2: SMOKE, y3: ACT; series: v.ds, v.naive, v.sim; x-axis: % bias reduction)

The coverage rates of 95% confidence intervals in the simulation when v.ds and v.naive are used are presented in Table 7.17. It is striking that the underestimation by v.ds leads to consistent undercoverage by its confidence intervals. In contrast, the confidence intervals using v.naive have coverage rates that can be either more or less than 95%. The cases with the poorest coverage tend to be ones where the estimated mean is biased, so that the confidence intervals are not properly centered. For example, the standardized biases of the 3.1 and 3.2 estimates for HBP are about -9% in Table 7.11, and the corresponding v.ds coverage rates are 75.2% and 74.4%, respectively.
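Before turning to Table 7.17, note that the Deville-Särndal estimator (7.15) can be computed directly from the calibration quantities. A minimal R sketch mirroring the v.ds computation in newcal.fcn (Appendix 1.7), where X, w, y and N are hypothetical stand-ins for the calibration covariate matrix, the final weights, the responses, and the population size:

library(MASS)   # for ginv(), as in Appendix 1.7
# Deville-Sarndal variance (7.15) for a calibration-adjusted mean.
v.ds <- function(y, X, w, N) {
    n <- length(y)
    B <- ginv(t(X * w) %*% X) %*% (t(X * w) %*% y)   # weighted regression slope
    e <- y - X %*% B                                 # calibration residuals
    (1 - n/N) * (n/(n - 1)) * sum((w * e / sum(w))^2)
}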
Table 7.17. Coverage Rates of 95% Confidence Intervals by Standard Error Estimated with v.ds and v.naive

Adjustment       y1: HBP             y2: SMOKE           y3: ACT
Combination    v.ds     v.naive    v.ds     v.naive    v.ds     v.naive
1.1            90.7%    95.6%      92.5%    99.5%      74.5%    91.7%
1.2            92.7%    98.4%      90.8%    99.5%      81.6%    95.9%
2.1            85.8%    92.2%      92.4%    99.4%      88.3%    97.3%
2.2            86.3%    94.0%      91.6%    99.4%      88.6%    97.8%
3.1            75.2%    84.7%      92.4%    99.5%      88.7%    97.3%
3.2            74.4%    86.9%      91.7%    99.4%      88.7%    97.9%
4.1            89.5%    94.8%      92.1%    99.6%      88.5%    92.0%
4.2            81.8%    91.5%      91.8%    99.3%      89.1%    93.8%
5.1            89.8%    95.3%      92.3%    99.6%      92.6%    97.2%
5.2            83.8%    92.4%      92.1%    99.3%      91.5%    97.8%

One may argue that it is safe to use v.naive for calculating variances when calibration weights are applied, because it is conservative in stating the estimation efficiency. However, the degree of variance overestimation tends to be too large, especially for y2: confidence intervals based on v.naive standard errors cover the population value over 99% of the time in Table 7.17. Recall that the bias reduction is greatest for the adjustments applied to y2; for this variable, the coverage rates of v.ds are not as poor as for the other variables. It therefore seems sensible to examine the relationship between the bias reductions and the confidence interval coverage rates.

Figure 7.9. Relationship between 95% Confidence Interval Coverage and Percent Bias Reduction of Adjusted Web Sample Estimates (panels: y1: HBP, y2: SMOKE, y3: ACT; series: v.ds, v.naive; x-axis: % bias reduction)

The relationship between the bias reductions and the confidence interval coverage rates is depicted in Figure 7.9. Although not remarkable, there is a noticeable relationship between the degree of bias reduction and that of confidence interval coverage. When an adjustment is poor at reducing the bias, the coverage rates of confidence intervals computed with variances from both (7.11) and (7.14) are far below the nominal rate of 95%. In contrast, the coverage rates tend to converge to 95% as the biases are reduced to a larger degree. v.naive is not as good as v.ds when more than 85% of the bias is reduced.

7.3.6 Discussion

This case study examines the combination of propensity score adjustment and calibration adjustment for volunteer panel Web survey estimates. The results from the simulation show that using either of the two adjustments improves the accuracy of the sample estimates. The interaction of the two also has a significant effect on bias and rmse reduction. This confirms the effectiveness of each adjustment separately and of the combined adjustment. At the same time, these adjustments significantly increase the standard errors of the estimates, reaffirming the trade-off between bias reduction and variability increase. Nonetheless, the adjustments decrease the magnitude of the overall error substantially.

The examination of the separate propensity score models does not reveal clear implications. Models that include the variables used in the sample selection and all auxiliary variables perform better, but this does not yield substantive understanding of propensity score modeling strategies. As recommended by Rubin and Thomas (1996), it may be a sound approach to include all available covariates even if some are only remotely related to the study variables; a brief sketch of this strategy follows.
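As an illustration of this covariate-inclusive strategy, the following minimal R sketch fits a logistic propensity model to merged reference/Web data and forms five propensity subclasses. The data frame merged and the covariate names are hypothetical; Appendix 1.1 (psa.fcn) gives the function actually used in the study:

# Logistic propensity model on the merged reference + Web data, entering
# all available covariates as main effects. 'depend' is 1 for Web cases
# and 0 for reference cases (names hypothetical).
fit <- glm(depend ~ age + educ + gender + race + income + ghealth,
           family = binomial(link = "logit"), data = merged)
ps  <- fitted(fit)                      # estimated propensity scores
# Five subclasses from the quintiles of the pooled propensity scores.
qbin <- cut(ps, breaks = quantile(ps, probs = seq(0, 1, by = 0.2)),
            include.lowest = TRUE)
table(merged$depend, qbin)              # both groups should appear in every subclass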
The simulation on calibration adjustment suggests benefits from including substantive variables whose population figures are estimated from another, larger and more reliable survey. When the general health item is added in the calibration, the percent bias and rmse reductions are larger, and the percent se increase is smaller, than when it is excluded. This item is asked in large-scale national health surveys, such as the National Health Interview Survey, from which reliable population estimates are available. Therefore, the utilization of more substantive covariates is practical and effective in calibration adjustment.

The variance estimation methods tested in this study, unfortunately, do not provide conclusive guidelines. The naive variance estimator, which treats the multiply adjusted weight as a simple single weight, produces highly inflated figures. The estimator suggested by Deville and Särndal is a better approximation but tends to be too small. Consequently, confidence intervals based on the Deville and Särndal standard errors cover the population values at lower rates than the targeted rate, deceptively portraying the estimates as more efficient than they actually are. The naive approach, on the one hand, can be better for some estimates since it is conservative; on the other hand, confidence intervals based on the naive estimator can cover the population value close to 100% of the time when the nominal rate is 95%. On a positive note, when the applied adjustment results in a larger bias reduction, the Deville-Särndal estimator provides near-nominal coverage. Replication is another variance estimation option that has the potential to be quite effective when complicated weighting methods like propensity score and calibration adjustment are used. Variations of the jackknife or the bootstrap could be good choices for future investigation.

Chapter 8: Conclusion

With the advances in communication technology and the accompanying societal and cultural changes, Web surveys are here to stay. This research was carried out not to assert the scientific significance of Web surveys or to advocate embracing them, but to supplement what is lacking in current Web survey practice: Web surveys are popular but without proven methodological value. In order to make potentially biased Web survey estimates usable, statistical adjustments may be employed in the estimation process. However, traditional adjustment techniques were found to be limited in compensating for the biases in Web survey estimates. Based on that finding, this study adapts existing adjustment methods from the causal inference and survey statistics literature to volunteer panel Web survey data.

First, protocols for recruiting volunteers for Web surveys are not guaranteed to produce random samples; this is viewed as a selection bias in this study. Propensity score adjustment, used in causal inference with observational data, is a method that can remove or reduce the selection bias. We applied it to the Web survey setting to derive an adjustment weight for selection bias. A second, calibration adjustment is made to decrease the bias arising from the differences between the adjusted Web sample and the population. The study provides a mathematical presentation of the processes in these adjustments, which is absent in the existing research. The performance of the adjustments is diagnosed in simulations.
The two case studies carried out in this research convey the same clear implications about the adjustments: propensity score adjustment and calibration adjustment decrease the bias and root mean square error of volunteer panel Web survey estimates; however, these reductions are realized with an increase in the variance of the estimates. It is also found that the error reduction becomes larger when the propensity score adjustment is used in conjunction with the calibration adjustment.

The contention, made by some survey organizations, that nondemographic covariates are needed in propensity score models is not supported. The best method of covariate selection for propensity modeling appears to be the inclusion of all available variables in the adjustment. For calibration adjustment, the utilization of substantive variables whose population estimates are obtainable from larger surveys does improve the quality of the adjustments.

The application scope of these adjustments may exceed volunteer panel Web surveys. When the quality of data collection is doubtful, one may adopt the adjustments examined in this research to make better use of the data. Imagine that one has a survey data set but fears that respondents' self-selection may have introduced bias, and that there is a more reliable survey which has variables in common with one's own. Propensity score adjustment can take advantage of those overlapping covariates between the two surveys. The adjustment may then be tuned to a finer degree by calibration, using a smaller set of variables whose population figures are known or estimable from larger surveys. The survey estimates may become more usable after these adjustments.

When applying these adjustments to survey data, one should bear the following in mind. First, the adjustments are post hoc in nature. If feasible given the survey budget, it is important to improve the survey procedures so as to collect better data; it would be unwise to intentionally collect suboptimal data on the assumption that the adjustments will remove all biases. While the biases are reduced, they are not eliminated; the adjustment may not work under all circumstances, and, as shown, the bias reduction depends on the model used. Second, when the covariates used in the adjustments have missing data, propensity score adjustment becomes more difficult, because propensity scores cannot be assigned to units in the merged data set with missing covariate information. This research uses hot-deck imputation to avoid this problem. One may instead consider following the recommendation of D'Agostino and Rubin (2000) to condition the propensity score on both the observed values of the covariates and the observed missing-data indicators. Third, this research uses only the main effects of the covariates in the propensity models. One advantage of propensity score adjustment weighting over traditional weighting is the flexibility of the model formation; refining the propensity model to include higher-order interactions among the covariates, or to use more covariates, may provide clearer insight into variable selection. Fourth, the effectiveness of nondemographic covariates may not have been confirmed in this study, because the Web samples are drawn based on the distributions of demographic variables and these same variables are also included in the adjustment. One may consider another way of drawing Web samples, or conducting a series of Web surveys on substantive variables whose true values are either known or obtainable.
Fifth, the sample size of the reference survey matters. When the size is small, subclassification based on propensity scores may not be possible. Instead of conducting a small reference survey for each Web survey, one possibility is to adopt a large-scale national survey as the reference survey. Lastly, the two variance estimation methods examined in this research did not perform well enough to be recommended for general use. Alternative variance estimators are needed, since the final weights from the adjustments reflect multiple steps of adjustment. Variance estimation methods with replication are possible alternatives. It is hoped that these remarks provide directions for future research.

Appendices

1. R Code Used in the Study

1.1 psa.fcn

function (dframe, form, pfit, prnk, qbin, bins, trmt)
# Propensity Score Adjustment Weight Calculation
# - Calculates adjustment weights and drops the reference sample
#
# dframe: data frame
# form:   propensity score model, which needs to be defined beforehand
# pfit:   fitted propensity scores
# prnk:   propensity score rank
# qbin:   propensity score bin number factor created by PSdefine
# bins:   number of bins to be formed
# trmt:   treatment/control group variable
{
    if(missing(dframe) || !inherits(dframe, "data.frame"))
        stop("First argument to PSdefine must be a Data Frame name.")
    if(missing(form) || class(form) != "formula")
        stop("Second argument to PSdefine must be a formula.")
    trtm <- deparse(form[[2]])
    if(!is.element(trtm, dimnames(dframe)[[2]]))
        stop("Response variable in the PSdefine formula must be an existing treatment factor.")
    dframe[, trtm] <- as.factor(dframe[, trtm])
    last.glm <- glm(form, family = binomial(link = "logit"),
                    data = dframe, na.action = na.omit)
    df3 <- as.data.frame(fitted.values(last.glm))
    pfit <- deparse(substitute(pfit))
    dimnames(df3)[[2]] <- "pfit"
    prnk <- deparse(substitute(prnk))
    df3[, "prnk"] <- rank(df3[, "pfit"], na.last = T)
    qbin <- deparse(substitute(qbin))
    df3[, "qbin"] <- factor(1 + floor((bins * df3[, "prnk"]) /
                                      (1 + length(df3[, "prnk"]))))
    newdframe <- merge(dframe, df3, by.x = "row.names",
                       by.y = "row.names", all.x = T)
    # skip this sample if any propensity bin is empty for either group
    skip <- FALSE
    if (any(ftable(newdframe[, c(trtm, "qbin")])[] == 0)) skip <- TRUE
    if (!skip) {
        nwc <- table(newdframe[newdframe[, trtm] == 1, "qbin"])
        nrc <- table(newdframe[newdframe[, trtm] == 0, "qbin"])
        nw  <- length(newdframe[newdframe[, trtm] == 1, "qbin"])
        nr  <- length(newdframe[newdframe[, trtm] == 0, "qbin"])
        wgt <- (nrc * nw) / (nwc * nr)
        wgt <- as.vector(wgt[newdframe[, "qbin"]])
        allwgt <- data.matrix(cbind(newdframe, wgt))
        pwgt <- allwgt[, "basewgt"] * allwgt[, "wgt"]
        pwgt <- as.vector(pwgt)
        allwgt <- data.matrix(cbind(allwgt, pwgt))
        PSdframe <- data.frame(allwgt[allwgt[, trtm] == 2, ])
        # bc data.matrix recodes the treatment factor to 1/2
    }
}

1.2 cal.fcn

function (pop, sam, sampx, knownx, estimx, L, U,
          conv.crit = 0.01, max.steps = 10., min.B = 5)
{
    # Calibration: GLS weights using a restricted linear distance function
    #
    # pop:   population
    # sam:   sample
    # X:     matrix of auxiliary vars; n x p
    # X.pop: matrix of pop controls; p x 1
    # X.hat: vector of HT estimates of X.pop
    # a:     vector of base wgts (1/pi); n x 1
    # c:     vector of model vars (usually set to 1); p x 1
    # L:     lower bound on wgt ratio w/a
    # U:     upper bound on wgt ratio w/a
    X <- sampx(sam)
    X.pop <- knownx(pop)
    X.hat <- estimx(sam, "pwgt")
    a <- as.matrix(sam[, "pwgt"])
    p <- ncol(X)
    c.vec <- rep(1., length(X[, 1.]))
    lambda.old <- rep(0., p)
    # convergence check on lambda
    converged <- function(old, new, conv.crit) {
        check <- F
        D <- max(abs((old - new)/old))
        if (D < conv.crit) {
            check <- T
        }
        check
    }
    step.num <- 0.
    # compute weights iteratively
    repeat {
        step.num <- step.num + 1
        max.steps.reached <- step.num > max.steps
        lambda.x <- lambda.old %*% t(X)/c.vec
        sA <- (lambda.x < (L - 1.))
        sB <- (lambda.x >= (L - 1.)) & (lambda.x <= (U - 1.))
        sC <- (lambda.x > (U - 1.))
        if(sum(sB) < min.B)
            stop("Set sB too small, no. cases = ", sum(sB),
                 " No. of iteration steps used: ", step.num,
                 " where: ", sam, sampx, "\n")
        phi.sA <- phi.sB <- phi.sC <- 0.
        lambda.xsB <- lambda.old %*% t(X[sB, ])/c.vec[sB]
        Z.sB <- (a/c.vec)[sB] * X[sB, ]
        phi.prime <- t(Z.sB) %*% X[sB, ]
        if(sum(sA) != 0.) {
            if(length(a[sA]) == 1.) phi.sA <- (L - 1.) * a[sA] * X[sA, ]
            else phi.sA <- (L - 1.) * a[sA] %*% X[sA, ]
        }
        phi.sB <- lambda.old %*% t(Z.sB) %*% X[sB, ]
        if(sum(sC) != 0.) {
            if(length(a[sC]) == 1.) phi.sC <- (U - 1.) * a[sC] * X[sC, ]
            else phi.sC <- (U - 1.) * a[sC] %*% X[sC, ]
        }
        phi.sA <- as.vector(phi.sA)
        phi.sB <- as.vector(phi.sB)
        phi.sC <- as.vector(phi.sC)
        phi.s1 <- phi.sA + phi.sB + phi.sC
        phi.s2 <- as.matrix(phi.s1)
        phi.s3 <- t(phi.s2)
        lambda.new <- lambda.old - ginv(phi.prime) %*%
                      (t(phi.s3) + X.hat - X.pop)
        if(converged(lambda.old, lambda.new, conv.crit) | max.steps.reached) {
            cat("No. of iteration steps used:", step.num, "\n")
            break
        }
        lambda.old <- as.vector(lambda.new)
    }
    g.fcn <- rep(0., length(X[, 1.]))
    lambda.x <- as.vector(lambda.new) %*% t(X)/c.vec
    sA <- (lambda.x < (L - 1.))
    sB <- (lambda.x >= (L - 1.)) & (lambda.x <= (U - 1.))
    sC <- (lambda.x > (U - 1.))
    g.fcn[sA] <- L
    g.fcn[sB] <- 1. + lambda.x[sB]
    g.fcn[sC] <- U
    calwgt <- a * g.fcn
    cwgt <- as.vector(calwgt)
    calwgt <- data.matrix(cbind(sam, cwgt))
    caldframe <- data.frame(calwgt)
}

1.3 ref.sam

function (pop, n)
{
    # Select an srs as a reference sample
    # pop: population
    # n:   sample size
    N <- nrow(pop)
    sam <- sample(1:N, n, replace = F)
    dat <- pop[sam, ]
    basewgt <- dim(pop)[[1]]/dim(dat)[[1]]
    dat <- cbind(dat, basewgt)
}

1.4 pois.sam

function(subpop, pop, ph, str, n)
{
    # Select a stratified Poisson sample from pop of size Nh
    # subpop: subpopulation, e.g., the Web population
    # pop:    population, e.g., the entire GSS population
    # ph:     vector of proportions in strata that define rates of Web usage
    # str:    column of pop for stratum (can be name or number)
    # n:      desired expected total sample size
    h <- subpop[, str]
    N <- nrow(subpop)
    Nh <- table(subpop[, str])
    H <- length(Nh)
    u <- runif(N, min = 0, max = 1)
    if (any(is.na(h))) {
        stop("stratum var str missing for some cases. Processing stopped.\n")
    }
    if (sum(ph) != 1) {
        stop("sum(ph) != 1. Processing stopped.\n")
    }
    if (H != length(ph)) {
        stop("H != length(ph). Processing stopped.\n")
    }
    adjh <- n/sum(Nh * ph)
    ph <- ph * adjh
    ph.pop <- ph[h]
    sam <- (u < ph.pop)
    sam <- subpop[sam, ]
    basewgt <- dim(pop)[[1]]/dim(sam)[[1]]
    dat <- cbind(sam, basewgt)
}

1.5 psa.sim

function(pop, wpop, nr, nw, y, bw, pw, form1, form2, trmt, bin, seed, NoSams)
{
    #################################################
    # Propensity Score Adjustment Only
    #################################################
    # Estimation for "y"
    # pop:    population data set
    # wpop:   Web subpopulation
    # nr:     reference sample size
    # nw:     Web sample size
    # y:      variable of interest, e.g., "vote"
    # bw:     base weight
    # pw:     PSA weight
    # form1:  PSA model 1 defined previously
    # form2:  PSA model 2 defined previously
    # trmt:   treatment variable for PSA, "depend"
    # bin:    variable name for bins in PSA
    # NoSams: number of simulated samples
    #################################################
    set.seed(seed)
    out.est <- array(0, dim = c(2, 6, NoSams))
    cat("Begin", date(), "\n")
    for(s in 1:NoSams) {
        skip <- FALSE
        if (s %% 1 == 0) cat("s=", s, date(), "\n")
        #################################################
        # sample draw
        ref <- ref.sam(pop, nr)
        strat <- pois.sam(wpop, pop,
                 ph = c(0.11441573, 0.06110840, 0.10417682, 0.03347960,
                        0.12099789, 0.02267187, 0.09393792, 0.01227044,
                        0.07346010, 0.02754754, 0.06663416, 0.02007151,
                        0.11173411, 0.01649602, 0.10498944, 0.01600845),
                 "str", nw)
        harris <- pois.sam(wpop, pop,
                 ph = c(0.0203, 0.0164, 0.0085, 0.0060,
                        0.0245, 0.0048, 0.0170, 0.0024,
                        0.1328, 0.1337, 0.0758, 0.0909,
                        0.1558, 0.0458, 0.2082, 0.0571),
                 "str", nw)
        # basic estimates
        y.pop <- est(pop, y)
        y.pop <- rbind(y.pop, y.pop)      # one row per Web design (strat, harris)
        colnames(y.pop) <- "y.pop"
        y.wpop <- est(wpop, y)
        y.wpop <- rbind(y.wpop, y.wpop)
        colnames(y.wpop) <- "y.wpop"
        bp.wpop <- y.pop - y.wpop
        colnames(bp.wpop) <- "bp.wpop"
        y.R <- w.est(ref, y, bw)
        y.R <- rbind(y.R, y.R)
        colnames(y.R) <- "y.R"
        y.U.t <- w.est(strat, y, bw)
        y.U.h <- w.est(harris, y, bw)
        y.U <- rbind(y.U.t, y.U.h)
        colnames(y.U) <- "y.U"
        ##################################################
        # merge reference and web samples
        rt <- rbind(ref, strat)
        rh <- rbind(ref, harris)
        ##################################################
        # propensity score adjustment
        psaform1.t <- psa.fcn(rt, form1, pfit, prnk, qbin, bin, trmt)
        psaform1.h <- psa.fcn(rh, form1, pfit, prnk, qbin, bin, trmt)
        psaform2.t <- psa.fcn(rt, form2, pfit, prnk, qbin, bin, trmt)
        psaform2.h <- psa.fcn(rh, form2, pfit, prnk, qbin, bin, trmt)
        # adjusted estimates
        y.pform1.t <- w.est(psaform1.t, y, pw)
        y.pform1.h <- w.est(psaform1.h, y, pw)
        y.pform1 <- rbind(y.pform1.t, y.pform1.h)
        colnames(y.pform1) <- "y.pform1"
        y.pform2.t <- w.est(psaform2.t, y, pw)
        y.pform2.h <- w.est(psaform2.h, y, pw)
        y.pform2 <- rbind(y.pform2.t, y.pform2.h)
        colnames(y.pform2) <- "y.pform2"
        #########################################################
        # bind all estimates into y.est
        y.est <- cbind(y.pop, y.wpop, y.R, y.U, y.pform1, y.pform2)
        dimnames(y.est)[[1]][1] <- "strat"
        dimnames(y.est)[[1]][2] <- "harris"
        out.est[, , s] <- y.est
        dimnames(out.est) <- list(dimnames(y.est)[[1]],
                                  dimnames(y.est)[[2]], NULL)
    } # end of s loop
    cat("end", date(), "\n")
    list("estimates" = out.est)
}

1.6 cal.sim

function(pop, wpop, nr, nw, y, bw, pw, cw, form1, form2, samp1, known1,
         estim1, samp2, known2, estim2, trmt, bin, seed, NoSams)
{
    #################################################
    # Estimation for "y"
    # pop:    population data set
    # wpop:   Web subpopulation
    # nr:     reference sample size
    # nw:     Web sample size
    # y:      variable of interest, e.g., "vote"
    # bw:     base weight
    # pw:     PSA weight
    # cw:     calibration weight
    # form1, form2: PSA models defined previously
    # samp1:  function for obtaining the calibration covariate matrix
    #         from the sample, for calibration 1
    # known1: function for obtaining population figures of the
    #         calibration covariates, for calibration 1
    # estim1: function for obtaining sample estimates of the
    #         calibration covariates, for calibration 1
    # samp2:  function for obtaining the calibration covariate matrix
    #         from the sample, for calibration 2
    # known2: function for obtaining population figures of the
    #         calibration covariates, for calibration 2
    # estim2: function for obtaining sample estimates of the
    #         calibration covariates, for calibration 2
    # trmt:   treatment variable for PSA, "depend"
    # bin:    variable name for bins in PSA
    # NoSams: number of simulated samples
    # (the calibration bounds L and U are taken from the calling environment)
    #################################################
    set.seed(seed)
    out.est <- array(0, dim = c(2, 14, NoSams))
    cat("Begin", date(), "\n")
    for(s in 1:NoSams) {
        skip <- FALSE
        if (s %% 1 == 0) cat("s=", s, date(), "\n")
        #################################################
        # sample draw
        ref <- ref.sam(pop, nr)
        strat <- pois.sam(wpop, pop,
                 ph = c(0.11441573, 0.06110840, 0.10417682, 0.03347960,
                        0.12099789, 0.02267187, 0.09393792, 0.01227044,
                        0.07346010, 0.02754754, 0.06663416, 0.02007151,
                        0.11173411, 0.01649602, 0.10498944, 0.01600845),
                 "str", nw)
        harris <- pois.sam(wpop, pop,
                 ph = c(0.0203, 0.0164, 0.0085, 0.0060,
                        0.0245, 0.0048, 0.0170, 0.0024,
                        0.1328, 0.1337, 0.0758, 0.0909,
                        0.1558, 0.0458, 0.2082, 0.0571),
                 "str", nw)
        # basic estimates
        y.pop <- est(pop, y)
        y.pop <- rbind(y.pop, y.pop)      # one row per Web design (strat, harris)
        colnames(y.pop) <- "y.pop"
        y.wpop <- est(wpop, y)
        y.wpop <- rbind(y.wpop, y.wpop)
        colnames(y.wpop) <- "y.wpop"
        y.R.n <- w.est(ref, y, bw)
        y.R.n <- rbind(y.R.n, y.R.n)
        colnames(y.R.n) <- "y.R.n"
        y.U.n.t <- w.est(strat, y, bw)
        y.U.n.h <- w.est(harris, y, bw)
        y.U.n <- rbind(y.U.n.t, y.U.n.h)
        colnames(y.U.n) <- "y.U.n"
        ##################################################
        # merge reference and web sample
        rt <- rbind(ref, strat)
        rh <- rbind(ref, harris)
        ##################################################
        # propensity score adjustment only
        psaform1.t <- psa.fcn(rt, form1, pfit, prnk, qbin, bin, trmt)
        psaform1.h <- psa.fcn(rh, form1, pfit, prnk, qbin, bin, trmt)
        psaform2.t <- psa.fcn(rt, form2, pfit, prnk, qbin, bin, trmt)
        psaform2.h <- psa.fcn(rh, form2, pfit, prnk, qbin, bin, trmt)
        # adjusted estimates
        y.pform1.n.t <- w.est(psaform1.t, y, pw)
        y.pform1.n.h <- w.est(psaform1.h, y, pw)
        y.p1.n <- rbind(y.pform1.n.t, y.pform1.n.h)
        colnames(y.p1.n) <- "y.p1.n"
        y.pform2.n.t <- w.est(psaform2.t, y, pw)
        y.pform2.n.h <- w.est(psaform2.h, y, pw)
        y.p2.n <- rbind(y.pform2.n.t, y.pform2.n.h)
        colnames(y.p2.n) <- "y.p2.n"
        ####################################################
        # calibration adjustment
        psaform1.cal1.t <- cal.fcn(pop, psaform1.t, samp1, known1, estim1, L, U)
        psaform1.cal1.h <- cal.fcn(pop, psaform1.h, samp1, known1, estim1, L, U)
        psaform1.cal2.t <- cal.fcn(pop, psaform1.t, samp2, known2, estim2, L, U)
        psaform1.cal2.h <- cal.fcn(pop, psaform1.h, samp2, known2, estim2, L, U)
        psaform2.cal1.t <- cal.fcn(pop, psaform2.t, samp1, known1, estim1, L, U)
        psaform2.cal1.h <- cal.fcn(pop, psaform2.h, samp1, known1, estim1, L, U)
        psaform2.cal2.t <- cal.fcn(pop, psaform2.t, samp2, known2, estim2, L, U)
        psaform2.cal2.h <- cal.fcn(pop, psaform2.h, samp2, known2, estim2, L, U)
        psano.cal1.t <- cal.fcn(pop, strat, samp1, known1, estim1, L, U)
        psano.cal1.h <- cal.fcn(pop, harris, samp1, known1, estim1, L, U)
        psano.cal2.t <- cal.fcn(pop, strat, samp2, known2, estim2, L, U)
        psano.cal2.h <- cal.fcn(pop, harris, samp2, known2, estim2, L, U)
        psano.cal1.R <- cal.fcn(pop, ref, samp1, known1, estim1, L, U)
        psano.cal2.R <- cal.fcn(pop, ref, samp2, known2, estim2, L, U)
        # adjusted estimates
        y.p1.c1.t <- w.est(psaform1.cal1.t, y, cw)
        y.p1.c1.h <- w.est(psaform1.cal1.h, y, cw)
        y.p1.c2.t <- w.est(psaform1.cal2.t, y, cw)
        y.p1.c2.h <- w.est(psaform1.cal2.h, y, cw)
        y.p2.c1.t <- w.est(psaform2.cal1.t, y, cw)
        y.p2.c1.h <- w.est(psaform2.cal1.h, y, cw)
        y.p2.c2.t <- w.est(psaform2.cal2.t, y, cw)
        y.p2.c2.h <- w.est(psaform2.cal2.h, y, cw)
        y.n.c1.t <- w.est(psano.cal1.t, y, cw)
        y.n.c1.h <- w.est(psano.cal1.h, y, cw)
        y.n.c2.t <- w.est(psano.cal2.t, y, cw)
        y.n.c2.h <- w.est(psano.cal2.h, y, cw)
        y.R.c1 <- w.est(psano.cal1.R, y, cw)   # was psano.cal2.R; corrected
        y.R.c2 <- w.est(psano.cal2.R, y, cw)
        y.p1.c1 <- rbind(y.p1.c1.t, y.p1.c1.h)
        y.p1.c2 <- rbind(y.p1.c2.t, y.p1.c2.h)
        y.p2.c1 <- rbind(y.p2.c1.t, y.p2.c1.h)
        y.p2.c2 <- rbind(y.p2.c2.t, y.p2.c2.h)
        y.U.c1 <- rbind(y.n.c1.t, y.n.c1.h)
        y.U.c2 <- rbind(y.n.c2.t, y.n.c2.h)
        y.R.c1 <- rbind(y.R.c1, y.R.c1)
        y.R.c2 <- rbind(y.R.c2, y.R.c2)
        colnames(y.p1.c1) <- "y.p1.c1"
        colnames(y.p1.c2) <- "y.p1.c2"
        colnames(y.p2.c1) <- "y.p2.c1"
        colnames(y.p2.c2) <- "y.p2.c2"
        colnames(y.U.c1) <- "y.U.c1"
        colnames(y.U.c2) <- "y.U.c2"
        colnames(y.R.c1) <- "y.R.c1"
        colnames(y.R.c2) <- "y.R.c2"
        #########################################################
        # bind all estimates into y.est
        y.est <- cbind(y.pop, y.wpop, y.R.n, y.R.c1, y.R.c2, y.U.n,
                       y.U.c1, y.U.c2, y.p1.n, y.p1.c1, y.p1.c2,
                       y.p2.n, y.p2.c1, y.p2.c2)
        dimnames(y.est)[[1]][1] <- "strat"
        dimnames(y.est)[[1]][2] <- "harris"
        out.est[, , s] <- y.est
        dimnames(out.est) <- list(dimnames(y.est)[[1]],
                                  dimnames(y.est)[[2]], NULL)
    } # end of s loop
    cat("end", date(), "\n")
    list("estimates" = out.est)
}

1.7 newcal.fcn

function (pop, sam, sampx, knownx, estimx, L, U, conv.crit,
          max.steps, min.B, y)
{
    ####################################################
    # Calibration and variance estimation
    ####################################################
    # pop:       population
    # sam:       sample
    # X:         matrix of auxiliary vars; n x p
    # X.pop:     matrix of pop controls; p x 1
    # X.hat:     vector of HT estimates of X.pop
    # a:         vector of base wgts (1/pi); n x 1
    # c:         vector of model vars (usually set to 1); p x 1
    # L:         lower bound on wgt ratio w/a
    # U:         upper bound on wgt ratio w/a
    # conv.crit: convergence criterion
    # max.steps: maximum number of calibration iterations
    # y:         variable of interest, e.g., "HBP"
    ####################################################
    X <- sampx(sam)
    X.pop <- knownx(pop)
    X.hat <- estimx(sam, "pwgt")
    a <- as.matrix(sam[, "pwgt"])
    p <- ncol(X)
    c.vec <- rep(1., length(X[, 1.]))
    lambda.old <- rep(0., p)
    # convergence check on lambda
    converged <- function(old, new, conv.crit) {
        check <- F
        D <- max(abs((old - new)/old))
        if (D < conv.crit) {
            check <- T
        }
        check
    }
    step.num <- 0.
    # compute weights iteratively
    repeat {
        step.num <- step.num + 1
        max.steps.reached <- step.num > max.steps
        lambda.x <- lambda.old %*% t(X)/c.vec
        sA <- (lambda.x < (L - 1.))
        sB <- (lambda.x >= (L - 1.)) & (lambda.x <= (U - 1.))
        sC <- (lambda.x > (U - 1.))
        if(sum(sB) < min.B)
            stop("Set sB too small, no. cases = ", sum(sB),
                 " No. of iteration steps used: ", step.num,
                 " where: ", sam, sampx, "\n")
        phi.sA <- phi.sB <- phi.sC <- 0.
        lambda.xsB <- lambda.old %*% t(X[sB, ])/c.vec[sB]
        Z.sB <- (a/c.vec)[sB] * X[sB, ]
        phi.prime <- t(Z.sB) %*% X[sB, ]
        if(sum(sA) != 0.) {
            if(length(a[sA]) == 1.) phi.sA <- (L - 1.) * a[sA] * X[sA, ]
            else phi.sA <- (L - 1.) * a[sA] %*% X[sA, ]
        }
        phi.sB <- lambda.old %*% t(Z.sB) %*% X[sB, ]
        if(sum(sC) != 0.) {
            if(length(a[sC]) == 1.) phi.sC <- (U - 1.) * a[sC] * X[sC, ]
            else phi.sC <- (U - 1.) * a[sC] %*% X[sC, ]
        }
        phi.sA <- as.vector(phi.sA)
        phi.sB <- as.vector(phi.sB)
        phi.sC <- as.vector(phi.sC)
        phi.s1 <- phi.sA + phi.sB + phi.sC
        phi.s2 <- as.matrix(phi.s1)
        phi.s3 <- t(phi.s2)
        lambda.new <- lambda.old - ginv(phi.prime) %*%
                      (t(phi.s3) + X.hat - X.pop)
        if(converged(lambda.old, lambda.new, conv.crit) | max.steps.reached) {
            cat("No. of iteration steps used:", step.num, "\n")
            break
        }
        lambda.old <- as.vector(lambda.new)
    }
    cat("Max relative change in lambda at last step: ",
        max(abs((lambda.old - lambda.new)/lambda.old)), "\n")
    g.fcn <- rep(0., length(X[, 1.]))
    lambda.x <- as.vector(lambda.new) %*% t(X)/c.vec
    sA <- (lambda.x < (L - 1.))
    sB <- (lambda.x >= (L - 1.)) & (lambda.x <= (U - 1.))
    sC <- (lambda.x > (U - 1.))
    g.fcn[sA] <- L
    g.fcn[sB] <- 1. + lambda.x[sB]
    g.fcn[sC] <- U
    calwgt <- a * g.fcn
    cwgt <- as.vector(calwgt)
    calwgt <- data.matrix(cbind(sam, cwgt))
    calwgt <- data.frame(calwgt)
    ##################################################
    # Variance estimation
    ##################################################
    # Deville-Sarndal variance
    Y <- as.matrix(sam[, y])
    sampsize <- dim(sam)[[1]]
    popsize <- dim(pop)[[1]]
    A <- t(X * cwgt) %*% X
    B <- ginv(A) %*% t(X * cwgt) %*% Y
    e <- Y - X %*% B
    nwgt <- cwgt/sum(cwgt)
    v.ds <- (1 - sampsize/popsize) * (sampsize/(sampsize - 1)) *
            sum((nwgt * e)^2)
    # Naive variance
    m.y <- mean(nwgt * Y)
    v.naive <- (1 - sampsize/popsize) * (sampsize/(sampsize - 1)) *
               sum((nwgt * Y - m.y)^2)
    newcaldframe <- list("calwgt" = calwgt, "v.ds" = v.ds, "v.naive" = v.naive)
}

2. GSS Propensity Score Model Specification in R
2.1 y_blks: Warm Feelings towards Blacks

D1  depend ~ age + educ + newsize + hhldsize + income + as.factor(race) +
            as.factor(gender) + as.factor(married) + as.factor(region)
D2  depend ~ age + educ + as.factor(race) + as.factor(gender) + as.factor(region)
D3  depend ~ newsize + hhldsize + income + as.factor(married)
A1  depend ~ age + educ + newsize + hhldsize + income + as.factor(race) +
            as.factor(gender) + as.factor(married) + as.factor(region) +
            class + as.factor(work) + as.factor(party) + as.factor(religion) + ethnofit
A2  depend ~ age + educ + as.factor(race) + as.factor(gender) +
            as.factor(region) + ethnofit
A3  depend ~ newsize + hhldsize + income + as.factor(married) + class +
            as.factor(work) + as.factor(party) + as.factor(religion)
N1  depend ~ class + as.factor(work) + as.factor(party) +
            as.factor(religion) + ethnofit
N2  depend ~ ethnofit
N3  depend ~ class + as.factor(work) + as.factor(party) + as.factor(religion)
4   depend ~ age + educ + newsize + hhldsize + income98 + as.factor(race) +
            as.factor(gender) + as.factor(married) + as.factor(region) + ethnofit

2.2 y_vote: Voting Participation in the 2000 Presidential Election

A1  depend (22) ~ age + educ + newsize + hhldsize + income + as.factor(race) +
            as.factor(gender) + as.factor(married) + as.factor(region) + class +
            as.factor(work) + as.factor(party) + as.factor(religion)
A2  depend ~ age + educ + income + as.factor(race) + as.factor(married) +
            class + as.factor(party)
A3  depend ~ newsize + hhldsize + as.factor(gender) + as.factor(region) +
            as.factor(work) + as.factor(religion)
D1  depend ~ age + educ + newsize + hhldsize + income + as.factor(race) +
            as.factor(gender) + as.factor(married) + as.factor(region)
D2  depend ~ age + educ + income + as.factor(race) + as.factor(married)
D3  depend ~ newsize + hhldsize + as.factor(gender) + as.factor(region)
N1  depend ~ class + as.factor(work) + as.factor(party) + as.factor(religion)
N2  depend ~ class + as.factor(party)
N3  depend ~ as.factor(work) + as.factor(religion)
4   depend ~ age + educ + newsize + hhldsize + income + as.factor(race) +
            as.factor(gender) + as.factor(married) + as.factor(region) +
            class + as.factor(party)

22. depend: an indicator for the status of each unit, i.e., whether it is included in the Web sample or the reference sample; the same as g in Chapter 6.
3. Reference Sample and Unadjusted and Propensity Score Adjusted Web Sample Estimates for y_blks and y_vote

y_blks, Web sample s.WST:
Estimator  estimate  bias     p.bias   rmsd     p.rmsd   se
y.R        0.612                                         0.034
y.U        0.636     0.024             0.0448   0.0%     0.016
y.D1       0.623     0.012    52.4%    0.0405   9.6%     0.022
y.D2       0.622     0.010    57.1%    0.0398   11.2%    0.021
y.D3       0.637     0.025    -4.7%    0.0457   -2.0%    0.018
y.N1       0.620     0.008    65.7%    0.0388   13.5%    0.020
y.N2       0.622     0.010    58.6%    0.0386   13.9%    0.018
y.N3       0.632     0.020    17.5%    0.0430   4.1%     0.018
y.A1       0.616     0.004    82.0%    0.0390   13.1%    0.023
y.A2       0.617     0.005    79.4%    0.0387   13.6%    0.022
y.A3       0.636     0.024    1.7%     0.0451   -0.5%    0.019
y.4        0.619     0.007    71.3%    0.0392   12.5%    0.023

y_blks, Web sample s.WHI:
Estimator  estimate  bias     p.bias   rmsd     p.rmsd   se
y.R        0.612                                         0.034
y.U        0.675     0.064             0.074    0.0%     0.016
y.D1       0.638     0.026    58.6%    0.052    29.4%    0.032
y.D2       0.645     0.034    47.0%    0.056    24.6%    0.031
y.D3       0.675     0.063    0.4%     0.074    -0.8%    0.021
y.N1       0.657     0.046    28.3%    0.060    18.5%    0.022
y.N2       0.658     0.046    27.3%    0.059    19.5%    0.017
y.N3       0.672     0.061    4.8%     0.072    2.3%     0.021
y.A1       0.629     0.017    72.6%    0.048    35.5%    0.032
y.A2       0.642     0.030    52.2%    0.054    27.5%    0.032
y.A3       0.669     0.057    10.0%    0.070    5.8%     0.021
y.4        0.635     0.023    63.9%    0.050    31.8%    0.032

y_vote, Web sample s.WST:
Estimator  estimate  bias     p.bias   rmsd     p.rmsd   se
y.R        0.650                                         0.034
y.U        0.715     0.065             0.075    0.0%     0.015
y.D1       0.709     0.059    9.7%     0.069    8.3%     0.022
y.D2       0.711     0.062    5.4%     0.071    5.2%     0.021
y.D3       0.720     0.070    -7.1%    0.079    -5.7%    0.016
y.N1       0.695     0.045    30.5%    0.057    23.4%    0.019
y.N2       0.694     0.044    32.0%    0.057    24.4%    0.019
y.N3       0.719     0.069    -5.6%    0.078    -4.3%    0.016
y.A1       0.702     0.052    19.9%    0.063    16.2%    0.024
y.A2       0.706     0.057    13.5%    0.066    11.9%    0.023
y.A3       0.724     0.074    -13.4%   0.083    -10.5%   0.017
y.4        0.703     0.053    18.8%    0.063    15.5%    0.024

y_vote, Web sample s.WHI:
Estimator  estimate  bias     p.bias   rmsd     p.rmsd   se
y.R        0.650                                         0.034
y.U        0.817     0.167             0.171    0.0%     0.013
y.D1       0.724     0.074    55.7%    0.086    50.0%    0.031
y.D2       0.721     0.072    57.2%    0.084    51.2%    0.032
y.D3       0.814     0.164    1.7%     0.169    1.6%     0.014
y.N1       0.771     0.121    27.5%    0.127    26.1%    0.020
y.N2       0.764     0.115    31.4%    0.121    29.6%    0.019
y.N3       0.821     0.172    -2.6%    0.175    -2.4%    0.013
y.A1       0.718     0.069    58.9%    0.081    52.7%    0.032
y.A2       0.716     0.066    60.4%    0.079    54.1%    0.032
y.A3       0.818     0.169    -0.7%    0.172    -0.6%    0.014
y.4        0.718     0.068    59.2%    0.080    53.1%    0.032

4. Relationship between the Distributions of the Different Web Sample Estimates and the Reference Sample Estimates for y_blks and y_vote (figures)

5. Distributions of the Web Estimates by Different Propensity Score Adjustments (figures)

6. BRFSS Propensity Score Model Specification in R
Model 1  depend ~ age + educ + as.factor(gender) + as.factor(race)

Model 2  depend ~ ghealth + as.factor(coverage) + as.factor(doctor) +
         as.factor(cprevent) + as.factor(phyact) + as.factor(diabete) +
         as.factor(cholest) + as.factor(losewgt) + as.factor(wgtadv) +
         as.factor(asthma) + as.factor(flushot) + as.factor(pneumon) +
         as.factor(sunburn) + age + educ + income + weight + numphone +
         as.factor(gender) + as.factor(jointsym) + as.factor(limitact) +
         as.factor(modact) + as.factor(army) + as.factor(cellphon) +
         alcohol + hhsize + as.factor(work) + as.factor(marry) +
         as.factor(race) + veggie

Model 3  depend ~ ghealth + as.factor(doctor) + as.factor(cprevent) +
         as.factor(diabete) + as.factor(losewgt) + as.factor(sunburn) +
         educ + income + as.factor(gender) + as.factor(limitact) +
         as.factor(army) + as.factor(cellphon) + as.factor(race)

Model 4  depend ~ ghealth + as.factor(coverage) + as.factor(doctor) +
         as.factor(cprevent) + as.factor(phyact) + as.factor(diabete) +
         as.factor(cholest) + as.factor(losewgt) + as.factor(wgtadv) +
         as.factor(asthma) + as.factor(flushot) + as.factor(pneumon) +
         as.factor(sunburn) + income + weight + numphone +
         as.factor(jointsym) + as.factor(limitact) + as.factor(modact) +
         as.factor(army) + as.factor(cellphon) + alcohol + hhsize +
         as.factor(work) + as.factor(marry) + veggie

Model 5  depend ~ ghealth + as.factor(doctor) + as.factor(cprevent) +
         as.factor(diabete) + as.factor(losewgt) + as.factor(sunburn) +
         income + as.factor(limitact) + as.factor(army) + as.factor(cellphon)

Bibliography

Angrist, J.D., Imbens, G.W., and Rubin, D.B. (1996). Identification of Causal Effects Using Instrumental Variables. Journal of the American Statistical Association, 91(434), 444-472.
Benjamin, D.J. (2003). Does 401(k) Eligibility Increase Saving? Evidence from Propensity Score Subclassification. Journal of Public Economics, 87(5-6), 1259-1290.
Berk, R.A., and Newton, P.J. (1985). Does Arrest Really Deter Wife Battery? An Effort to Replicate the Findings of the Minneapolis Spouse Abuse Experiment. American Sociological Review, 50, 253-262.
Burnett, R., and Marshall, P.D. (2003). Web Theory: An Introduction. New York, NY: Routledge.
Casady, R.J., and Lepkowski, J.M. (1993). Stratified Telephone Survey Designs. Survey Methodology, 19(1), 103-113.
Cochran, W.G. (1968). The Effectiveness of Adjustment by Subclassification in Removing Bias in Observational Studies. Biometrics, 24, 295-313.
Cochran, W.G., Mosteller, F., and Tukey, J.W. (1954). Statistical Problems of the Kinsey Report (on Sexual Behavior in the Human Male). Washington, D.C.: American Statistical Association.
Cook, E.F., and Goldman, L. (1989). Performance of Tests of Significance Based on Stratification by a Multivariate Confounder Score or by a Propensity Score. Journal of Clinical Epidemiology, 42, 317-324.
Couper, M.P. (2000). Web Surveys: A Review of Issues and Approaches. Public Opinion Quarterly, 64, 464-494.
Couper, M.P. (2001). The Promises and Perils of Web Surveys. In The Challenge of the Internet, ed. A. Westlake et al. London, UK: Association for Survey Computing.
Couper, M.P. (2002). Web Survey Design. Unpublished course note.
Couper, M.P., and Tourangeau, R. (2002). Web-Based Survey Applications: A Review of Opportunities and Issues for NCHS. Report submitted to the National Center for Health Statistics.
Crown, W.H. (2001). Antidepressant Selection and Economic Outcome: A Review of Methods and Studies from Clinical Practice. The British Journal of Psychiatry, 179, s18-s22.
Curtin, R., Presser, S., and Singer, E. (2000). The Effects of Response Rate Changes on the Index of Consumer Sentiment. Public Opinion Quarterly, 64, 413-428.
Czajka, J.L., Hirabayashi, S.M., Little, R.J.A., and Rubin, D.B. (1992). Projecting from Advance Data Using Propensity Modeling: An Application to Income and Tax Statistics. Journal of Business and Economic Statistics, 10(2), 117-132.
D'Agostino, R.B. Jr. (1998). Propensity Score Methods for Bias Reduction for the Comparison of a Treatment to a Non-randomized Control Group. Statistics in Medicine, 17, 2265-2281.
D'Agostino, R.B. Jr., and Rubin, D.B. (2000). Estimating and Using Propensity Scores with Partially Missing Data. Journal of the American Statistical Association, 95(451), 749-759.
Danielsson, S. (2002). The Propensity Score and Estimation in Nonrandom Surveys: An Overview. Accessed from http://www.statistics.su.se/modernsurveys/publ/11.pdf.
Deming, W.E. (1944). On Errors in Surveys. American Sociological Review, 9(4), 359-369.
Deming, W.E., and Stephan, F.F. (1940). On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals Are Known. Journal of the American Statistical Association, 35, 615-630.
Deville, J.C., and Särndal, C.E. (1992). Calibration Estimators in Survey Sampling. Journal of the American Statistical Association, 87(418), 376-382.
Deville, J.C., Särndal, C.E., and Sautory, O. (1993). Generalized Raking Procedures in Survey Sampling. Journal of the American Statistical Association, 88(423), 1013-1020.
Dillman, D.A. (2000). Mail and Internet Surveys: The Tailored Design Method. Second Edition. New York, NY: John Wiley & Sons.
Dillman, D.A. (2002). Navigating the Rapids of Change: Some Observations on Survey Methodology in the Early 21st Century. Draft of Presidential Address to the American Association for Public Opinion Research Annual Meeting. Accessed from http://survey.sesrc.wsu.edu/dillman/papers.htm.
Drake, C. (1993). Effects of Misspecification of the Propensity Score on Estimators of Treatment Effect. Biometrics, 49(4), 1231-1236.
Duncan, K.B., and Stasny, E.A. (2001). Using Propensity Scores to Control Coverage Bias in Telephone Surveys. Survey Methodology, 27(2), 121-130.
Frigoletto, F.D., Lieberman, E., Lang, J.M., Cohen, A.P., Barss, V., Ringer, S.A., and Datta, S. (1995). A Clinical Trial of Active Management of Labor. New England Journal of Medicine, 333, 745-750.
Gattiker, U.E. (2001). The Internet as a Diverse Community: Cultural, Organizational, and Political Issues. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Gelman, A., Carlin, J., Stern, H., and Rubin, D.B. (1995). Bayesian Data Analysis. Boca Raton, FL: Chapman & Hall.
Groves, R.M. (1989). Survey Errors and Survey Costs. New York, NY: John Wiley & Sons.
Groves, R.M., and Couper, M.P. (1998). Nonresponse in Household Interview Surveys. New York, NY: John Wiley & Sons.
Groves, R.M., and Kahn, R.L. (1979). Surveys by Telephone: A National Comparison with Personal Interviews. New York, NY: Academic Press.
Heckman, J.J. (1979). Sample Selection Bias as a Specification Error. Econometrica, 47(1), 153-162.
Heckman, J.J., and Smith, J.A. (1995). Assessing the Case for Social Experiments. Journal of Economic Perspectives, 9, 85-110.
Heckman, J.J. (1997). Instrumental Variables: A Study of Implicit Behavioral Assumptions Used in Making Program Evaluations. The Journal of Human Resources, 32(3), 441-462.
Hoffer, T., Greeley, A.M., and Coleman, J.S. (1985). Achievement Growth in Public and Catholic Schools. Sociology of Education, 58, 74-97.
Huggins, V., and Eyerman, J. (2001). Probability Based Internet Surveys: A Synopsis of Early Methods and Survey Research Results.
Paper presented at the 2001 Research Conference for the Federal Committee on Statistical Methodology.
International Telecommunication Union. (2003). Internet Indicators: Hosts, Users and Number of PCs by Country. Accessed from http://www.itu.int/ITU-D/ict/statistics/.
Jayasuriya, B., and Valliant, R. (1996). An Application of Restricted Regression Estimation to Post-Stratification in a Household Survey. Survey Methodology, 22, 127-137.
Keeter, S., Kohut, A., Miller, C., Groves, R.M., and Presser, S. (2000). Consequences of Reducing Nonresponse in a Large National Telephone Survey. Public Opinion Quarterly, 64, 125-148.
Kish, L. (1965). Survey Sampling. New York, NY: John Wiley & Sons.
Kraut, R., Patterson, M., Lundmark, V., Kiesler, S., Mukhopadhyay, T., and Scherlis, W. (1998). Internet Paradox: A Social Technology That Reduces Social Involvement and Psychological Well-being? American Psychologist, 53(9), 1017-1031.
Lavori, P.W. (1992). Clinical Trials in Psychiatry: Should Protocol Deviation Censor Patient Data? Neuropsychopharmacology, 6(1), 39-48.
Lavori, P.W., and Keller, M.N. (1988). Improving the Aggregate Performance of Psychiatric Diagnostic Methods When Not All Subjects Receive the Standard Test. Statistics in Medicine, 7, 723-737.
Lee, S. (2003). An Evaluation of Nonresponse and Coverage Errors in a Web Panel Survey. Paper presented at the annual Joint Statistical Meetings, American Statistical Association, San Francisco, CA.
Lee, S. (2004). Propensity Score Adjustment as a Weighting Scheme for Volunteer Panel Web Surveys. Paper presented at the annual meeting of the American Association for Public Opinion Research, Phoenix, AZ.
Leiner, B.M., Cerf, V.G., Clark, D.D., Kahn, R.E., Kleinrock, L., Lynch, D.C., Postel, J., Roberts, L.G., and Wolff, S. (2000). A Brief History of the Internet. Accessed from http://www.isoc.org/internet/history/brief.shtml.
Lieberman, E., Lang, J.M., Cohen, A.P., D'Agostino, R. Jr., Datta, S., and Frigoletto, F.D. Jr. (1996). Association of Epidural Analgesia with Caesareans in Nulliparous Women. Obstetrics and Gynecology, 88, 993-1000.
Lepkowski, J.M. (1988). Telephone Sampling Methods in the United States. In Telephone Survey Methodology, ed. R.M. Groves, P.P. Biemer, L.E. Lyberg, J.T. Massey, W.L. Nicholls II, and J. Waksberg. New York, NY: John Wiley & Sons.
Little, R.J.A., and Rubin, D.B. (2002). Statistical Analysis with Missing Data. Second Edition. Hoboken, NJ: John Wiley & Sons.
Manfreda, K.L. (2001). Web Survey Errors. Unpublished Doctoral Dissertation. University of Ljubljana (Slovenia), Faculty of Social Science.
Merkle, D., and Edelman, M. (2002). Nonresponse in Exit Polls: A Comprehensive Analysis. In Survey Nonresponse, ed. R.M. Groves, D.A. Dillman, J.L. Eltinge, and R.J.A. Little. New York, NY: John Wiley & Sons.
Moore, J. (2002). The Internet Weather: Balancing Continuous Change and Constant Truths. New York, NY: John Wiley & Sons.
Mitofsky, W.J. (1970). Sampling of Telephone Households. Unpublished CBS News memorandum.
Mitofsky, W.J. (1999). Pollsters.com. Public Perspective, June/July, 24-26.
Nie, N.H., and Erbring, L. (2000). Internet and Society: A Preliminary Report. Palo Alto, CA: Stanford Institute for the Quantitative Study of Society. Accessed from http://www.stanford.edu/group/siqss.
Obenchain, R.L. (1999). Propensity Score Binning and Smoothing in Splus, Version 9911. Accessed from http://www.math.iupui.edu/~indyasa/bobodown.htm.
Obenchain, R.L., and Melfi, C.A. (1997).
Propensity Score and Heckman Adjustments for Treatment Selection Bias in Database Studies. Proceedings of the Biopharmaceutical Section, American Statistical Association, 297-306.
Presser, S., Blair, J., and Triplett, T. (1992). Survey Sponsorship, Response Rates, and Response Effects. Social Science Quarterly, 73(3), 699-702.
Reid, E. (1991). Electropolis: Communication and Community on Internet Relay Chat. Unpublished Honors Thesis, University of Melbourne. Accessed from ftp://ftp.parc.xerox.com/pub/MOO/papers/electropolis.
Rosenbaum, P.R., and Rubin, D.B. (1983). The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika, 70(1), 41-55.
Rosenbaum, P.R. (1984a). From Association to Causation in Observational Studies: The Role of Tests of Strongly Ignorable Treatment Assignment. Journal of the American Statistical Association, 79(385), 41-48.
Rosenbaum, P.R. (1984b). The Consequences of Adjustment for a Concomitant Variable That Has Been Affected by the Treatment. Journal of the Royal Statistical Society, Series A (General), 147(5), 656-666.
Rosenbaum, P.R., and Rubin, D.B. (1984). Reducing Bias in Observational Studies Using Subclassification on the Propensity Score. Journal of the American Statistical Association, 79(387), 516-524.
Rosenbaum, P.R., and Rubin, D.B. (1985a). Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score. The American Statistician, 39(1), 33-38.
Rosenbaum, P.R., and Rubin, D.B. (1985b). The Bias Due to Incomplete Matching. Biometrics, 41(1), 103-116.
Rubin, D.B. (1973). Matching to Remove Bias in Observational Studies. Biometrics, 29(1), 159-183.
Rubin, D.B. (1978). Multiple Imputation in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse. Proceedings of the Section on Survey Research Methodology, American Statistical Association, 20-34.
Rubin, D.B. (1979). Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies. Journal of the American Statistical Association, 74(366), 318-328.
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York, NY: John Wiley & Sons.
Rubin, D.B. (1997). Estimation from Nonrandomized Treatment Comparisons Using Subclassification on Propensity Scores. Annals of Internal Medicine, 127, 8(2), 757-763.
Rubin, D.B., and Thomas, N. (1992). Characterizing the Effect of Matching Using Linear Propensity Score Methods with Normal Distributions. Biometrika, 79(4), 797-809.
Rubin, D.B., and Thomas, N. (1996). Matching Using Estimated Propensity Scores: Relating Theory to Practice. Biometrics, 52, 254-268.
SAS Institute, Inc. (1999). SAS/STAT User's Guide, Version 8. Cary, NC: SAS Institute, Inc.
Schonlau, M., Fricker, R.D. Jr., and Elliott, M.N. (2002). Conducting Research Surveys via E-mail and the Web. Santa Monica, CA: RAND.
Schonlau, M., Zapert, K., Simon, L.P., Sanstad, K., Marcus, S., Adams, J., Spranca, M., Kan, H., Turner, R., and Berry, S. (2004). A Comparison between a Propensity Weighted Web Survey and an Identical RDD Survey. Social Science Computer Review, 22(1).
Slevin, J. (2000). The Internet and Society. Cambridge, UK: Polity Press.
Smith, P.J., Rao, J.N.K., Battaglia, M.P., Daniels, D., and Ezzati-Rice, T. (2000). Compensating for Nonresponse Bias in the National Immunization Survey Using
Response Propensities. Proceedings of the Section on Survey Research Methods, American Statistical Association, 641-646.
Spiegelhalter, D.J., Thomas, A., and Best, N.G. (1999). WinBUGS Version 1.2 User Manual. MRC Biostatistics Unit.
Stone, R.A., Obrosky, S., Singer, D.E., Kapoor, W.N., and Fine, M.J. (1995). Propensity Score Adjustment for Pretreatment Differences between Hospitalized and Ambulatory Patients with Community-Acquired Pneumonia. Medical Care, 33, AS56-AS66.
Taylor, H., and Terhanian, G. (2003). The Evaluation of Online Research and Surveys over the Last Two Years. Unpublished manuscript.
Taylor, H., Bremer, J., Overmeyer, C., Siegel, J.W., and Terhanian, G. (2001). The Record of Internet-Based Opinion Polls in Predicting the Results of 72 Races in the November 2000 US Elections. International Journal of Market Research, 43(2), 127-135.
Taylor, H. (2000). Does Internet Research Work? Comparing Online Survey Results with Telephone Surveys. International Journal of Market Research, 42(1), 58-63.
Terhanian, G. (2000). How to Produce Credible, Trustworthy Information through Internet-Based Survey Research. Paper presented at the annual meeting of the American Association for Public Opinion Research, Portland, OR.
Terhanian, G., and Bremer, J. (2000). Confronting the Selection-Bias and Learning Effects Problems Associated with Internet Research. Research paper: Harris Interactive.
Terhanian, G., Bremer, J., Smith, R., and Thomas, R. (2000). Correcting Data from Online Surveys for the Effects of Nonrandom Selection and Nonrandom Assignment. Research paper: Harris Interactive.
Toffler, A. (1991). Powershift: Knowledge, Wealth, and Violence at the Edge of the 21st Century. New York, NY: Bantam Books.
Toffler, A. (1980). The Third Wave. New York, NY: William Morrow.
Toffler, A. (1970). Future Shock. New York, NY: Bantam Books.
Turkle, S. (1995). Life on the Screen: Identity in the Age of the Internet. New York, NY: Simon & Schuster.
U.S. Department of Commerce (2002). A Nation Online: How Americans Are Expanding Their Use of the Internet. Accessed from http://www.ntia.doc.gov/ntiahome/dn/.
Valliant, R. (2004). The Effect of Multiple Weighting Steps on Variance Estimation. To appear in Journal of Official Statistics.
Varedian, M., and Forsman, G. (2002). Comparing Propensity Score Weighting with Other Weighting Methods: A Case Study on Web Data. Paper presented at the annual meeting of the American Association for Public Opinion Research, St. Petersburg Beach, FL.
Vartivarian, S., and Little, R. (2003). On the Formation of Weighting Adjustment Cells for Unit Nonresponse. University of Michigan Department of Biostatistics Working Paper Series.
Vehovar, V., and Manfreda, K.L. (1999). Web Surveys: Can the Weighting Solve the Problem? Proceedings of the Section on Survey Research Methods, American Statistical Association, 962-967.
Venables, W.N., Smith, D.M., and the R Development Core Team. (2003). An Introduction to R.
Waksberg, J. (1978). Sampling Methods for Random Digit Dialing. Journal of the American Statistical Association, 73, 40-46.
Westat (2000). WesVar 4.0 User's Guide. Rockville, MD: Westat.