ABSTRACT

Title of Document: THE MIXTURE DISTRIBUTION POLYTOMOUS RASCH MODEL USED TO ACCOUNT FOR RESPONSE STYLES ON RATING SCALES: A SIMULATION STUDY OF PARAMETER RECOVERY AND CLASSIFICATION ACCURACY

Youngmi Cho, Doctor of Philosophy, 2013

Directed By:
Professor Jeffrey R. Harring
Professor George B. Macready
Department of Human Development and Quantitative Methodology

Response styles presented in rating scale use have been recognized as an important source of systematic measurement bias in self-report assessment. People with the same amount of a latent trait may in some cases be victims of biased test scores due to the construct-irrelevant effect of response styles. The mixture polytomous Rasch model has been proposed as a tool to deal with response style problems. This model can be used to classify respondents with different response styles into different latent classes, and it provides person trait estimates that have been corrected for the effect of a response style. This study investigated how well the mixture partial credit model (MPCM) recovered model parameters under various testing conditions. Item responses that characterized extreme response style (ERS), middle-category response style (MRS), and acquiescent response style (ARS) on a 5-category Likert scale, as well as ordinary response style (ORS), which does not involve distorted rating scale use, were generated. The study results suggested that ARS respondents could be almost perfectly differentiated from other response-style respondents, while the correct differentiation between MRS and ORS respondents was the most difficult to attain, followed by the differentiation between ERS and ORS respondents. The classifications were more difficult when the distorted response styles were present in small proportions within the sample. Under the simulated conditions in which ten items and a sample size of 3000 were used, reasonable item threshold and person parameter estimates were obtained.
As the structure of the mixture of response styles became more complex, increased sample size, increased test length, and a balanced mixing proportion were needed in order to achieve the same level of recovery accuracy. Misclassification impacted the overall accuracy of person trait estimation. BIC was found to be the most effective data-model fit statistic in identifying the correct number of latent classes under this modeling approach. The model-based correction of score bias was explored with up to four different response-style latent classes. Problems with the estimation of the model, including non-convergence, boundary threshold estimates, and label switching, were discussed.

THE MIXTURE DISTRIBUTION POLYTOMOUS RASCH MODEL USED TO ACCOUNT FOR RESPONSE STYLES ON RATING SCALES: A SIMULATION STUDY OF PARAMETER RECOVERY AND CLASSIFICATION ACCURACY

By Youngmi Cho

Doctoral Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy
2013

Advisory Committee:
Professor Jeffrey R. Harring, Chair
Professor George B. Macready, Co-Chair
Professor Robert G. Croninger
Professor Robert W. Lissitz
Professor Matthias von Davier

© Copyright by Youngmi Cho 2013

Table of Contents

Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
  1.1 Background of the Problem
    1.1.1 Response styles.
    1.1.2 Why response styles matter?
    1.1.3 Response styles as meaningful constructs.
    1.1.4 Methodology to deal with response styles.
  1.2 Mixture IRT Models in Empirical Studies
  1.3 The Current Study
Chapter 2: Literature Review
  2.1 The Rasch Model for binary item responses
    2.1.1 Presentation of the model.
    2.1.2 Item response function.
  2.2 Partial Credit Model
    2.2.1 Presentation of model.
    2.2.2 Threshold parameters in the PCM.
    2.2.3 Category characteristic curves and the presence of response styles.
  2.3 Unique Features of the Rasch Models
  2.4 Mixture Distribution Models
    2.4.1 Continuous and discrete mixture distribution.
    2.4.2 Latent class model.
    2.4.3 Mixture IRT models.
  2.5 Mixture Partial Credit Model
    2.5.1 Presentation of model.
    2.5.2 Parameter estimation.
    2.5.3 Assigning latent class membership.
    2.5.4 Determining the number of latent classes.
  2.6 Applications of the MPCM to Study of Response Styles
    2.6.1 Real data analysis.
    2.6.2 Simulated data analysis.
Chapter 3: Methodology
  3.1 Objectives and Research Questions
  3.2 Overview of Simulation Study
    3.2.1 Manipulated factors.
    3.2.2 Fixed factors.
    3.2.3 Response scale.
  3.3 Data Generation
    3.3.1 Population generating thresholds.
    3.3.2 Item responses generation.
  3.4 Analysis and Evaluation Criteria
    3.4.1 Fitting competing models.
    3.4.2 Convergence check.
    3.4.3 Model selection.
    3.4.4 Problem of label switching.
    3.4.5 Classification accuracy.
    3.4.6 Threshold parameter recovery.
    3.4.7 Person trait parameter recovery.
    3.4.8 Model-based correction of score bias due to response styles.
    3.4.9 Evaluation of effects of manipulated factors.
Chapter 4: Results
  4.1 Initial treatment of estimation problems and label switching problems
    4.1.1 Non-convergence and boundary estimates.
    4.1.2 Label switching problems.
  4.2 Model selection
  4.3 Classification of Respondents
    4.3.1 Classification accuracy.
    4.3.2 Misclassification.
  4.4 Threshold Parameter Recovery
    4.4.1 Evaluation of the RMSE.
    4.4.2 Evaluation of the correlation.
    4.4.3 Evaluation of the standard error.
  4.5 Person Trait Parameter Recovery
    4.5.1 Evaluation of the bias.
    4.5.2 Evaluation of the RMSE.
    4.5.3 Evaluation of the correlation.
    4.5.4 Impact of misclassification on person trait estimation.
  4.6 Model-based Correction of Score Bias
Chapter 5: Discussion
  5.1 Summary of Findings
  5.2 Discussion
  5.3 Limitations of the current study and implications for future research
Appendix A
References

List of Tables

Table 1. Manipulated Simulation Conditions of Population Heterogeneity
Table 2. Expected Marginal Category Probabilities for Different Response-style Classes
Table 3. Category Probabilities for Individual Items for ORS Class
Table 4. Threshold Values Used for the Generation of the ORS Class
Table 5. Threshold Values Used for the Generation of the MRS Class
Table 6. Threshold Values Used for the Generation of the ERS Class
Table 7. Threshold Values Used for the Generation of the ARS Class
Table 8. Means of Generated Threshold Parameters for Each Response-style Class
Table 9. Percentages of the Occurrence of Non-convergence and Boundary Threshold Estimates
Table 10. Specifications of Simulation Conditions Excluded from Simulation Summary Due to Estimation Problems
Table 11. Specifications of Simulation Conditions in which Switched Labels are unsolvable
Table 12. Model Selection under the ORS-ERS Mixtures
Table 13. Model Selection under the ORS-MRS Mixtures
Table 14. Model Selection under the ORS-ARS Mixtures
Table 15. Model Selection under the ORS-ERS-MRS Mixtures
Table 16. Model Selection under the ORS-ERS-MRS-ARS Mixtures
Table 17. Percentages of Correct Classification and Standard Errors of Classification Accuracy
Table 18. Factorial ANOVA Results on Overall Classification Accuracy
Table 19. Cell Means of the Overall Classification Accuracy
Table 20. Effect size ( 2? ) for the Classification Accuracy Conditional on Statistical Significance (p < 0.05)
Table 21. Percentages of Misclassified Respondents
Table 22. RMSE of Threshold Parameter Estimates
Table 23. Factorial ANOVA Results on the RMSE of Threshold Estimates for ORS Class
Table 24. Cell Means of the RMSE of Threshold Estimates for the ORS Class
Table 25. Factorial ANOVA Results on the RMSE of Threshold Estimates for the ERS Class
Table 26. Cell Means of the RMSE of Threshold Estimates for the ERS Class
Table 27. Factorial ANOVA Results on the RMSE of Threshold Estimates for the MRS Class
Table 28. Cell Means of the RMSE of Threshold Estimates for the MRS class
Table 29. Factorial ANOVA Results on the RMSE of Threshold Estimates for the ARS Class
Table 30. Cell Means of the RMSE of Threshold Estimates for the ARS class
Table 31. Correlations Between Generated and Estimated Threshold Parameters
Table 32. Effect size ( 2? ) for Correlation for Thresholds Parameters Conditional on Statistical Significance (p < 0.05)
Table 33. Cell Means of the RMSE of Threshold Estimates for the ORS Class
Table 34. Cell Means of the Correlation for the ERS Class
Table 35. Cell Means of the RMSE of Threshold Estimates for the MRS Class
Table 36. Cell Means of the RMSE of Threshold Estimates for the ARS Class
Table 37. SE of Threshold Parameter Estimates
Table 38. Effect size ( 2? ) for the SE Conditional on Statistical Significance (p < 0.05)
Table 39. Cell Means of the SE of Threshold Estimates for the ORS Class
Table 40. Cell Means of the RMSE of Threshold Estimates for the ERS Class
Table 41. Cell Means of the SE of Threshold Estimates for the MRS Class
Table 42. Cell Means of the SE of Threshold Estimates for the ARS Class
Table 43. Theta Recovery for All Respondents and Correctly Classified Respondents
Table 44. Effect size ( 2? ) for the RMSE of Theta Estimates Conditional on Statistical Significance (p < 0.05)
Table 45. Effect size ( 2? ) for the Correlation of Theta Estimates Conditional on Statistical Significance (p < 0.05)
Table 46. Cell Means of the RMSE of theta estimates
Table 47. Cell Means of the Correlation of Theta Estimates
Table 48. Paired t-test Results on the Impact of Misclassification on Theta Recovery

List of Figures

Figure 1. A Likert scale with five ordered response categories
Figure 2. CCs corresponding to Rasch model items with different item difficulty
Figure 3. Five response categories and four corresponding steps
Figure 4. CCC and threshold probabilities for a PCM item with thresholds (-1.7, -0.6, 0.6, and 1.7)
Figure 5. CCCs and threshold probabilities for a PCM item with thresholds (-1.85, -1.24, 1.34, and 1.95)
Figure 6. CCCs and threshold probabilities for a PCM item with thresholds (-2.01, -2.45, 2.45, and 2.01)
Figure 7. CCCs and threshold probabilities for a PCM item with thresholds (0.45, 0.74, -0.74, and -0.45)
Figure 8. CCCs and threshold probabilities for a PCM item with thresholds (-1.51, -1.63, -2.42, and -0.93)
Figure 9. Expected marginal frequency distributions of category responses for
Figure 10. Thresholds plot for 10 items for ORS class
Figure 11. Thresholds plot for 10 items for the MRS class
Figure 12. Thresholds plot for 10 items for the ERS class
Figure 13. Thresholds plot for 10 items for the ARS class
Figure 14. Conditional frequency distributions of category responses for an item with
Figure 15. Conditional frequency distributions of category responses for an item with higher item location
Figure 16. Interaction effect between type of mixture and test length on the overall classification accuracy
Figure 17. Interaction effect between type of mixture and test length on the RMSE of threshold estimates for the ORS class
Figure 18. Interaction effect between type of mixture and test length on the SE of threshold estimates for the ORS class
Figure 19. Interaction effect between type of mixing proportion and test length on the SE of threshold estimates for the ERS class
Figure 20. Theta estimates as a function of sum score for ORS, ERS, and MRS Class
Figure 21. Theta estimates as a function of sum score for ORS, ERS, MRS, and ARS Class

Chapter 1: Introduction

To provide the background of the problems dealt with in the current study, Chapter 1 reviews the individual differences in rating scale use. Psychological aspects of those individual differences, methodologies to address the related psychometric issues, and the findings in previous empirical studies are detailed. The chapter continues to discuss the purpose and significance of the current study.

1.1 Background of the Problem

1.1.1 Response styles

While the dichotomously-scored item format is more prevalent in cognitive assessment, items with ordered polytomous response categories have been routinely used in self-report, non-cognitive assessment, including various psychological tests and attitudinal survey questionnaires. Prototypical examples of the ordered polytomous item format are Likert-type rating scales (Likert, 1932), of which an illustration is presented in Figure 1.¹

Figure 1. A Likert scale with five ordered response categories

¹ The question "In most ways, my life is close to my ideal" is one of the five items of the Satisfaction With Life Scale (SWLS) by Diener, Emmons, Larsen, & Griffin (1985). The SWLS intends to measure global cognitive judgments of satisfaction with one's life. The original form of the SWLS uses seven response categories and does not use the graphical representation of the continuum as presented in Figure 1.

As seen in the illustrative item in Figure 1, a Likert-scale item attempts to quantify the individual differences in a continuous trait variable based on a certain number of response categories that are often associated with integer scores.
It is generally assumed that if a respondent chooses a higher response category, he or she has more of the latent trait being measured by the item than a person who selects a lower response category. The formal aspects of the rating scale, such as the number of response categories, category wording, and item wording, can differ in various ways.

In order to utilize the Likert-scale measures as valid indicators of a latent trait of interest and to further compare the trait level among (groups of) respondents, certain necessary conditions must first be satisfied. For example, it must be assumed that respondents' choice of a response category is solely based on the substantive meaning of the item. In other words, no content-irrelevant factor should systematically influence the respondent's choice of response categories. Additionally, all respondents in a sample must interpret the meaning of the provided response categories and use them in the same manner when they answer each item. These assumptions, however, do not hold if respondents present different response styles in responding to a rating scale.

A response style (also referred to as a response set or response bias) can be defined as an individual's tendency to respond consistently to test items based on some formal aspects of the item or item connotation rather than the underlying construct the item intends to measure (Cronbach, 1946; Messick, 1991; Nunnally, 1978; Paulhus, 1991). The prototypical manifestations of response styles in ordered polytomous response items are respondents' differential uses of response categories. Among many others (see, e.g., Baumgartner & Steenkamp, 2001, and Paulhus, 1991, for a review of various response styles), three particular patterns of response category use that are well documented in the psychometric literature (e.g., Nunnally, 1978; Paulhus, 1991) are the primary focus of the current study.
These are extreme response style (ERS), middle-category response style (MRS), and acquiescent response style (ARS). ERS is an individual tendency that leads a person to predominantly use extreme response categories (e.g., categories 0 and 4 in Figure 1) and avoid less extreme choices (response categories in the middle of the scale). Conversely, MRS is a tendency to select the middle category (e.g., category 2 in Figure 1) predominantly while avoiding extreme responses. ARS is a tendency to use only one side of the response scale, i.e., agreement ("yea-saying", e.g., categories 3 and 4 in Figure 1) or disagreement ("nay-saying", e.g., categories 0 and 1 in Figure 1).

1.1.2 Why response styles matter?

The presence of response styles in a data set can cause various psychometric problems. These adverse effects may invalidate test score differences, obscure true relations among traits of interest, impact test reliability, and confound the results of comparative studies at the group level.

Response styles can invalidate the assessment of true scores by inflating or deflating observed item scores. Cronbach (1946) pointed out that response styles always reduce the logical validity of a test because they permit people with equal knowledge, identical attitude, or equal amounts of a personality trait to have different test scores. Suppose that there are two people whose true levels of "satisfaction with life" are located around category 3 on the latent trait continuum in Figure 1. However, they differ in their response styles, i.e., one is an ERS respondent and the other is an MRS respondent. If their different response styles are operating during the item response process, it is highly likely that the two people will not choose the same response category. Instead, due to the confounding effects of their different response styles, the ERS respondent might select category 4, for example, while the MRS respondent might select category 2.
Consequently, the ERS respondent would be regarded as being more satisfied with his life than the MRS respondent.

Using observed test scores contaminated by response styles can also cause serious problems in clinical diagnostic settings (see, e.g., Gollwitzer, Eid, & Jürgensen, 2005). In clinical symptom assessments, it is common practice to compute total (sum) scores by adding up the category response scores and to compare these sum scores to appropriate normative values in order to make diagnostic decisions. Without considering individual differences in response styles, this approach to assessing clinical symptoms will lead to lower sensitivity as well as lower specificity of the diagnosis.

Response styles may also give rise to spurious associations among trait domains of interest. Austin, Deary, Gibson, McGregor, and Dent (1998) assessed the consistency of response styles over items and over subscales of the NEO-FFI (NEO Five-Factor Inventory; Costa & McCrae, 1992) by using a measure of response spread on a rating scale. They found non-trivial, highly significant correlations between unrelated, independent items. The observed spurious correlations may be attributed to the effect of response styles operating across the items, because it seems unlikely that items whose contents are unrelated to each other would yield such high levels of correlation. Austin et al. (1998) also pointed out that such spurious correlations could cause erroneous extraction and interpretation of latent factors in multivariate data analyses that are based on correlation matrices. Similarly, Austin et al. (2006) and Baumgartner and Steenkamp (2001) provided empirical support for the contribution of response styles to inflated scale-level correlations.

The impact of response styles on test reliability can be found in a simulation study by Liu, Wu, and Zumbo (2009). They generated outlying data, which represented ERS responses, under a mixture modeling framework.
Their results on the bias and efficiency of Cronbach's coefficient alpha showed that outliers severely inflated both the alpha coefficient and the standard error of its estimates.

Another methodological issue is that response styles tend to be manifested differentially across groups. That is, certain response styles tend to be more prevalent in one group than in another. This between-group variability in response styles is likely to contribute to the violation of structural invariance and, in turn, any observed group differences may simply reflect measurement artifacts due to the differences in response styles. Regarding the between-group variability, Cheung and Rensvold (2000) used multi-group confirmatory factor analysis and demonstrated that certain types of measurement non-invariance were attributable to the manifestations of ERS and ARS. Bolt and Johnson (2009) applied a multidimensional item response theory (IRT) model and found that ERS was an underlying source of differential item functioning (DIF).

In various areas of study such as marketing, organizational and industrial psychology, education, and medicine, there has been accumulating empirical evidence of between-group variability across nations, ethnic groups, and cultural regions (e.g., Baumgartner & Steenkamp, 2001; Buckley, 2009; Cheung & Rensvold, 2000; Harzing, 2006; Yang, Harkness, Chin, & Villar, 2010). For example, it has been shown that ERS and ARS are more prevalent among Hispanics/Latinos and African-Americans than among Caucasians in the U.S. (Bachman & O'Malley, 1984; Clarke III, 2000; Hui & Triandis, 1989; Marin, Gamba & Marin, 1992; Ross & Mirowsky, 1984). Japanese and Chinese respondents in the U.S. tended to use extreme responses less often than Americans in responding to positive feelings (Lee, Jones, Mineyama, & Shang, 2002).
Japanese and Korean students tended to use middle categories more often than their American counterparts (Chen, Lee, & Stevenson, 1995; Lee & Green, 1991). In Europe, ERS has been shown to be more prevalent in Mediterranean countries (Italy, Spain, and Greece) than in the United Kingdom, Germany, and France (Van Herk, Poortinga, & Verhallen, 2004).

1.1.3 Response styles as meaningful constructs

Rather than perceiving response styles as a source of systematic measurement bias, one strand of research in psychology views response styles as meaningful reflectors of psychological constructs such as personality traits and cognitive processes, or of some cultural values. In those studies, the relation between some criterion variables and a specific response style was investigated. For example, ERS appeared to be positively related to trait anxiety (Berg & Collier, 1953; Lewis & Taylor, 1955; Norman, 1969), extraversion (Austin, Deary, & Egan, 2006), and conscientiousness (Austin et al., 2006; Harzing, 2006).

In the area of cognitive process research, Temple and Geisinger (1990) and Kulas and Stachowski (2008) found that middle category endorsements (e.g., "neither disagree nor agree", "no answer", or "?") exhibited longer response latencies than other category endorsements and were more frequently elicited when the given items were unclear, personally intrusive, or asked introspective questions. The results of these experimental studies provide evidence of increased cognitive load in processing the information contained in the middle category. The implication is that response styles, in some cases, could be associated with the respondent's attempts to reduce the cognitive demand required to process the meaning of the item content and the labels of the response categories.

In cross-cultural comparative studies, types of response style have been associated with cultural values.
For example, using measures of Hofstede's cultural dimensions [2], several studies argued that ARS seemed to be positively correlated with collectivism and femininity but negatively related to power distance and uncertainty avoidance. ERS appeared to be positively correlated with individualism, power distance, uncertainty avoidance, and masculinity (see, e.g., Chen, Lee, & Stevenson, 1995; de Jong, Steenkamp, Fox, & Baumgartner, 2008; Harzing, 2006; Johnson, Kulesa, Cho, & Shavitt, 2005).

[2] Hofstede's cultural dimensions theory (Hofstede, 1980) postulates four dimensions along which cultural values can be analyzed. The four dimensions are individualism-collectivism; uncertainty avoidance; power distance (strength of social hierarchy); and masculinity-femininity (task orientation versus person orientation).

1.1.4 Methodology to deal with response styles

No matter how response styles are considered, i.e., as a statistical nuisance that needs to be controlled for or as a meaningful construct of interest, the first step of the data analysis should be to identify the cases that are influenced by certain response styles. Once identified, these cases can either be controlled for (by eliminating them from the data or applying a correction method) or be related to other variables to reveal the nature of the response styles and to investigate their structural relations with latent variables. Traditional strategies for dealing with response styles use simple descriptive statistics calculated for heterogeneous items and balanced scales, which are designed as a "built-in control" in an instrument. More recently, different latent variable models have been proposed to help solve these response style problems.

Heterogeneous items. Heterogeneous items refer to items whose contents are psychologically diffuse and theoretically independent of each other.
In practice, a number of items that do not refer to a substantively meaningful psychological construct can be used as heterogeneous items in an assessment. Alternatively, items varying widely in content can be selected from a diverse set of scales that have little in common (see, e.g., Couch & Keniston, 1960). If a respondent consistently favors particular response categories (e.g., extreme categories) across such heterogeneous items, this behavior can be taken as evidence of a response style (e.g., ERS). Response style measures for ERS, MRS, or ARS can then be derived by calculating the number or the proportion of the heterogeneous items on which a respondent selects the most extreme categories, the middle category, or categories in just the upper or the lower extreme, respectively. Instead of a frequency or proportion, the response range as measured by the standard deviation of item scores within individuals has also been used (Austin et al., 1997; Greenleaf, 1992; Hui & Triandis, 1985).

The major weakness of using heterogeneous items is that if the substantive independence among the items is not warranted for a given sample of respondents, which is not unusual in practice, the resulting response style measures are confounded with the respondent's trait level. In such cases, clustering respondents into different response-style groups may not be valid and inferences based on these clusters can hardly be justified. There is also a practical limitation. It has been pointed out in the literature that the number of heterogeneous items should be large so that a response style has sufficient opportunity to manifest itself consistently in the response pattern (Couch & Keniston, 1960; Greenleaf, 1992).
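The descriptive measures described above can be sketched in a few lines of Python. This is only an illustration: the function name `response_style_indices`, the 0-based category coding, and the decision to count "agree" and "disagree" as all responses above or below the midpoint are assumptions made here, not definitions taken from the cited studies.

```python
import statistics

def response_style_indices(responses, n_categories=5):
    """Descriptive response-style indices for one respondent.

    responses: item scores coded 0 .. n_categories - 1 on heterogeneous items.
    The coding and the agree/disagree cut at the midpoint are assumptions.
    """
    low, high = 0, n_categories - 1
    middle = high / 2  # an actual category only when n_categories is odd
    n = len(responses)
    return {
        # proportion of responses in either end category (ERS indicator)
        "extreme": sum(r in (low, high) for r in responses) / n,
        # proportion of responses in the middle category (MRS indicator)
        "middle": sum(r == middle for r in responses) / n,
        # proportions on the upper and lower side of the scale (ARS indicators)
        "agree": sum(r > middle for r in responses) / n,
        "disagree": sum(r < middle for r in responses) / n,
        # within-person standard deviation of item scores (response range)
        "sd": statistics.pstdev(responses),
    }

# Hypothetical respondent who mostly uses the end categories.
indices = response_style_indices([4, 4, 0, 4, 2, 4, 0, 4])
print(indices)
```

As the text notes, such indices are only interpretable as response style measures when the items are substantively independent; otherwise they are confounded with trait level.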
If a test is lengthened by including heterogeneous items, psychometric problems may arise (e.g., an increase in measurement error due to respondent fatigue and lowered face validity of the test), in addition to the added time and cost of administering the inventory.

Balanced scales. A balanced scale consists of pairs of logically reversed items, i.e., one item of the pair states a construct positively while the other states the equivalent construct negatively (Couch & Keniston, 1960; Paulhus, 1991). In this way, the scale becomes semantically balanced. If a respondent has a tendency to acquiesce and responds to a pair of such logically reversed items by "yea-saying" or "nay-saying" to both, his or her responses are conceptually conflicting. If this conflicting endorsement is repeated, it can provide strong evidence for ARS. Using a balanced scale in an assessment does not per se preclude the occurrence of ARS. A well-constructed balanced scale, however, can alleviate score distortion to some degree. By "reverse coding" item responses (mostly, responses to negatively worded items) before summing all item scores, high or low item scores obtained by simply "yea-saying" or "nay-saying" will cancel each other out and ARS respondents will receive a moderate test score. Mirowsky and Ross (1991) showed that ARS inflated the variance and reliability of the trait estimates when unbalanced scales were used, leading to either an overestimation or an underestimation of the relation between the construct measured by the unbalanced scale and other constructs. Watson (1992) showed that the covariance due to ARS can be extracted using structural equation modeling when an unbalanced scale is used.

Model-based approaches.
Besides utilizing heterogeneous items and balanced scales in the test development stage, an increasing number of studies have attempted a more rigorous solution to this problem by applying latent variable models into which response style effects are directly incorporated.

Within the structural equation modeling (SEM) framework, response styles are examined as group characteristics and group differences in the manifestation of response styles are statistically tested. Cheung and Rensvold (2000) applied multiple-group confirmatory factor analysis to test for the presence of ERS and ARS and to determine whether cultural groups can be meaningfully compared on the basis of factor means. Group differences in ERS and ARS are operationalized as non-invariance in the factor loadings and intercepts. This study showed the utility of the SEM approach in this matter, but also highlighted its limitations. The SEM approach was not appropriate when no items in the scale were invariant across groups with respect to the effects of response styles. Also, the SEM approach does not provide individual-level information.

Billiet and McClendon (2000) estimated a confirmatory three-factor model that included ARS as a common "style" factor (i.e., method factor) in addition to two "content" factors. By using two sets of balanced scales measuring two independent constructs, they demonstrated that the effects of the style factor can be separated from the content factors. Moors (2003, 2004, 2008) adopted the same rationale as Billiet and McClendon (2000) but within latent class factor analysis (LCFA). Moors emphasized the flexibility of this approach over multi-group CFA in that LCFA allowed response styles to be manifested within an exploratory setting in which no response style was hypothesized for a given data set. In Moors's empirical studies, an ERS factor was identified.
The approaches of Billiet and McClendon and of Moors both impose the restriction that the factor loadings on the style factor are equal for all items. However, if the items are actually influenced differentially by the response style, assuming a constant factor loading on the style factor would lead to model misspecification.

Within an item response theory (IRT) framework, Bolt and Johnson (2009) developed a multidimensional model that extends Bock's nominal response model (Bock, 1972) to investigate ERS. In this model, response styles were characterized as continuous trait dimensions that influenced the attractiveness of particular score categories. The item response probabilities were defined as a function of two trait dimensions, i.e., an intended substantive trait and an ERS tendency. Based on the estimates for these two dimensions, observed test scores were corrected for the impact of ERS. Although this approach has been shown to be useful for understanding how substantive and ERS traits combine to affect item response behavior, it has not yet been explored whether it can be successfully applied to other types of response styles (e.g., MRS and ARS) or whether it can handle conditions in which more than two response styles are present in a sample.

De Jong, Steenkamp, Fox, and Baumgartner (2008) proposed a model that extended a standard IRT model by integrating testlet models (e.g., Bradlow, Wainer, & Wang, 1999) and a structural multilevel model. The inclusion of the testlet component in the model permits control for substantive correlations that may exist among heterogeneous items. This model allows the response styles to have differential impact across items. In addition, measurement-invariant anchor items are not required for group comparisons. This approach successfully identifies ERS, but is arguably less useful for correcting the effects of ERS on substantive trait estimates (Bolt & Newton, 2010).
Lastly, mixture polytomous IRT models, which generalize the standard polytomous IRT models to mixture distribution models, have been used by a growing number of researchers in various disciplines, more so than the other model-based approaches introduced above. Like LCFA, mixture polytomous IRT models are useful for studying response styles in an exploratory manner, a benefit that neither the SEM approach nor the extended IRT models of Bolt and Johnson (2009) and De Jong et al. (2008) offer. Unlike the Bolt and Johnson (2009) approach, in which response styles were treated as continuous variables (quantitative differences), mixture polytomous IRT models treat response styles as discrete variables (qualitative differences) and assign each respondent to a latent class that represents his or her response style. This allows for a more flexible and effective modeling technique that is applicable when multiple response styles are present within a sample of respondents. Mixture polytomous IRT models yield not only a classification of respondents but also individual-level estimates of the latent trait, information that is not available from studies in the SEM framework. More details on the mixture polytomous IRT models are provided in the next section as well as in Chapter 2.

1.2 Mixture IRT Models in Empirical Studies

As mentioned earlier, the common manifestation of response styles, regardless of their cause, is respondents' disproportionate use of response categories. Different types of response styles can be characterized by different category response probabilities. For example, a sample of ERS respondents shows a high probability of endorsing the end categories.
By analyzing these distinctive patterns of category response probabilities, mixture polytomous IRT models provide a way to distinguish latent groupings of respondents with different response styles.

In general, mixture IRT models assume that the respondent population can be heterogeneous not only quantitatively but also qualitatively. If respondents differ in how they use the response categories, this heterogeneity can possibly be captured using mixture IRT models, and respondents with different response styles are classified into different latent classes. A latent trait estimate is assigned to each respondent within the identified classes and, hence, response style effects can be controlled when latent trait levels are compared. Mixture polytomous Rasch models are special cases of mixture IRT models in which the category response probabilities are predicted by one of the logistic functions of the polytomous Rasch family, such as the partial credit model (Masters, 1984), rating scale model (Andrich, 1978), mixed dispersion model (Andrich, 1982), and successive interval model (Rost, 1988).

The mixture partial credit model (MPCM) was proposed by Rost as an extension of latent class analysis that takes account of the different usage of rating scales within latent classes (Rost, 1991). When he proposed the MPCM, he suggested the model as a method for classifying people according to their item response profile, independent of the location of the profile on the latent continuum. Because the MPCM is the Rasch model in which no restriction on the item parameters is imposed, it is often called the mixture (or mixed) polytomous Rasch model (Rost, 1991; von Davier & Rost, 1995). In this dissertation, the terms mixture partial credit model (MPCM) and mixture polytomous Rasch model are used interchangeably.
Mixture polytomous IRT models, especially the MPCM, have been increasingly used in applied studies in personality, organizational, and clinical psychology for the analysis of Likert-scale self-report data (e.g., Austin, Deary, & Egan, 2006; Egberink, Meijer, & Veldkamp, 2010; Eid & Rauber, 2000; Gollwitzer et al., 2005; Maij-de Meij, Kelderman, & van der Flier, 2005, 2008; Meiser & Machunski, 2008; Rost, 1991; Rost, Carstensen, & von Davier, 1997; Smith, Ying, & Brown, 2012; Wu & Huang, 2010; Zickar, Gibby, & Robie, 2004). All of these studies used the MPCM except Maij-de Meij et al. (2005, 2008), who used the mixture nominal response model, Egberink et al. (2010), who used the mixture graded response model, and Meiser and Machunski (2008), who used the mixture rating scale model.

1.3 The Current Study

As reviewed in this chapter, the MPCM has great potential to provide solutions to the long-standing psychometric problems caused by response styles. Despite the growing interest and need in practical settings, little evidence has been provided about the accuracy of parameter estimation in the MPCM. When Rost (1991) proposed the MPCM, he conducted a one-replication simulation study that provided evidence for the quality of the maximum likelihood estimates. However, the simulation conditions were very limited, which makes the results difficult to generalize. In the MPCM, the accuracy of parameter estimates can vary depending on several factors such as the estimation algorithm, the number of items, the number of respondents, and the number and size of latent classes. The current study, therefore, proposes a larger-scale simulation in which the quality of MPCM parameter estimation is evaluated, especially for populations in which different response styles coexist. Specifically, the recovery of latent class membership, item thresholds, and person trait levels will be examined.
The effects of the type of mixture, mixing proportions, sample size, and test length on parameter recovery are assessed. In addition to parameter recovery, the simulation study will also examine how the MPCM adjusts the latent trait estimates to compensate for the effects of different response styles on test scores. The effectiveness of information criteria for MPCM model selection is also assessed.

Given that there has thus far been no systematic simulation study of parameter recovery in the MPCM, the current study is expected to provide some evidence regarding the soundness of applying this model in empirical data analysis. In particular, the various mixture conditions of response styles simulated in this study will allow for an evaluation of the utility of the MPCM in dealing with particular response style problems in real data analysis.

Chapter 2: Literature Review

Chapter 2 starts with an introduction to the conceptual development of the Rasch model (RM) and the partial credit model (PCM) as well as the unique features of the Rasch family of models. The chapter then introduces the finite mixture distribution before presenting the model formulation of the MPCM. Estimation of the MPCM parameters and the application of information criteria for model selection are discussed. Finally, the designs and results of related empirical and simulation studies are summarized.

2.1 The Rasch Model for binary item responses

2.1.1 Presentation of the model

Item response theory (IRT) was built around the central idea that the probability of a certain answer when a person is confronted with an item ideally can be described as a simple function of the person's position on the latent trait scale and one or more parameters characterizing the particular item (Molenaar, 1995).
The Rasch model for dichotomously scored item responses (RM; Rasch, 1960) is the simplest IRT model in the sense that it needs only the difficulty of an item, which indicates the item's location on the latent trait scale, to characterize the item. This simplicity allows the RM to compare item and person parameters directly in order to define the item response probability. The following introduces the essential idea of the Rasch measurement model applied to the comparison of the difficulty of item i ($\delta_i$) and person n's trait level ($\theta_n$) on the same latent continuum.

Suppose that a specific $\delta_i$ and $\theta_n$ are located at positions $\delta_D$ and $\theta_T$ on a latent variable continuum, respectively. In addition, $P_T$ is the probability of observing an event T indicating that $\theta_T$ exceeds $\delta_D$ on the continuum. Similarly, $P_D$ is the probability of observing an event D indicating that $\delta_D$ exceeds $\theta_T$. Considering the relative locations of $\theta_n$ and $\delta_i$ on the latent continuum, $P_T$ implies the probability of a success on the item whereas $P_D$ implies the probability of a failure on the item. For dichotomous responses, $P_T$ may be replaced by $P_{ni1}$, the probability of person n scoring 1 on item i, and $P_D$ may be replaced by $P_{ni0}$, the probability of person n scoring 0 on item i. The RM then relates the distance between $\theta_n$ and $\delta_i$ on the continuum to the events T and D through the natural logarithm of the odds, as presented in Equation 1.

$$\theta_T - \delta_D = \theta_n - \delta_i = \ln\left(\frac{P_T}{P_D}\right) = \ln\left(\frac{P_{ni1}}{P_{ni0}}\right). \qquad (1)$$

As seen in Equation 1, the log odds of observing a success rather than a failure on item i are determined by the distance between $\theta_n$ and $\delta_i$. From Equation 1, one can easily verify that when $\theta_n = \delta_i$, $P_{ni1} = P_{ni0} = 0.5$. If $\theta_n > \delta_i$, the respondent's ability surpasses the difficulty level of the item, indicating a greater chance of success because the odds ($P_{ni1} / P_{ni0}$) must be greater than 1. Conversely, if $\theta_n < \delta_i$, the difficulty level of the item surpasses the respondent's ability, indicating a greater chance of failure because the odds ($P_{ni1} / P_{ni0}$) must be smaller than 1. Solving Equation 1 for $P_{ni1}$ (i.e., applying the inverse logit transformation) gives Equation 2.

$$P_{ni1} = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}, \qquad (2)$$

where $P_{ni1}$ is the probability that person n correctly answers item i, or the probability of scoring 1 on item i, $\theta_n$ is the trait level of person n, and $\delta_i$ is the difficulty of item i. This is the RM equation, which is the basic building block shared by all models within the Rasch family.

2.1.2 Item response function

Equation 2 provides a trace line that indicates the probability of a correct item response at all possible levels of $\theta$ for a given difficulty $\delta_i$. This trace line is referred to as an item response function (IRF) or item characteristic curve (ICC). Figure 2 illustrates three ICCs that the RM produces for items with $\delta_i$ = -0.5, 0, and 0.5, respectively. As can be seen in the plot, the RM ICCs differ only with respect to their locations on the continuum, indicating different levels of item difficulty. The ICCs are parallel, which indicates that the three items have the same discrimination. As mentioned in the previous section, the response probability passes through 0.5 at the point where $\theta$ equals $\delta_i$, which is the point of inflexion of the ICC.

Figure 2. ICCs corresponding to Rasch model items with different item difficulties

2.2 Partial Credit Model

2.2.1 Presentation of model

Masters (1984) proposed the partial credit model (PCM) by extending the RM to polytomously-scored item responses. The fundamental idea of the PCM is that the multiple response categories form a series of pairs of adjacent categories and that the RM can be applied to model each pair.
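The Rasch building block that the PCM applies to each pair of adjacent categories can be sketched as follows. This is a minimal illustration of Equation 2 only; the function names `rasch_prob` and `log_odds` are chosen here for convenience, and the difficulties are the three values mentioned for Figure 2.

```python
import math

def rasch_prob(theta, delta):
    """Equation 2: probability of scoring 1 given trait theta and difficulty delta."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

def log_odds(p):
    """Natural log of the odds p / (1 - p), as in Equation 1."""
    return math.log(p / (1.0 - p))

# Trace lines for the three difficulties used in Figure 2.
for delta in (-0.5, 0.0, 0.5):
    curve = [round(rasch_prob(theta, delta), 3) for theta in (-3, -1, 0, 1, 3)]
    print(f"delta = {delta:+.1f}: {curve}")

# The log odds recover the distance theta - delta (Equation 1).
print(log_odds(rasch_prob(1.2, 0.3)))
```

On the logit scale the three curves differ only by a constant shift, which is what is meant by parallel ICCs with equal discrimination.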
The PCM is appropriate for items that are subject to partial credit scoring as well as for items answered on a Likert-type scale.

Masters (1984) introduced the concept of a step when he proposed the PCM. As depicted in Figure 3, a step in an item represents the transition from one category to the next. Thus, there are k steps in an item with k + 1 response categories.

Figure 3. Five response categories and four corresponding steps

On this Likert scale, passing the kth step means selecting response category k over k-1 in response to the item. If a person chose "Agree" (response category 3), for example, he or she is regarded as having selected "Disagree" over "Strongly disagree" (first step passed), "Neither disagree nor agree" over "Disagree" (second step passed), and "Agree" over "Neither disagree nor agree" (third step passed), but as having failed to make the transition from "Agree" to "Strongly agree". In this case, the person will earn a partial credit score of 3, i.e., the number of steps that he or she has passed.

For dichotomously-scored items, there is only one pair of adjacent categories (0 or 1) and, hence, only one step needs to be passed to reach the highest score. Let us revisit Equation 2, which can now be considered a special case of the PCM in which the test items are one-step items. To make this point explicit in the model presentation, Equation 2 may be rewritten using modified notation following Masters (1984):

$$P_{ni1} = \frac{\pi_{ni1}}{\pi_{ni0} + \pi_{ni1}} = \frac{\exp(\theta_n - \delta_{i1})}{1 + \exp(\theta_n - \delta_{i1})},$$

where ($\pi_{ni0} + \pi_{ni1}$) is the probability of person n scoring 0 or 1 on item i and $P_{ni1}$ is the probability of person n passing the first step to score 1 rather than 0 on item i, conditional on only these two successive categories being considered. $\delta_{i1}$ is the first (and, in this case, the only) step difficulty.
The details of the step difficulty are introduced in the subsequent section. For the second pair of categories, the RM is applied again:

$$P_{ni2} = \frac{\pi_{ni2}}{\pi_{ni1} + \pi_{ni2}} = \frac{\exp(\theta_n - \delta_{i2})}{1 + \exp(\theta_n - \delta_{i2})},$$

where $P_{ni2}$ is the probability of person n passing the second step to score 2 rather than 1 on item i, conditional on only these two successive categories being considered. The general form of the probability that person n passes the kth step to score k rather than k-1 on item i is then defined as:

$$P_{nik} = \frac{\pi_{nik}}{\pi_{ni,k-1} + \pi_{nik}} = \frac{\exp(\theta_n - \delta_{ik})}{1 + \exp(\theta_n - \delta_{ik})}, \quad k = 1, 2, \ldots, h_i. \qquad (3)$$

Here, note that $h_i$ is used to indicate the potentially varying number of steps in different items. In the PCM, it is assumed that person n must select one of the given $h_i + 1$ categories. Therefore, the following restriction needs to be applied:

$$\pi_{ni0} + \pi_{ni1} + \cdots + \pi_{nih_i} = 1.$$

Finally, combining Equation 3 and the restriction, the PCM can be written as the unconditional probability that person n scores x on item i over all other possible scores:

$$\pi_{nix} = \frac{\exp\left[\sum_{k=0}^{x} (\theta_n - \delta_{ik})\right]}{\sum_{g=0}^{h_i} \exp\left[\sum_{k=0}^{g} (\theta_n - \delta_{ik})\right]}, \quad x = 0, 1, 2, \ldots, h_i, \qquad (4)$$

where $\sum_{k=0}^{0} (\theta_n - \delta_{ik}) \equiv 0$.

To show how Equation 4 determines a category response probability, an explicit expansion of Equation 4 is given below. The illustration calculates the response probability for the category scored 3 ($\pi_{ni3}$) when five response categories (scores 0 to 4) are given:

$$\pi_{ni3} = \frac{\exp[0 + (\theta_n - \delta_{i1}) + (\theta_n - \delta_{i2}) + (\theta_n - \delta_{i3})]}{\begin{aligned} &\exp[0] + \exp[0 + (\theta_n - \delta_{i1})] \\ &+ \exp[0 + (\theta_n - \delta_{i1}) + (\theta_n - \delta_{i2})] \\ &+ \exp[0 + (\theta_n - \delta_{i1}) + (\theta_n - \delta_{i2}) + (\theta_n - \delta_{i3})] \\ &+ \exp[0 + (\theta_n - \delta_{i1}) + (\theta_n - \delta_{i2}) + (\theta_n - \delta_{i3}) + (\theta_n - \delta_{i4})] \end{aligned}}. \qquad (5)$$

2.2.2 Threshold parameters in the PCM

Masters (1984) used the term "step difficulty" to refer to $\delta_{ik}$. The step difficulty is conceptually the same as the item difficulty in the RM.
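Equation 4 and its relation to the step probabilities of Equation 3 can be checked numerically with a short sketch. The function names and the threshold values below are illustrative assumptions, not part of the model specification; subtracting the largest exponent is only a numerical convenience and leaves the probabilities unchanged.

```python
import math

def pcm_probs(theta, thresholds):
    """Equation 4: probabilities of scores 0..h for one PCM item."""
    # Cumulative sums of (theta - delta_k); the empty sum for score 0 is 0.
    exponents = [0.0]
    for delta in thresholds:
        exponents.append(exponents[-1] + (theta - delta))
    top = max(exponents)  # subtracting the max avoids overflow in exp
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def step_prob(theta, delta):
    """Equation 3: probability of passing a step, given the two adjacent categories."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

thresholds = [-1.7, -0.6, 0.6, 1.7]  # illustrative step difficulties
theta = 0.3                          # illustrative trait level
probs = pcm_probs(theta, thresholds)
print([round(p, 3) for p in probs])

# Each adjacent pair of categories reproduces the Rasch form of Equation 3.
for k, delta in enumerate(thresholds, start=1):
    conditional = probs[k] / (probs[k - 1] + probs[k])
    print(k, round(conditional, 4), round(step_prob(theta, delta), 4))
```

The two printed columns agree for every step, which is the sense in which the PCM applies the RM to each pair of adjacent categories.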
It indicates the location of a particular step on the latent trait continuum, and the location of each threshold can be compared to the location of the person. The probability of passing a step to select a particular response category is determined by the relative positions of these two locations (i.e., step and person) on the latent continuum. In the IRT literature, several alternative terms have been used, such as category intersection (see, e.g., Embretson & Reise, 2000), category transition location (de Ayala, 2009), and threshold (Rost, 1991; von Davier & Rost, 1995). Hereafter the step difficulty $\delta_{ik}$ is referred to as the threshold.

The mean of the thresholds within an item is often used to indicate the global/general location of the given item [3]. In the current study, this is referred to as the item location. The item location $\delta_i$ is defined as follows:

$$\delta_i = \sum_{k=1}^{h_i} \delta_{ik} / h_i, \quad k = 1, 2, \ldots, h_i,$$

where $\delta_{ik}$ is the kth threshold for item i and $h_i$ is the number of thresholds of item i.

[3] The PCM can be reformulated so that the threshold is decomposed into the item location ($\delta_i$) and the difference between the threshold and the item location ($\tau_{ik}$). Equation 3 can be rewritten as follows:

$$P_{nik} = \frac{\exp(\theta_n - \delta_i - \tau_{ik})}{1 + \exp(\theta_n - \delta_i - \tau_{ik})}.$$

Among a set of k steps within a PCM item, some steps may be easier to pass than others. If a particular step is easier to pass than others, the threshold value associated with that step will be lower than those associated with more difficult steps. One of the important features of the PCM is that the model does not assume an underlying sequential step process for achieving a partial score. Although the response category scores (e.g., 0, 1, 2, 3, and 4) should be ordered to reflect increasing $\theta$ level, the estimated thresholds are not restricted to follow a specified order. When the thresholds are disordered, for example, $\delta_1$ = 0.45, $\delta_2$ = 0.74, $\delta_3$ = -0.74, and $\delta_4$ = -0.45, instead of being ordered as in the following example, $\delta_1$ = -0.74, $\delta_2$ = -0.45, $\delta_3$ = 0.45, and $\delta_4$ = 0.74, this is often called a reversal of the thresholds.

2.2.3 Category characteristic curves and the presence of response styles

Understanding how the order of the thresholds and the distances between them are related to the category response probabilities in the PCM is fundamental to simulating item response patterns contaminated by different types of response styles in the current study. Continuing from the previous sections, this section further explains the relations between thresholds and category response probabilities by introducing graphical representations of these relations.

Similar to the ICCs in Figure 2, the category response probability of a polytomous item ($\pi_{nix}$ in Equation 4) can be depicted as a trace line called a category characteristic curve (CCC) [4]. A CCC gives the probability of choosing a particular response category at a specific $\theta$ value. While only one ICC is needed for a dichotomously-scored item, as many CCCs as there are response categories are required to present the category response probabilities of a polytomously-scored item. Note that each category response probability can be calculated following Equation 5.

Figures 4 to 8 present different patterns of CCCs based on hypothetical threshold values such as might be estimated for groups of respondents with different response styles. In these CCC plots, four trace lines representing the threshold probabilities ($P_{nik}$ in Equation 3) are overlaid. The black lines present the threshold probabilities while the colored lines present the CCCs. In all of the plots, the thresholds correspond to the points of inflexion of the threshold probabilities, and those points are the intersections of two adjacent CCCs. This indicates that when a respondent's $\theta$ level is at the kth threshold, the probabilities of choosing category k and category k-1 are the same, so that the threshold probability equals 0.5.
As the item difficulty increases from k, the probability of choosing k becomes higher while the probability of choosing k-1 becomes higher as the difficulty decrease from k in this group of respondents. Ordered thresholds and the implication for response styles. In the following Figure 4, the CCCs and threshold probabilities are dictated by a set of four thresholds 1i? = -1.7, 2i? = -0.6, 3i? = 0.6, and 4i? =1.7. Apparently, the four thresholds are in a strict order from low to high values on the ? continuum and the distances between thresholds are fairly evenly spaced. The latent trait space is divided into five segments 4 CCC is sometimes called as category response curve, category response function, option response function, or operating characteristic curve. 28 within each of which one of the five categories has the greatest probability to be selected than the others. For example, respondents with the lowest level of? would be most likely to choose the response category 0 (see the CCC in orange color) while respondents within the next higher ? range, between 1? and 2? , would choose category 1 with the highest probability than any other categories (see the CCC in brown color). Figure 4 shows that every category is used properly in accordance with the respondent?s ? level. In this group of respondents, no response category is avoided and the item categories seem to function well as they are designed to differentiate individual?s trait level. Related to the issue of response styles, this pattern of CCCs and threshold probabilities is likely to be observed in a measurement situation where respondents would not present a particular response style such as ERS, MRS, or ARS but respond to the item solely conditional on their ? level. This ?normal? responding pattern is referred to as ordinary response style (ORS) in the current study. 29 Figure 4. 
CCC and threshold probabilities for a PCM item with thresholds (-1.7, -0.6, 0.6, and 1.7) Figure 5 also shows a set of ordered thresholds ( 1? = -1.85, 2? = -1.24, 3? = 1.34, and 4? =1.95) but compared to Figure 4, the distances between thresholds are uneven. The distance between the second and third threshold is longer than the distances between other thresholds, which links to the relatively high probability for category 2 to be selected within a wide range of values on the ? continuum. In this plot, Category 1 and 3 are still the most favorable category within the ranges from 1? to 2? and from 3? to 4? , respectively. The pattern of CCCs in Figure 5 may be observed in a sample of MRS respondents. -3 -2 -1 0 1 2 3 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 ?1 = -1.70, ?2 = -0.60, ?3 = 0.60, ?4 = 1.70 Pr o ba bi lit y o f R e sp o n se ?1 ?2 ?3 ?4 P1 P2 P3 P4 ?0 ?1 ?2 ?3 ?4 ? 30 If the distance between 2? and 3? becomes longer, in other words, if the number of people in the sample who select the middle category increases, then the CCC for the middle category will peak more distinctively and the order of 1? and 2? as well as that of 3? and 4? can be reversed. An illustrative plot is shown in Figure 6 in which a larger proportion of respondents respond to the middle category and accordingly the reversals occur. Figure 5. CCCs and threshold probabilities for a PCM item with thresholds (-1.85, - 1.24, 1.34, and 1.95) 0. 0 0. 2 0. 4 0. 6 0. 8 1. 0 ?1 = -1.85, ?2 = -1.24 ?3 = 1.34 ?4 = 1.95 Pr o ba bi lit y o f R e sp o n se ?1 ?2 ?3 ?4-3 -1 0 1 3 P1 P2 P3 P4 ?0 ?1 ?2 ?3 ?4 ? 31 Figure 6. CCCs and threshold probabilities for a PCM item with thresholds (-2.01, - 2.45, 2.45, and 2.01) Reversed thresholds and the implication for response styles. The following Figure 7 shows a dramatically different array of CCCs and threshold probabilities from the previous figures. In this case, a reversal occurs ( 1? = 0.45, 2? = 0.74, 3? = - 0.74, and 4? 
= -0.45), and the latent trait space is predominantly taken by the first and the last CCCs. Categories 1, 2, and 3 are never the most likely category to be selected at any $\theta$ level. If a respondent in this sample has a level of $\theta$ higher than zero (i.e., the point where the first and the fifth CCCs intersect), category 4 has the highest probability of being chosen. Conversely, if a respondent has a level of $\theta$ lower than zero, category 0 has the highest probability of being selected. Categories 1, 2, and 3 will rarely be selected. If estimated CCCs show this pattern of distortion, this may be evidence that item responses from this sample of respondents are contaminated by ERS.

Figure 7. CCCs and threshold probabilities for a PCM item with thresholds (0.45, 0.74, -0.74, and -0.45)

The last example, presented in Figure 8, depicts CCCs with $\tau_{i1}$ = -1.51, $\tau_{i2}$ = -1.63, $\tau_{i3}$ = -2.42, and $\tau_{i4}$ = -0.93 and the corresponding threshold probabilities. For this item, reversals also occur, and categories 1 and 2 do not have the highest probability of being selected at any level of $\theta$. The extremely high response probability for category 4 results in all item thresholds being located at lower levels on the $\theta$ continuum. This can happen when the item content is too easy for the respondents and, therefore, most of the respondents pass the highest threshold. Irrespective of the difficulty of the item content, however, if a group of respondents manifests acquiescent response style (ARS) in response to the item, this pattern of CCCs can also occur.

The CCC plots illustrated above show that threshold distances contain important information about response category use. As a rule, if the threshold parameters are ordered within an item, every response category is the most likely option at at least one $\theta$ level.
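The rule relating threshold order to modal categories can be checked numerically. The following is a minimal Python sketch, assuming NumPy is available, that evaluates the PCM category probabilities on a grid of trait values for the thresholds quoted for Figures 4 and 7. The function names `pcm_category_probs` and `modal_categories` and the grid from -4 to 4 are illustrative assumptions; they are not part of mdltm or of the analyses reported in this study.

```python
import numpy as np

def pcm_category_probs(theta, thresholds):
    """Category probabilities for one PCM item at trait level theta."""
    thresholds = np.asarray(thresholds, dtype=float)
    # Cumulative sums of (theta - tau_k); category 0 has an empty sum of 0.
    exponents = np.concatenate(([0.0], np.cumsum(theta - thresholds)))
    exponents -= exponents.max()  # stabilise before exponentiating
    weights = np.exp(exponents)
    return weights / weights.sum()

def modal_categories(thresholds, grid=np.linspace(-4.0, 4.0, 801)):
    """Categories that are the most likely response somewhere on the grid."""
    return sorted({int(np.argmax(pcm_category_probs(t, thresholds))) for t in grid})

ordered = [-1.7, -0.6, 0.6, 1.7]         # thresholds quoted for Figure 4
disordered = [0.45, 0.74, -0.74, -0.45]  # thresholds quoted for Figure 7

print(modal_categories(ordered))
print(modal_categories(disordered))
```

Under these assumptions, the first call returns all five categories and the second returns only categories 0 and 4, which agrees with the descriptions of Figures 4 and 7.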
When the thresholds are ordered, each response category is linked to an area on the latent continuum where it has a larger response probability than the other categories. In contrast, disordered thresholds indicate that certain response categories are avoided or that the relation between trait and category choice is improperly specified. In this case, for one or more categories there is no area in which their CCC is higher than the CCCs of the other categories.

Figure 8. CCCs and threshold probabilities for a PCM item with thresholds (-1.51, -1.63, -2.42, and -0.93)

2.3 Unique Features of the Rasch Models

The models in the Rasch family are distinguished from other IRT models by a fundamental statistical characteristic: separable person and item parameters and, hence, sufficient statistics (Masters & Wright, 1984). A sufficient statistic is said to exist when no other information from the data is required to estimate a parameter. Suppose an $N \times I$ data matrix (N is the number of people and I is the number of items) with elements $x_{ni}$ being 0 or 1 for the dichotomous Rasch model or the number of thresholds passed for polytomous Rasch models. Then, the total score (i.e., the row sum of the data matrix, $r_n = \sum_{i=1}^{I} x_{ni}$) is a sufficient statistic for the estimation of person trait parameters ($\theta_n$), and the item score (i.e., the column sum, $\sum_{n=1}^{N} x_{ni}$) is a sufficient statistic for the estimation of item difficulty parameters. Once the sufficiency of total scores is established, the unknown parameter $\theta$ can be eliminated by conditioning on the person's total score $r$ during item parameter estimation. All response vectors (patterns) that yield the same total score $r$ have the same trait estimate.
Therefore, increasing the sample size does not increase the number of person parameters to be estimated, and item characteristics do not have an impact on trait estimation. Consequently, the consistency of item parameter estimates can be achieved. Likewise, once the sufficiency of item scores is established, the item parameters are eliminated by conditioning on the observed vector of item scores. This means that under the PCM, a simple count of respondents passing each threshold of an item contains all the information about the threshold difficulty.

2.4 Mixture Distribution Models

The model of interest in the current study, the mixture partial credit model (MPCM: Rost, 1991; von Davier & Rost, 1995), can be viewed as a generalization of the PCM to a finite mixture distribution model. In this section, the mixture distribution is introduced, followed by the latent class model (LCM), which is the simplest discrete mixture distribution model and is closely related to the MPCM. Lastly, the general idea of integrating the IRT and LC models is discussed.

2.4.1 Continuous and discrete mixture distribution

A mixture distribution refers to a composite of several subpopulation distributions (see, e.g., McLachlan & Peel, 2000). The basic assumption of a model based on a mixture distribution is that the distribution of an observed random variable is not adequately described by a single probability function, but by a number of conditional probability functions. In a research setting where the observed sample can be seen as being drawn from two or more subpopulations with distinctive features, a mixture distribution model can represent this heterogeneity by combining conditional probability functions across subpopulations. These subpopulations are alternatively called mixture components or latent classes. A mixture distribution can be either continuous or discrete depending on the nature of the mixing variable on which the probability is conditioned.
In a general form, the continuous mixture distribution can be presented as follows:

$$f(\mathbf{x}) = \int_{-\infty}^{\infty} f(\mathbf{x} \mid \theta)\, d\theta,$$

where $f(\mathbf{x})$ is the unconditional probability density of an I-dimensional random vector $\mathbf{x} = \{x_1, \ldots, x_I\}$ and is obtained by integrating over the component densities $f(\mathbf{x} \mid \theta)$ conditional on a continuous mixing variable $\theta$. The previously reviewed RM and PCM can be viewed as continuous mixture models in which the individual latent trait ($\theta$) is a real-valued mixing variable and the component densities $f(\mathbf{x} \mid \theta)$ are defined by the logistic function.

If the mixing variable is discrete, only a finite number of component distributions are produced (i.e., as many as the number of latent classes) and the unconditional probability becomes a weighted sum. The general form is specified as:

$$f(\mathbf{x}) = \sum_{c=1}^{C} \pi_c f(\mathbf{x} \mid c), \tag{6}$$

where c is a discrete mixing variable whose arbitrary value $c \in \{1, \ldots, C\}$ identifies each respondent's latent class membership, $f(\mathbf{x} \mid c)$ is the component distribution conditional on latent class membership c, and $\pi_c$ are the relative sizes of the latent classes, called mixing proportions, which are constrained so that $0 \le \pi_c \le 1$ and $\sum_{c=1}^{C} \pi_c = 1$. In most cases, the component distributions take on a common parametric form but have their own sets of parameters.

When data are analyzed using a discrete mixture distribution, the nature of the mixing variable does not need to be specified a priori. It is a hidden structure, so the existence of valid latent classes is explored during the estimation process, and each respondent is assigned to one of the identified latent classes according to similarity among respondents. This flexible, exploratory capability of discrete mixture distribution models provides a way to decompose unobserved heterogeneity that would not be detected and modeled within non-mixture models.

2.4.2 Latent class model

The latent class model (LCM: Lazarsfeld & Henry, 1968) is the simplest finite mixture distribution model for item responses.
The main purpose of using an LCM is to infer unobserved groups that differ in a qualitative sense. Individuals within the same latent class are assumed to behave similarly on the relevant behavior, while members of different classes are assumed to behave differently.

Before presenting the model formulation of the LCM, a brief comparison of LCMs with IRT models is useful for a better understanding of both. First, both IRT and LC models relate a set of item responses to a latent trait variable. Also, the manifest variables (i.e., item responses) are treated as discrete variables in both models. The major difference between the two models, however, revolves around the conceptualization of the person trait distribution. IRT models assume that the person trait is continuous and provide measures of the trait on a single latent continuum. In addition, respondents in a sample are assumed to come from a single, qualitatively homogeneous distribution and, thus, the respondents differ in a quantitative sense. In the LCM, on the other hand, the respondents differ in a qualitative sense. LCMs treat the person trait as a discrete variable and provide mutually exclusive and exhaustive latent class membership. Within each latent class there is no variation in the item response probability.

The general LCM can be presented by specifying the component distribution with the joint probability function of item responses under the local independence assumption:

$$p(\mathbf{x}) = \sum_{c=1}^{C} \pi_c \prod_{i=1}^{I} p_{ic}^{x_i} (1 - p_{ic})^{1 - x_i},$$

where $p(\mathbf{x})$ is the probability of a response pattern on items i = {1, ..., I}, $\pi_c$ are the mixing proportions, and $p_{ic}^{x_i}$ and $(1 - p_{ic})^{1 - x_i}$ are the probabilities of a success and a failure on item i in class c, respectively. Both $\pi_c$ and $p_{ic}$ are model parameters to be estimated.

2.4.3 Mixture IRT models

By integrating a standard IRT model with the LCM, a mixture IRT model is obtained.
The integration means that the response probability is now conditional on both the respondent's continuous trait distribution (following the IRT models) and the discrete trait distribution (following the LC models). Therefore, the unconditional probability of an item response pattern $\mathbf{x}$ for the mixture IRT model is:

$$p(\mathbf{x}) = \sum_{c=1}^{C} \pi_c \int \prod_{i=1}^{I} p(x_i \mid \theta, c)\, f_c(\theta)\, d\theta, \tag{7}$$

where $f_c(\theta)$ is the class-specific trait distribution and the items have class-specific parameters.

This integration relaxes assumptions of both models that can limit their utility in applications. Specifically, the IRT model assumption that respondents in a sample belong to a qualitatively homogeneous distribution is relaxed. Mixture IRT models accommodate heterogeneous subpopulations by allowing item and/or person parameters to vary across latent classes. The differences observed in item and person parameter estimates across latent classes may provide the ground on which the nature of population heterogeneity can be interpreted. Also, the LCM assumption that the response probability is the same within a latent class is relaxed. In mixture IRT models, each respondent is assigned an estimated latent trait level as well as a latent class membership. In sum, in mixture IRT models, an IRT model holds within each subpopulation, but a different set of item and person parameters can be estimated in each subpopulation. Mixture IRT models thus provide a statistical tool to detect and simultaneously model two types of population heterogeneity, i.e., quantitative differences on a continuous latent variable and qualitative differences on a discrete variable.

Exploration of qualitative individual differences. The major utility of mixture IRT models has been found in their capability to simultaneously model quantitative and qualitative differences among individuals.
In previous studies employing different mixture IRT models, researchers identified qualitatively distinguishable latent groups in several realms of study. In cognitive assessments, Rost (1990) applied the mixture Rasch model (MRM) and identified two latent classes whose members differed in their relative strength in the subject contents of a physics test. A random guessing group was detected in a low-stakes achievement test using a mixture 2-PL model (Lau, 2009), in a mathematics proficiency test using the MRM (Subedi, 2009), and in a reading proficiency test using a mixture Rasch model (Mislevy & Verhelst, 1990). A latent class whose members present speededness at the end items of a test was separated using a mixture distribution version of Bock's nominal response model in the study by Bolt, Cohen, and Wollack (2002). Mislevy and Verhelst (1990) suggested a mixture Rasch model with theory-based item parameter structures to detect problem-solving strategies.

In non-cognitive assessment, Reise and Gomel (1995) applied the MRM to analyze personality scale data and found a structural difference in the personality factors between two latent classes. In the analysis of rating scale item responses, the characteristics of latent classes were interpreted in terms of different faking tendencies (Zickar et al., 2004), self-disclosure patterns (Maij-de Meij et al., 2005), structures of personality factors (Egberink et al., 2010; Rost et al., 1997), and response styles (Austin et al., 2006; Gollwitzer et al., 2005; Meiser & Machunski, 2008; Rost, 1991; Rost et al., 1997; Smith et al., 2012).

2.5 Mixture Partial Credit Model

2.5.1 Presentation of model

As explained previously, the MPCM can be derived by integrating the LCM and the PCM. The model equation of the MPCM defines the probability of an item response pattern $\mathbf{x}$, specified as Equation 7:

$$p(\mathbf{x}) = \sum_{c=1}^{C} \pi_c \prod_{i=1}^{I} \frac{\exp\left(\sum_{k=0}^{x} (\theta_c - \tau_{ikc})\right)}{\sum_{g=0}^{h_i} \exp\left(\sum_{k=0}^{g} (\theta_c - \tau_{ikc})\right)}, \quad x = 0, 1, 2, \ldots, h_i, \tag{7}$$

where $p(\mathbf{x})$ is the unconditional probability of an item response pattern $\mathbf{x}$, $\pi_c$ is the mixing proportion with the constraints $0 \le \pi_c \le 1$ and $\sum_{c=1}^{C} \pi_c = 1$, and $\sum_{k=0}^{0} (\theta_c - \tau_{ikc}) \equiv 0$. Note that $\theta_c$ and $\tau_{ikc}$ are now the class-specific person trait and threshold parameters, respectively. For model identification, the threshold parameters $\tau_{ikc}$ are constrained so that $\sum_{i=1}^{I} \sum_{k=0}^{x} \tau_{ikc} = 0$ for all c.

2.5.2 Parameter estimation

Section 2.3 explained a particular feature of the Rasch family of models, i.e., the sufficiency of the total scores for the estimation of $\theta$. The total scores ($r_n$) obtained from a sample are simply used to eliminate the person parameters ($\theta_n$) in estimating item parameters. The property of the sufficient statistic, however, cannot be applied as straightforwardly to the mixture Rasch models as it can to the RM and PCM, because the latent classes are not known and thus the total scores in each class are not directly observable. As a solution, an estimated quantity $\pi_{r|c}$, namely the latent score probability, which is the probability of a total score appearing in a class, needs to be introduced. This probability is treated as a model parameter and estimated along with the other model parameters. Given that the number of parameters needed to estimate the latent score distribution grows quickly as the number of classes and items increases, parsimonious ways to approximate it have been proposed. The software mdltm (multidimensional discrete latent traits models: von Davier, 2005a), which is used for parameter estimation in the current study, uses a 2-parameter log-linear smoothing approach to parameterize this score distribution. Applying this approach, the distributional model-based score probability ($\hat{\pi}_{r|c}$) can be obtained:

$$\hat{\pi}_{r|c} = \frac{\exp\left(\frac{r}{r_{\max}} \mu_c + \frac{4 r (r_{\max} - r)}{r_{\max}^2} \sigma_c\right)}{\sum_{s=0}^{r_{\max}} \exp\left(\frac{s}{r_{\max}} \mu_c + \frac{4 s (r_{\max} - s)}{r_{\max}^2} \sigma_c\right)}, \tag{8}$$

where $r = 0, \ldots, r_{\max}$, $\mu_c$
is the location parameter indicating the average of the total scores $r$, and $\sigma_c$ is the variability of that distribution. The obtained score probabilities provide a smoother distribution of expected score frequencies, which will be replicated in approximately identical shape in different samples of respondents. This distribution is flexible in terms of the shape that it can take on, so that various shapes of score distributions can be modeled. One of the benefits of introducing this distributional approximation, which uses only two parameters, is that it prevents the penalty term of the information criteria for model selection from increasing unnecessarily. The details related to this issue of model selection are further addressed in Section 2.5.4. More details about this logistic model for score frequencies can be found in Rost and von Davier (1995) and Rost (1997).

In mdltm, the Expectation-Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is implemented to obtain marginal maximum likelihood (MML) estimates. The MML method makes use of the following factorization:

$$p(\mathbf{x} \mid c) = p(\mathbf{x} \mid r, c) \cdot p(r \mid c), \tag{9}$$

where $r = \sum_{i=1}^{I} x_i$ is the total score and the conditional total score probability $p(r \mid c)$ is replaced with the estimated $\hat{\pi}_{r|c}$ as explained above. By applying the property of the sufficient statistic, the pattern probability conditional on the total score, instead of on the estimated $\theta$, can be obtained as follows:

$$p(\mathbf{x} \mid r, c) = \frac{\exp\left(-\sum_{i=1}^{I} \tau_{i x_i c}\right)}{\gamma_{r|c}(\exp(-\boldsymbol{\tau}_c))}, \tag{10}$$

where the denominator $\gamma_{r|c}(\exp(-\boldsymbol{\tau}_c))$ is a class-specific symmetric function of the thresholds. It sums over all possible combinations of item parameters that yield a given total score and is also required in the E-steps of the item parameter estimation for computing the expected pattern frequencies. An illustration of this computation for the RM difficulty parameters can be found in Baker and Kim (2004, Ch. 5).
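The role of the symmetric function in Equation 10 can be illustrated for the simpler dichotomous RM case. The following Python sketch computes the elementary symmetric functions with the standard summation recursion and uses them to evaluate the probability of a response pattern conditional on its total score. It is a sketch under stated assumptions rather than the routine implemented in mdltm: the item difficulties are hypothetical, the function names are illustrative, and only a single class is considered.

```python
from itertools import product
import math

def elementary_symmetric(eps):
    """gamma[r]: sum over all item subsets of size r of the product of eps."""
    gamma = [1.0] + [0.0] * len(eps)
    for e in eps:
        # Update from high to low order so that each item enters at most once.
        for r in range(len(eps), 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def conditional_pattern_prob(pattern, difficulties):
    """P(x | r) for the dichotomous Rasch model; theta is conditioned out."""
    eps = [math.exp(-b) for b in difficulties]
    gamma = elementary_symmetric(eps)
    r = sum(pattern)
    numerator = math.exp(-sum(b for x, b in zip(pattern, difficulties) if x == 1))
    return numerator / gamma[r]

difficulties = [-1.0, 0.0, 1.0]  # hypothetical item difficulties

# The conditional probabilities of all patterns with total score 2 sum to 1.
total = sum(conditional_pattern_prob(p, difficulties)
            for p in product([0, 1], repeat=3) if sum(p) == 2)
print(round(total, 10))
```

Because the person parameter does not appear anywhere in `conditional_pattern_prob`, the sketch also shows why conditioning on the total score removes it from item parameter estimation.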
Finally, the full formulation of the pattern probability with the person parameter eliminated is as follows:

$$p(\mathbf{x}) = \sum_{c=1}^{C} \pi_c\, \hat{\pi}_{r|c}\, \frac{\exp\left(-\sum_{i=1}^{I} \tau_{i x_i c}\right)}{\gamma_{r|c}(\exp(-\boldsymbol{\tau}_c))}.$$

E-steps. In the expectation steps, the expected pattern frequencies in each latent class are computed on the basis of the observed pattern frequencies and preliminary estimates of the threshold parameters. Randomly selected values can be used as starting parameter values for the first iteration. For the subsequent iterations, the estimates from the previous M-step are used. The expected class-specific pattern frequency $\hat{n}(\mathbf{x} \mid c)$ is the observed pattern frequency multiplied by the ratio of the weighted pattern probability in a class, $\pi_c\, p(\mathbf{x} \mid c)$, to the unconditional pattern probability $p(\mathbf{x})$:

$$\hat{n}(\mathbf{x} \mid c) = n(\mathbf{x})\, \frac{\pi_c\, p(\mathbf{x} \mid c)}{p(\mathbf{x})},$$

where $n(\mathbf{x})$ is the observed frequency of response pattern $\mathbf{x}$, the conditional pattern probability $p(\mathbf{x} \mid c)$ is defined by Equations 8, 9, and 10, and $p(\mathbf{x})$ is the unconditional pattern probability, i.e., $\sum_{c=1}^{C} \pi_c\, p(\mathbf{x} \mid c)$.

M-steps. The expected pattern frequencies for each latent class obtained from the E-step are used in the M-step to compute the estimates of the model parameters $\pi_c$, $\pi_{r|c}$, and $\tau_{ikc}$. These parameters are estimated separately for each class by maximizing the log-likelihood function of class c, which may be specified as follows:

$$\ln L_c = \sum_{\mathbf{x}} \hat{n}(\mathbf{x} \mid c) \left[ \ln \hat{\pi}_{r|c} - \sum_{i=1}^{I} \tau_{i x_i c} - \ln\left(\gamma_{r|c}(\exp(-\boldsymbol{\tau}_c))\right) \right].$$

Setting the first derivative with respect to the threshold parameter to zero and solving yields the (revised) estimate for threshold k on item i in class c:

$$\hat{\tau}_{ikc} = -\ln \frac{n_{ikc}}{\sum_{r=0}^{r_{\max}} m_{r|c}\, \gamma_{r-1,i|c} / \gamma_{r|c}},$$

where $n_{ikc}$ is the preliminary estimate of the number of individuals responding in category k on item i in class c, $m_{r|c}$ is the number of individuals with score $r$ in class c, and $\gamma_{r-1,i|c}$ are the symmetric functions of order $r - 1$ of all item parameters except those of item i in class c.
This symmetric function is calculated iteratively by means of the preliminary threshold parameter estimates and the revised estimates in each M-step. The estimates of the mixing proportions ($\hat{\pi}_c$) and the conditional score probabilities ($\hat{\pi}_{r|c}$) do not need to be calculated iteratively. They can simply be calculated as follows:

$$\hat{\pi}_c = \frac{n_c}{N}, \qquad \hat{\pi}_{r|c} = \frac{m_{r|c}}{n_c},$$

where $n_c$ is the number of respondents in class c.

$\theta$ estimation. During the item parameter estimation, the trait parameter $\theta$ has been eliminated from the equations. In the final stage of the estimation, the unknown parameter $\theta$ can be estimated by iteratively solving the following estimation equation:

$$r_n = \sum_{i=1}^{I} \frac{\exp(\theta_{cn} - \hat{\tau}_{ikc})}{1 + \exp(\theta_{cn} - \hat{\tau}_{ikc})},$$

where $\hat{\tau}_{ikc}$ is the final estimate of the kth threshold for item i in class c. Respondent n has the trait estimate $\hat{\theta}_{cn}$ under the condition that he or she belongs to class c; hence, each respondent has as many $\hat{\theta}_{cn}$ as there are classes. However, these conditional trait estimates of a single individual usually do not differ much from one class to another because the estimates depend mainly on $r_n$, which is the same in all classes (Rost, 1997).

2.5.3 Assigning latent class membership

As the outcome of the simultaneous modeling of a continuous and a discrete latent variable, each respondent is assigned latent trait estimates as well as probabilities of membership in each latent class. The probability of class membership can be estimated using Bayes' theorem:

$$p(c \mid \mathbf{x}) = \frac{\pi_c\, p(\mathbf{x} \mid c)}{\sum_{c=1}^{C} \pi_c\, p(\mathbf{x} \mid c)},$$

where $p(c \mid \mathbf{x})$ is the posterior probability of class membership c given the item response pattern $\mathbf{x}$. Note that the mixing proportion plays the role of the prior probability in Bayes' theorem, the estimated conditional pattern probability $p(\mathbf{x} \mid c)$ takes the place of the likelihood, and the denominator is the total probability. The actual classification is carried out by first using Bayes'
theorem to compute the estimated probability of class membership given each response pattern. Then, respondents may be assigned to the latent class for which the conditional probability of their membership is largest.

2.5.4 Determining the number of latent classes

In the MPCM formulation, the number of latent classes (C) is not a model parameter and, thus, must be specified before initiating the parameter estimation process. Under conditions of uncertainty about the "true" number of unknown subpopulations, the commonly used technique for determining the number of latent classes is to compare the likelihood functions of competing models with increasing numbers of latent classes and then choose the model that an information-criterion fit index indicates as the best-fitting model for the data. Although significance tests are not possible with these indices, comparing the index values for competing models provides some degree of evidence for the nature of the trait variable structure.

Information criteria. Many information criterion statistics have been developed under the minimum complexity criteria. Frequently cited information criteria include Akaike's information criterion (AIC: Akaike, 1974), the Bayesian information criterion (BIC: Schwarz, 1978), and the consistent AIC (CAIC: Bozdogan, 1987). These three statistics are the ones provided by mdltm. The AIC index can be calculated for each of the H different models being compared:

$$AIC_h = -2\ln(L_h) + 2\,Par_h,$$

where $L_h$ is the maximum of the likelihood function of the hth model and $Par_h$ is the number of independent parameters that are estimated when fitting the hth model to the data. In comparing competing models, the model h that shows the minimum AIC value is chosen as the model that best fits the data and is therefore considered the preferred model. The equation shows that when two models have similar maximum likelihood values ($L_h$), a smaller value of AIC will be associated with the model based on fewer parameters.
In this way, the AIC prefers a model with less complexity, in other words, a more parsimonious model. A criticism of the AIC is that it lacks asymptotic consistency because its definition does not directly involve the sample size. Consequently, as the sample size increases, a more complex model would be more likely to be selected based on the AIC.

Schwarz (1978) developed the BIC, which is an asymptotically consistent measure. The BIC may be computed as follows:

$$BIC_h = -2\ln(L_h) + \ln(N) \cdot Par_h,$$

where N denotes the sample size. In the same way as for the AIC, the model h that shows the minimum BIC value is chosen as the preferred model. Note that the penalty term for the BIC is larger than that for the AIC if the sample size N is 8 or greater, which can be seen from the fact that $\ln(8) = 2.08$. Therefore, for reasonably sized samples, the BIC tends to select less complex models (i.e., solutions with a smaller number of classes) than does the AIC.

Bozdogan (1987) extended the AIC to make it asymptotically consistent and to penalize over-parameterization more stringently. The CAIC index is computed as follows:

$$CAIC_h = -2\ln(L_h) + (\ln(N) + 1) \cdot Par_h.$$

Compared to the AIC and BIC, the penalty term for the CAIC is even larger, leading to solutions that favor less complex models than those obtained with the AIC or BIC.

Based on the specific penalty weights, it is expected that different information criterion statistics may lead to different solutions in mixture IRT models. The preference of the AIC for a more complex model may result in over-identification problems under certain conditions, whereas the preference of the BIC and CAIC for a less complex model may cause under-identification problems.
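How the three penalty terms can lead to different choices may be seen in a small numerical sketch. The Python code below evaluates the AIC, BIC, and CAIC formulas given above for three competing models. The log-likelihoods, parameter counts, and sample size are hypothetical values chosen only for illustration; they are not results from this study or output from mdltm, and the function name `information_criteria` is likewise an assumption.

```python
import math

def information_criteria(log_lik, n_params, n_obs):
    """Return AIC, BIC, and CAIC for one fitted model."""
    deviance = -2.0 * log_lik
    return {
        "AIC": deviance + 2.0 * n_params,
        "BIC": deviance + math.log(n_obs) * n_params,
        "CAIC": deviance + (math.log(n_obs) + 1.0) * n_params,
    }

# Hypothetical fits: number of classes -> (log-likelihood, number of parameters).
fits = {1: (-41250.0, 41), 2: (-40620.0, 84), 3: (-40575.0, 127)}
n_obs = 3000  # hypothetical sample size

criteria = {c: information_criteria(ll, par, n_obs) for c, (ll, par) in fits.items()}

# For each criterion, the preferred model is the one with the minimum value.
best = {name: min(criteria, key=lambda c: criteria[c][name])
        for name in ("AIC", "BIC", "CAIC")}
print(best)
```

With these hypothetical values, the AIC prefers the three-class model while the BIC and CAIC prefer the two-class model, because the small gain in log-likelihood from the third class does not offset their heavier penalties.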
The relative effectiveness of information criteria has been investigated via simulation studies, in which the true conditions are known and hence it is possible to monitor the behavior of information criterion statistics in identifying the correct model.

Model selection in mixture IRT models. There are a limited number of simulation studies on model selection indices in mixture IRT models, and all of those studies examined only models for dichotomous responses. No study has thus far investigated the problems of model selection in mixture polytomous IRT models. The following presents the findings from the studies of dichotomous models.

The first study to appear in the literature was by Li, Cohen, Kim, and Cho (2009), in which a Bayesian estimation approach was used. Their study investigated five different model selection indices, including the AIC and BIC, and compared their relative effectiveness under the 1-, 2-, and 3-PL models with one, two, three, or four latent classes. In general, the results showed that the BIC performed best in detecting the correct number of latent classes. For 1-, 2-, and 3-class simulated data, the BIC identified the correct number of classes in every case. However, when the simulated data had four classes, it became more difficult for the BIC to distinguish the correct model for the 3-PL model. In this case, the BIC tended to select the simpler model. The results for the AIC showed that it selects more complicated models, particularly when the true model is the 1-PL model.

Cho, Jiao, and Macready (2012a, 2012b) investigated the relative effectiveness of the AIC and BIC in the context of mixture Rasch and mixture 2-PL models with two classes when marginal maximum likelihood estimation was applied. The studies manipulated qualitative heterogeneity in various ways by setting different item parameter profiles across latent classes and evaluated the correct model selection rates.
When more distinctive heterogeneity was generated between the two classes, causing class separation to be large, the BIC selected the correct model almost perfectly. Under the conditions where the manipulated heterogeneity was small, the BIC under-extracted latent classes while the AIC still tended to over-extract latent classes. Preinerstorfer and Formann (2012) reported similar results within a conditional maximum likelihood estimation context. They found that the BIC generally performed more accurately than the AIC and that longer test length was positively associated with the correct model selection rate.

2.6 Applications of the MPCM to Study of Response Styles

In Sections 1.2 and 2.4.3, previous empirical studies in which mixture IRT models were employed were briefly introduced. In Section 2.6.1, the findings of the empirical studies related to differences in response category use and the correction of test score bias are reviewed. Section 2.6.2 summarizes the previous simulation study that investigated the model performance of the MPCM.

2.6.1 Real data analysis

Rost et al. (1997) applied the MPCM to the analysis of the NEO-FFI scales and reported the results for the Conscientiousness (C) and Extraversion (E) scales. For the C scale, the item locations across the two identified latent classes were not significantly different, which indicated that the items measured the same psychological construct across the latent classes. However, when the thresholds were examined, the larger latent class ($\pi$ = 0.67) showed a set of ordered and relatively evenly spaced thresholds for all items, while in the smaller latent class ($\pi$ = 0.33) the first threshold distance was about four times larger than the second threshold distance.
The threshold distances in the smaller class indicated that it was very easy to pass the first threshold and very hard to pass the last threshold and, hence, that most people in this class responded in the middle categories and avoided the extreme categories. Integrating these findings on item locations and threshold distances, the authors concluded that the difference characterizing the two latent classes was not in the conscientiousness construct but in the respondents' differential use of response categories. When the E scale was analyzed, however, a structural difference in the personality construct as well as the response style difference emerged. The comparison of the item locations based on a two-class model solution revealed that the two identified latent classes reflected a structural difference between sociability and impulsivity. The subsequent MPCM analyses were conducted for these two classes separately, and the same pattern of threshold differences as was found for the C scale was manifested.

Eid and Rauber (2000) applied the MPCM to analyze data from an organizational survey and demonstrated how mixture models could be used to detect a lack of measurement invariance caused by response styles. In their analysis, a two-class solution was selected as the best-fitting model based on the BIC. The item location parameters did not differ much between the two latent classes. The differences were observed with respect to the threshold parameters. In the larger latent class (Class 1 with $\pi$ = 0.71), all thresholds were ordered, indicating that the members of this class used the rating scale in the expected way. Similar to the case depicted in Figure 4, each response category corresponded to an area on the latent continuum for which its response probability was larger than the probabilities of the other categories. In the smaller latent class (Class 2 with $\pi$
= 0.29), the first two thresholds were disordered for all items and the threshold distances were much smaller than in Class 1. Therefore, the members of Class 2 were characterized as extreme respondents. Eid and Rauber (2000) also investigated whether latent classes differing in their response styles could be characterized by external variables including age, sex, length of service, length of service in the same position, and leadership level. The results showed that a significantly larger proportion of female employees belonged to Class 2. In addition, relatively new employees belonged significantly less frequently to Class 1. People who had been working longer than 10 years in the same position had a higher probability of belonging to Class 2. Finally, employees at different leadership levels showed differences in the probability of belonging to each latent class.

Gollwitzer et al. (2005) applied the MPCM to analyze the three anger expression subscales (Anger-in, Anger-out, and Anger-control) of the State-Trait Anger Expression Inventory (STAXI; Spielberger, 1988) obtained from patients hospitalized in a psychosomatic clinic. They observed considerable differences in response styles, which were similar to the differences in non-clinical samples. The largest latent class (Class 1) exhibited ordered and evenly spaced thresholds for both gender groups and for all scales, indicating an appropriate use of response categories. It was also shown that respondents who were assigned to Class 1 on one scale were likely to be assigned to Class 1 on the other scales. The second latent class (Class 2) for the female sample presented partly disordered thresholds and narrower threshold distances. Logistic regression analyses were conducted to predict latent class membership using various personality variables measured by the Freiburg Personality Inventory (FPI-R; Fahrenberg, Hampel, & Selg, 1989).
The regression results provided some evidence that a social desirability tendency accounted for the response styles identified in Class 2. Gollwitzer et al. (2005) argued that it was not reasonable to compare all individuals quantitatively with respect to their sum scores, which was the scoring method instructed in the STAXI's handbook (Spielberger, 1988). They suggested a more appropriate scoring strategy that required a two-step procedure. In the first step, individuals would have to be assigned to a latent class in order to qualify differential response styles. They could then be compared with each other within their latent class. In the second step, class-specific person parameters could be compared across latent classes under the premise that the same trait is being measured in all classes.

Zickar et al. (2004) conducted an experimental study in which respondents were coached to respond honestly or to fake positively on a personality inventory. They analyzed the item responses from the experimental sample with the MPCM and found that the honestly responding group exhibited properly ordered thresholds and much lower item-level scores than the "faking group". For the faking group, the thresholds were disordered and the difference between the first and second thresholds was much smaller than the difference in the honestly responding group, indicating that few individuals in this group chose the first and second categories. Zickar et al. (2004) also compared the item responses on the Personal Preferences Inventory (PPI; Personnel Decisions International, 1997) between an applicant group and an incumbent group in an organization. Their MPCM analysis showed that 27.6% of the applicants were in the extreme faking class, whereas 13.7% of the incumbents belonged to this class. Conversely, 26.5% of the applicants were in the honestly responding class.
These findings suggested that the typical applicant-incumbent comparison, which assumes that applicants fake and incumbents respond honestly, had been too restrictive.

Smith et al. (2012) analyzed data from the Beliefs and Attitudes About Memory Survey (BAMS; Brown, Garry, Silver, & Loftus, 1997) with mixture Rasch models to investigate the functioning of the "Neutral" category (i.e., the middle category) by examining the threshold ordering. Smith et al. (2012) pointed out that disordered thresholds occur (i) when the rating scale includes more categories than the respondents can reliably distinguish, (ii) when some rating categories are unlabeled, or (iii) when the rating scale includes a middle point labeled as undecided or neutral. The analyses of the original 5-point Likert-scale BAMS data showed that disordered thresholds mainly occurred around the "Neutral" category. They treated responses to the "Neutral" category as missing data and reanalyzed the remaining data recoded to an ordered 4-point scale. For each of the three 4-point BAMS subscales, two latent classes were identified based on the CAIC. For the Blending of Memories subscale and the New Born, Womb, and Previous Lives Memories subscale, respondents from each of the latent classes used the items differently, resulting in an item difficulty ordering that was not invariant across latent classes. This indicated that different constructs related to the beliefs about memories might be measured within each latent class. For the Memory Storage subscale, however, the overall item difficulties were approximately the same for both classes except for one item. This led the authors to reasonably assume that the same underlying constructs were being measured across the latent classes.

Adjustment of response style effects on test scores by applying the MPCM. Rost et al.
(1997) pointed out that the trait parameters estimated with the MPCM are automatically corrected for the effects of a response style, and that this is the most practical implication of applying the MPCM to self-report data. Given that the MPCM provides $\theta$ estimates conditional on each response-style class and the sum score is the sufficient statistic for $\theta$ estimation, any differences observed in the class-specific $\theta$ estimates for the same raw score can be viewed as an adjustment or correction for the effects of response styles (Rost et al., 1997). Rost et al. (1997), Gollwitzer et al. (2005), and Smith et al. (2012) graphically showed the relation between sum scores and $\theta$ estimates in each latent class to demonstrate how the class-specific person traits estimated for the same sum score differ across the classes. The results of those studies commonly showed that respondents who chose more extreme categories received less extreme $\theta$ estimates than respondents with the same sum score but moderate response styles. These results implied that interpreting sum score differences among individuals without considering their response styles may lead to false inferences concerning individual differences in latent trait level. Although the potential of rectifying score bias by employing the MPCM was demonstrated in those empirical studies, it has not been investigated how the correction would operate for different types of response styles when multiple kinds of response styles are present.

Related to the correction of sum score bias, an important psychometric issue is whether $\theta$ estimates obtained with a mixture IRT model provide a better prediction of an external criterion than the $\theta$ estimates obtained with its non-mixture counterpart. Maij-de-Meij et al.
(2008) applied the mixture nominal response model and the MPCM to two personality inventory scales, Extraversion (E) and Neuroticism (N), and investigated whether $\theta$ estimates provided by the mixture models resulted in a better prediction of relevant external criteria. The results of this study showed that, for the N scale, the correlations between $\theta$ estimates and criterion measures were higher for the mixture models than for the non-mixture model. However, this improvement was not observed for the E scale.

2.6.2. Simulated data analysis

As reviewed in the previous sections, applications of the MPCM have been increasing. Little is known, however, about how accurately the MPCM estimates its model parameters. Only one simulation study, conducted by Rost (1991), demonstrated the capability of the MPCM to "unmix" heterogeneous item response data. Rost (1991) created three sets of data, each of which was compatible with the PCM, and selectively combined two of the three data sets to generate several mixtures of two latent classes. In generating the mixture data sets, he manipulated sample size, threshold distance, and the range of $\theta$, so that the mixtures differed with respect to "degree of heterogeneity". Specifically, the first and largest data set (N = 1000) had a wide range of item locations (-2.7 to +2.7), equal threshold distances ($\tau_{i2} - \tau_{i1}$ = 0.5 and $\tau_{i3} - \tau_{i2}$ = 0.5), and a wide range of $\theta$ values (-2.5 to +1.0). The second data set (N = 600) had a smaller range of item locations (-1.8 to +1.8), reversed thresholds with extremely unequal threshold distances ($\tau_{i2} - \tau_{i1}$ = 1.4 and $\tau_{i3} - \tau_{i2}$ = 0.2), and a narrow range of $\theta$ values (-1.0 to +1.0). The third data set (N = 800) had no variation in item locations (0 for all items), large and equal threshold distances ($\tau_{i2} - \tau_{i1}$ = 1.0 and $\tau_{i3} - \tau_{i2}$ = 1.0), and a narrow range of $\theta$ values (0 to 1.5).
In this study, the difficulty of detecting latent classes in the mixture distributions was expected to depend on the manipulated degree of heterogeneity. The mixture of the first and second data sets was expected to be the easiest to unmix because the item parameters and threshold distances differed strongly, while the mixture of the second and third data sets was expected to be the most difficult to unmix. The accuracy of threshold recovery, mixing proportion recovery, and class-specific mean score recovery from a single replication was evaluated by comparing the results for mixture data with those for non-mixture data. Results showed that the mean threshold distances and the class-specific score distributions were recovered fairly well. Some large deviations from the simulated condition were observed for the mixing proportions under certain conditions. These deviations, however, were interpreted as effects of the particular threshold sets manipulated, not as a bias of the estimation procedure. Rost (1991) also evaluated the quality of estimates for the mixture with three classes and found that the accuracy of parameter recovery for the three-class model was comparable with that for the two-class model. Regarding the model selection procedure, the AIC correctly identified the generated number of latent classes.

Chapter 3: Methodology

3.1 Objectives and Research Questions

The major objective of the current study is twofold: (i) to evaluate the quality of the respondent classification as well as item and person trait parameter recovery of the MPCM when the population is a mixture of different response-style respondents, and (ii) to investigate how the MPCM adjusts the latent trait estimates to compensate for the confounding effects of different response styles on test scores.
In addition to these major goals, the current study also explores the effectiveness of the information criterion statistics in identifying the correct number of latent classes in the MPCM. These objectives were addressed via a simulation study. The manipulated factors whose effects were assessed were type of mixture of response styles, mixing proportions, sample size, and test length. The specific research questions addressed in this study are as follows:

1. What percentage of respondents does the MPCM correctly classify within their true response-style class under various conditions?
2. In what percentage of replications do the information criterion statistics identify the correct number of latent classes?
3. What degree of accuracy of threshold parameter recovery does the MPCM provide under various simulation conditions when the accuracy is assessed by Pearson r, root mean square error, and standard error of estimates?
4. What degree of accuracy of person trait parameter recovery does the MPCM provide under various simulation conditions when the accuracy is assessed by bias, Pearson r, and root mean square error?
5. How are sum (total) scores and class-specific person trait parameters estimated with the MPCM related to each other under the simulated types of mixture distribution?

3.2 Overview of Simulation Study

3.2.1 Manipulated factors

The current simulation study considered five types of response-style mixture distribution: (i) ORS and ERS, (ii) ORS and MRS, (iii) ORS and ARS, (iv) ORS, ERS, and MRS, and (v) ORS, ERS, MRS, and ARS. The mixing proportions were manipulated to be equal or unequal. The "equal" condition represents a population in which respondents with different response styles are mixed in equal proportions, and the "unequal" condition represents a population in which the majority of respondents are ORS respondents and a very small proportion of respondents present distorted response styles.
Table 1 provides a summary of the types of mixture and mixing proportions manipulated in the current study.

Table 1. Manipulated Simulation Conditions of Population Heterogeneity

Condition  Mixing proportions  Class 1 ($\pi_1$)  Class 2 ($\pi_2$)  Class 3 ($\pi_3$)  Class 4 ($\pi_4$)
1          Equal               ORS (1/2)          ERS (1/2)
2          Equal               ORS (1/2)          MRS (1/2)
3          Equal               ORS (1/2)          ARS (1/2)
4          Equal               ORS (1/3)          ERS (1/3)          MRS (1/3)
5          Equal               ORS (1/4)          ERS (1/4)          MRS (1/4)          ARS (1/4)
6          Unequal             ORS (9/10)         ERS (1/10)
7          Unequal             ORS (9/10)         MRS (1/10)
8          Unequal             ORS (9/10)         ARS (1/10)
9          Unequal             ORS (8/10)         ERS (1/10)         MRS (1/10)
10         Unequal             ORS (7/10)         ERS (1/10)         MRS (1/10)         ARS (1/10)

Note: $\pi_c$ = mixing proportion for class c, ORS = ordinary response style, ERS = extreme response style, MRS = middle-category response style, ARS = acquiescent response style.

Two other manipulated factors were sample size and test length. Sample sizes were chosen at three levels: medium (N = 1200), moderately large (N = 3000), and large (N = 6000). As for test length, since psychological instruments commonly have a small number of items per subscale, a test as short as four items (I = 4) was explored, along with a moderate number of items (I = 10) and a large number of items (I = 20). These four manipulated factors were completely crossed, resulting in a total of ninety simulation conditions (10 mixture conditions x 3 sample sizes x 3 test lengths).

3.2.2 Fixed factors

Three factors, i.e., the number of response categories, the latent trait distribution within latent class, and the item locations, were fixed in the current study. First, the number of response categories was fixed at five. Second, the latent trait distribution was generated to be a normal distribution with a mean of 0 and a standard deviation of 1 for each latent class. Third, the location of item i was held invariant across the ORS, ERS, and MRS classes in order for the latent classes to differ only with respect to the dispersion of item responses (Rost et al., 1997). For the ARS class, however, the generated location of item i was not the same as that for the other response-style classes.
The high response probability for categories 3 and 4 of a positively worded item i resulted in a very low item location for that item. Similarly, very high item locations resulted for the negatively worded items. As this simulated condition for the item parameters in the ARS class indicates, if there is a group of ARS respondents in a sample, non-invariant item locations are likely to be manifested in a latent class.

3.2.3 Response scale

The current study assumed that item responses were obtained with a five-category Likert scale that had a built-in balanced scale. In the balanced scale, a pair of items asked about an equivalent construct in a positive as well as a negative statement. In scoring the category responses, responses to negatively worded items were reverse coded before being analyzed. For example, an endorsement of category 4, "strongly agree", on these items was scored as 0 and an endorsement of category 1, "disagree", as 3. Using reverse-coded category responses, instead of raw responses, affected the marginal distribution of category responses for the ARS class. The raw response frequency distributions for the ARS class would be negatively skewed for all items before recoding. After the recoding, however, the category response frequency distributions for negatively worded items were positively skewed, as can be seen in Figure 9.

3.3 Data Generation

The rating scale item responses that were confounded by the effects of response styles were generated based on the relation between threshold values and response category probabilities defined in the PCM. The common method of generating thresholds, such as randomly selecting threshold values within a certain range of the $\theta$ distribution, would not produce item responses that characterize ERS, MRS, or ARS. The subsequent section presents the details of how the population-generating thresholds were determined for each response-style subpopulation.
The generation of item responses then follows.

3.3.1 Population generating thresholds

The first step was to clearly delineate the distinguishing features of the four response-style classes by presuming marginal frequency distributions of category responses for each response-style class. Figure 9 presents the expected frequency distributions of category responses marginalized over all items administered in the four response-style classes. The specific probability values are presented in Table 2. For example, assuming that the $\theta$ distribution is normal, 14% of ORS respondents would choose "strongly disagree", 22% "disagree", 28% "neither disagree nor agree", 22% "agree", and 14% "strongly agree" on average over all items. If a group of people has an ERS tendency, about 81% of them would select "strongly disagree" or "strongly agree". In determining these marginal probabilities, a rather arbitrary decision was made because there were neither theoretical grounds nor empirically reported category response frequencies related to the response styles. Since overly sparse category response frequencies cause problems in estimation, an extremely small category response frequency (i.e., near zero percent) for any item was avoided.

Figure 9. Expected marginal frequency distributions of category responses for different response styles (%)

Table 2. Expected Marginal Category Probabilities for Different Response-style Classes

Class            Category 0   Category 1   Category 2   Category 3   Category 4
ORS                 0.14         0.22         0.28         0.22         0.14
ERS                 0.40         0.08         0.04         0.08         0.40
MRS                 0.08         0.16         0.52         0.16         0.08
ARS (positive)      0.05         0.05         0.05         0.23         0.62
ARS (negative)      0.62         0.23         0.05         0.05         0.05

Note: ORS = ordinary response style, ERS = extreme response style, MRS = middle-category response style, ARS = acquiescent response style.

The second step was to create variation in the category probabilities among items.
As shown in Table 3, the category probabilities were made to differ among items while the marginal category probabilities were kept close to the values initially specified in Table 2. Table 3 shows the variations created for ten items for the ORS class. The category probabilities for individual items for the other response styles are presented in Appendix A.

Table 3. Category Probabilities for Individual Items for ORS Class

Item   Category 1   Category 2   Category 3   Category 4   Category 5
1        0.1478       0.2245       0.2556       0.2245       0.1478
2        0.1539       0.2061       0.2801       0.2061       0.1539
3        0.1244       0.2413       0.2685       0.2413       0.1244
4        0.1069       0.2412       0.3038       0.2412       0.1069
5        0.1233       0.2354       0.2825       0.2354       0.1233
6        0.1657       0.2071       0.2543       0.2071       0.1657
7        0.1332       0.2308       0.2721       0.2308       0.1332
8        0.1550       0.2124       0.2653       0.2124       0.1550
9        0.1501       0.2065       0.2867       0.2065       0.1501
10       0.1416       0.1904       0.3360       0.1904       0.1416
Mean     0.1402       0.2196       0.2805       0.2196       0.1402

Note that the means of the category probabilities of the ten items remain almost the same as the marginal category probabilities specified in Table 2. These variations among items were introduced to generate item responses that fit the PCM instead of the rating scale model (RSM; Andrich, 1978). The RSM is restricted to have a common set of thresholds across all items.

The next step was to compute the threshold probability for each step by applying the simple Rasch logistic model to the series of adjacent categories. The computations are demonstrated using the first item in Table 3 as an example. As presented in Equation 11, the kth threshold for item i ($\tau_{ik}$) can be obtained by computing the natural logarithm of the odds ratio and subtracting it from the person trait:

$$\theta_n - \tau_{ik} = \ln\left(\frac{P_{nik}}{P_{ni,k-1}}\right), \qquad \tau_{ik} = -\ln\left(\frac{P_{nik}}{P_{ni,k-1}}\right) + \theta_n. \tag{11}$$

Ignoring the trait (i.e., setting $\theta_n = 0$) in Equation 11, $\tau_{ik}$ can be computed as follows:

                                      Category 0   Category 1   Category 2   Category 3   Category 4
Category probability ($\pi_{ik}$)        0.147        0.225        0.256        0.225        0.147

Step   Threshold probability ($p_{nik}$)                                       Odds $p_{nik}/(1 - p_{nik})$   Ln(Odds)   Threshold ($\tau_{ik}$)
1      $\pi_{i1}/(\pi_{i0} + \pi_{i1})$ = 0.225/(0.147 + 0.225) = 0.60         0.60/(1 - 0.60) = 1.53          0.426      -0.426
2      $\pi_{i2}/(\pi_{i1} + \pi_{i2})$ = 0.256/(0.225 + 0.256) = 0.53         0.53/(1 - 0.53) = 1.14          0.129      -0.129
3      $\pi_{i3}/(\pi_{i2} + \pi_{i3})$ = 0.225/(0.256 + 0.225) = 0.47         0.47/(1 - 0.47) = 0.88         -0.129       0.129
4      $\pi_{i4}/(\pi_{i3} + \pi_{i4})$ = 0.147/(0.225 + 0.147) = 0.40         0.40/(1 - 0.40) = 0.65         -0.426       0.426

During this threshold computation, item locations were fixed to zero. For the items to have different levels of difficulty, a positive or negative constant was added to each threshold. The resulting item difficulties are presented in Tables 4 to 7.

In the computation presented above, the item threshold values were computed without considering the $\theta$ distribution. In IRT models, the probability of an item response is determined as a function of both item and person parameters. Therefore, the person trait density needed to be combined with the computed threshold values (Equation 11). To achieve this combination, a histogram that follows the normal distribution was constructed, under which the determined thresholds (i.e., cut points on the $\theta$ continuum) were adjusted. The adjustment procedure was as follows: the $\theta$ range from -2.5 to 2.5 was divided into intervals with 0.5 increments, and a sample of 10,000 respondents was allotted to the intervals based on the cumulative normal distribution function. Using this sample of respondents and the initially computed threshold values for ten items, PCM item responses were generated. The generated item responses were analyzed to check the marginal category probabilities.
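The hand computation for Item 1 shown earlier in this section (thresholds obtained with $\theta_n = 0$, before any adjustment for the trait distribution) can be sketched in a few lines of Python. This is only an illustration: the function name `thresholds_from_category_probs` is hypothetical and not part of the original study, which reports using R, and the input is the rounded category probabilities from the worked example.

```python
import math


def thresholds_from_category_probs(probs):
    """Return thresholds implied by adjacent category probabilities at theta = 0.

    Hypothetical helper for illustration; `probs` lists the category
    probabilities for categories 0, 1, ..., in order.
    """
    thresholds = []
    for k in range(1, len(probs)):
        # Threshold probability: probability of category k given that the
        # response falls in category k-1 or category k.
        p_step = probs[k] / (probs[k - 1] + probs[k])
        # With theta = 0, the threshold is minus the log-odds of the step.
        thresholds.append(-math.log(p_step / (1.0 - p_step)))
    return thresholds


# Rounded category probabilities of Item 1 in the ORS class (worked example).
item1_probs = [0.147, 0.225, 0.256, 0.225, 0.147]
item1_thresholds = thresholds_from_category_probs(item1_probs)
print([round(t, 3) for t in item1_thresholds])  # [-0.426, -0.129, 0.129, 0.426]
```

Adding a constant to all four values would then shift the item location, as described for Tables 4 to 7.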
While monitoring the resulting category probabilities, several sets of four constants were alternately added to the initial threshold values until a set of thresholds was found that reproduced the expected marginal category probabilities as closely as possible. Tables 4 to 7 present the threshold parameters that were obtained from these adjustments for the ORS, ERS, MRS, and ARS classes, respectively. Thresholds for ten items were determined first, and those ten items were used twice to create the 20-item test. Four of the ten items, which are marked in Tables 4 to 7, were selected to create the 4-item test. The corresponding plots of the determined thresholds for ten items are presented in Figures 10 to 13. These threshold plots show the location of each threshold on the latent trait continuum on the y-axis. The characteristics of the sets of threshold parameters for each class are described in the subsequent sections.

Thresholds for ORS class. The population-generating thresholds for the ORS class are presented in Figure 10. As in Figure 4 in Chapter 2, which presents ordered and evenly spaced thresholds for a single item, the threshold plot in Figure 10 shows those properties across all items. In this group, passing a higher threshold requires more of the latent trait $\theta$.

Table 4. Threshold Values Used for the Generation of the ORS Class

Item   Threshold 1   Threshold 2   Threshold 3   Threshold 4   Location
1*       -1.5181       -0.5998        0.4998        1.4181       -0.05
2*       -1.2924       -0.6768        0.7768        1.3924        0.05
3        -1.8123       -0.6265        0.4265        1.6123       -0.10
4        -1.7632       -0.5510        0.7510        1.9632        0.10
5        -1.8469       -0.7522        0.4522        1.5469       -0.15
6        -1.1232       -0.4750        0.7750        1.4232        0.15
7        -1.7999       -0.7849        0.3849        1.3999       -0.20
8        -1.1654       -0.4421        0.8421        1.5654        0.20
9*       -1.6191       -0.9980        0.4980        1.1191       -0.25
10*      -1.0966       -0.7379        1.2379        1.5966        0.25
Mean     -1.5037       -0.6644        0.6644        1.5037        0

Note: * Selected item for 4-item test length condition.

Figure 10. Thresholds plot for 10 items for the ORS class

Thresholds for MRS class. The population-generating thresholds for the MRS class are presented in Figure 11. As can be seen in Figure 5, the distances between the second and third thresholds are large and there are reversals between thresholds $\tau_1$ and $\tau_2$ as well as thresholds $\tau_3$ and $\tau_4$.

Table 5. Threshold Values Used for the Generation of the MRS Class

Item   Threshold 1   Threshold 2   Threshold 3   Threshold 4   Location
1*       -2.1328       -2.8106        2.7106        2.0328       -0.05
2*       -1.9616       -2.3956        2.4956        2.0616        0.05
3        -1.1515       -3.2403        3.0403        0.9515       -0.10
4        -0.8654       -2.3722        2.5722        1.0654        0.10
5        -1.2218       -3.1974        2.8974        0.9218       -0.15
6        -1.0001       -2.7372        3.0372        1.3001        0.15
7        -2.0917       -2.7031        2.3031        1.6917       -0.20
8        -1.3698       -2.2797        2.6797        1.7698        0.20
9*       -1.9528       -2.2618        1.7618        1.4528       -0.25
10*      -1.1903       -2.1591        2.6591        1.6903        0.25
Mean     -1.4938       -2.6157        2.6157        1.4938        0

Note: * Selected item for 4-item test length condition.

Figure 11. Thresholds plot for 10 items for the MRS class

Thresholds for ERS class. The population-generating thresholds for the ERS class are presented in Figure 12. As can be seen in Figure 7, reversals occur and there are items that have no area between thresholds, indicating very sparse expected responses for some categories. Generally, the first threshold value is the greatest in this class. This indicates that it is hard for people in this class to pass the first threshold and, therefore, they end up selecting the first category (k = 0) rather than the second category (k = 1). On the other hand, the last threshold is the easiest to pass, indicating that respondents tend to pass the last threshold easily and select the last category (k = 4).

Table 6. Threshold Values Used for the Generation of the ERS Class

Item   Threshold 1   Threshold 2   Threshold 3   Threshold 4   Location
1*        0.4043        0.6851       -0.7851       -0.5043       -0.05
2*        0.7207        0.2235       -0.1235       -0.6207        0.05
3         1.0029       -0.1963       -0.0037       -1.2029       -0.10
4         1.1222        0.6516       -0.4516       -0.9222        0.10
5         0.6489        0.1037       -0.4037       -0.9489       -0.15
6         0.8159        1.1774       -0.8774       -0.5159        0.15
7         1.1394        0.2912       -0.6912       -1.5394       -0.20
8         1.0474        0.7020       -0.3020       -0.6474        0.20
9*        0.2460        0.2106       -0.7106       -0.7460       -0.25
10*       1.0952        0.9962       -0.4962       -0.5952        0.25
Mean      0.8243        0.4845       -0.4845       -0.8243        0

Note: * Selected item for 4-item test length condition.

Figure 12. Thresholds plot for 10 items for the ERS class

Thresholds for ARS class. The population-generating thresholds for the ARS class are presented in Figure 13. The first five items are written as positive statements, whereas Items 6 to 10 are written as negative statements. The threshold profile of the negatively stated items is located in the upper range of the $\theta$ continuum, whereas the threshold profile of the positively stated items is located in the lower range of the $\theta$ continuum.

Table 7. Threshold Values Used for the Generation of the ARS Class

Item   Threshold 1   Threshold 2   Threshold 3   Threshold 4   Location
1*       -1.5092       -1.6300       -2.4202       -0.9255      -1.6210
2*       -1.6323       -1.0445       -2.5418       -0.8828      -1.5250
3        -1.7054       -1.6953       -2.1910       -1.2787      -1.7170
4        -1.3810       -1.2554       -2.4918       -0.6646      -1.4480
5        -1.8507       -1.5547       -2.3851       -1.2644      -1.7630
Mean     -1.6157       -1.4360       -2.4060       -1.0032      -1.6148
6*        0.8255        2.3202        1.5300        1.4092       1.5225
7*        0.9828        2.6418        1.1445        1.7323       1.6235
8         1.0787        1.9910        1.4953        1.5054       1.5176
9         0.8646        2.6918        1.4554        1.5810       1.6482
10        0.9644        2.0851        1.2547        1.5507       1.4625
Mean      0.9432        2.3460        1.3760        1.5557       1.5549

Note: * Selected item for 4-item test length condition.

Figure 13. Thresholds plot for 10 items for the ARS class (Items 1 to 5 are positively stated, Items 6 to 10 are negatively stated)

3.3.2 Item responses generation.

To generate item responses, person trait parameters $\theta_n$ were randomly drawn for each replication from a standard normal distribution, N(0, 1). The true $\theta_n$
and population-generating threshold parameters determined for each response style were substituted into the MPCM formula. Five category probabilities ($\pi_{ik}$) were computed for each respondent as demonstrated in Equation 5. These category probabilities served as the outcome probabilities of a multinomial distribution. Assuming that one experiment was performed that yielded k = 5 possible outcomes with probabilities $\pi_{i1}, \ldots, \pi_{ik}$, if the kth outcome was obtained, the kth entry of the multinomial random vector took on a value of 1, while all other entries took on values of 0. An entry of 1 in position k was scored as k - 1, so that category scores from 0 to 4 were assigned. The item response data used in the simulation were generated with R 2.14.1 (R Development Core Team, 2011).

Figures 14 and 15 present the conditional frequency distributions of category responses obtained for a single simulated data set. In the plots, the data set is divided into four groups according to the respondents' $\theta$ level, i.e., below the 25th percentile, from the 25th to the 50th, from the 50th to the 75th, and above the 75th percentile. Within each group, the frequency of category responses was counted. Figure 14 is based on an item with a lower item location, whereas Figure 15 is based on an item with a higher item location. It is clearly seen from Figures 14 and 15 that the category response probabilities are jointly influenced by the respondent's $\theta$ level and a response style. For the ORS class, with no response style bias involved, high response category frequencies gradually shift from the lower categories to the higher categories as the percentile becomes higher. This pattern of category probability shift conditional on $\theta$ level is commonly observed across all response-style classes. If ERS, MRS, or ARS is involved, however, particular response categories tend to produce the largest frequency across all levels of $\theta$, while the gradual shift of the category probability conditional on $\theta$ level remains.
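The generation step of Section 3.3.2 can be sketched as follows. This is a minimal illustration rather than the study's R code: Equation 5 is not reproduced in this section, so the standard PCM category probability (cumulative sums of $\theta - \tau_{ik}$) is assumed, and the function names and the seed are hypothetical. The thresholds are those of Item 1 in Tables 4 and 6.

```python
import math
import random


def pcm_category_probs(theta, thresholds):
    """Category probabilities for one item under the standard PCM (assumed form)."""
    # Category 0 has logit 0; category k adds (theta - tau_k) to category k-1.
    logits = [0.0]
    for tau in thresholds:
        logits.append(logits[-1] + (theta - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(logits)
    weights = [math.exp(v - top) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_class(thresholds_by_item, n_persons, rng):
    """Draw theta ~ N(0, 1) per person and one category score (0-4) per item."""
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        row = []
        for taus in thresholds_by_item:
            probs = pcm_category_probs(theta, taus)
            # One multinomial trial: pick a category with these probabilities.
            row.append(rng.choices(range(len(probs)), weights=probs)[0])
        data.append(row)
    return data


ors_item1 = [-1.5181, -0.5998, 0.4998, 1.4181]  # Table 4, Item 1
ers_item1 = [0.4043, 0.6851, -0.7851, -0.5043]  # Table 6, Item 1

rng = random.Random(2013)  # arbitrary seed, for reproducibility only
ors_data = simulate_class([ors_item1], 1000, rng)
ers_data = simulate_class([ers_item1], 1000, rng)

ors_probs_at_zero = pcm_category_probs(0.0, ors_item1)
ers_probs_at_zero = pcm_category_probs(0.0, ers_item1)
```

At $\theta = 0$ the ORS thresholds make the middle category the most likely response, whereas the ERS thresholds make it the least likely, which is the contrast the conditional frequency plots are meant to display.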
Figure 14. Conditional frequency distributions of category responses for an item with lower item location

Figure 15. Conditional frequency distributions of category responses for an item with higher item location

3.4 Analysis and Evaluation Criteria

The simulated data sets that represent different mixtures of response-style respondents were estimated with the MPCM using the mdltm software. mdltm supports analyses with a wide range of latent variable models such as unidimensional and multidimensional IRT models, latent class models, mixture IRT models, and diagnostic models (e.g., von Davier, 2005b). It implements the EM algorithm (Dempster, Laird, & Rubin, 1977) to obtain marginal maximum likelihood estimates of parameters. The parameter estimates provided by mdltm were collated and the evaluation criteria were calculated using R 2.14.1.

3.4.1 Fitting competing models

Assuming that the true model was known to be the MPCM but the number of latent classes in the population was unknown, the current study fit three MPCMs with increasing numbers of latent classes to each simulated data set. For 2-class generated data sets, 1-, 2-, and 3-class MPCMs were fit to the data. For 3-class generated data sets, 2-, 3-, and 4-class MPCMs were fit. Finally, for 4-class generated data sets, 3-, 4-, and 5-class MPCMs were fit. These three competing estimation models, namely (i) the under-fitting model, which had one class fewer than the data generation model, (ii) the correct-fitting model, which had the same number of classes as the data generation model, and (iii) the over-fitting model, which had one class more than the data generation model, were compared with respect to their information criterion statistics: AIC, BIC, and CAIC.

3.4.2 Convergence check

To ensure that the results of each simulation analysis were grounded only on well-estimated solutions, convergence checks were conducted for each of the three competing solutions for each simulated data set.
If non-convergence occurred for the correct-fitting model, all three competing solutions from that replication were discarded. To make up for the simulated data sets that were discarded because of non-convergent solutions, additional data sets were generated. This allowed for a total of one hundred converged replications for each simulation condition.

3.4.3 Model selection

To assess the relative effectiveness of the information criterion statistics AIC, BIC, and CAIC in identifying the correct number of latent classes in the MPCM, the index values were obtained for each of the three estimation models. The estimation model that provided the smallest index value was selected as the best-fitting model. For each index, the proportion of replications in which the true model was identified as the best-fitting model was computed. In addition, the proportions of under-identification and over-identification of latent classes were examined. The results for the three indices were compared to determine their relative effectiveness in identifying the correct number of latent classes under the various simulation conditions.

3.4.4 Problem of label switching

Label switching refers to the arbitrary mismatch between generated class membership and estimated class membership in a simulation study of mixture modeling. In the current study, for mixture data of ORS and ERS, for example, there are two possible ways in which the estimated latent classes can be labeled: ORS for the first estimated class and ERS for the second, or conversely, ERS for the first and ORS for the second estimated class. In a general formulation, there are up to C! (= C x (C - 1) x ... x 2 x 1, where C is the number of latent classes) possible permutations of latent class membership assignments. Only one of the possible permutations is the correct match; the others indicate the occurrence of various patterns of label switching.
In order to obtain correct measures for the evaluation of parameter recovery, switched labels must be detected and mismatched class memberships must be corrected before estimates are aggregated across replications. In a simulation study where a large number of replication results must be aggregated, it is practically impossible to inspect the output for each data set manually to identify label switching. The correction of latent class labels therefore needs to be automated in the course of the analysis.

In the current study, a post-hoc technique was devised by the author to detect and correct switched latent class membership based on the information from the threshold estimates. This algorithm takes advantage of the distinctive order of thresholds that characterizes each response-style class. As presented in Table 8, the mean values of the population generating thresholds across all items show a particular order of magnitude within each response-style class. If the means of the estimated thresholds ($\bar{\tau}_1$, $\bar{\tau}_2$, $\bar{\tau}_3$, and $\bar{\tau}_4$) for an estimated class satisfy the order $\{\bar{\tau}_1 < \bar{\tau}_2 < \bar{\tau}_3 < \bar{\tau}_4\}$, that class is identified as an ORS class. If the set of means satisfies the condition $\{\bar{\tau}_1 > 0 \text{ and } \bar{\tau}_4 < 0\}$, the class is identified as an ERS class. For the MRS and ARS classes, the conditions $\{\bar{\tau}_1 < 0 \text{ and } \bar{\tau}_2 < 0 \text{ and } \bar{\tau}_1 > \bar{\tau}_2\}$ and $\{\bar{\tau}_1 < 0 \text{ and } \bar{\tau}_2 > 0 \text{ and } \bar{\tau}_3 < 0 \text{ and } \bar{\tau}_4 > 0\}$ are applied, respectively.

Table 8. Means of Generated Threshold Parameters for Each Response-style Class

Class   Threshold1   Threshold2   Threshold3   Threshold4
ORS       -1.5037      -0.6644       0.6644       1.5037
ERS        0.8243       0.4845      -0.4845      -0.8243
MRS       -1.4938      -2.6157       2.6157       1.4938
ARS       -0.3363       0.4550      -0.5150       0.2763

In addition to this algorithm based on threshold characteristics, a different algorithm developed by Tueller, Drotar, and Lubke (2011), which is based on the information from respondent classification, was implemented.
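The threshold-order rules described above can be sketched in a few lines of Python. This is an illustrative reading of the four conditions, not the author's original implementation (which is not shown in the text); the function names `identify_class` and `identify_labels` are hypothetical. The four conditions are mutually exclusive, so the order in which they are tested does not matter.

```python
def identify_class(mean_thresholds):
    """Map four mean threshold estimates to a response-style label.

    Returns None when none of the four order conditions holds, in
    which case the replication would need another detection method.
    """
    t1, t2, t3, t4 = mean_thresholds
    if t1 < t2 < t3 < t4:
        return "ORS"
    if t1 > 0 and t4 < 0:
        return "ERS"
    if t1 < 0 and t2 < 0 and t1 > t2:
        return "MRS"
    if t1 < 0 and t2 > 0 and t3 < 0 and t4 > 0:
        return "ARS"
    return None


def identify_labels(class_means):
    """Label every estimated class; reject ambiguous solutions."""
    labels = [identify_class(m) for m in class_means]
    if None in labels or len(set(labels)) != len(labels):
        return None
    return labels


# Generating means from Table 8, deliberately listed in a switched order.
switched = [
    (0.8243, 0.4845, -0.4845, -0.8243),
    (-1.4938, -2.6157, 2.6157, 1.4938),
    (-0.3363, 0.4550, -0.5150, 0.2763),
    (-1.5037, -0.6644, 0.6644, 1.5037),
]
print(identify_labels(switched))
```

Once each estimated class carries a label, thresholds, class assignments, and person estimates can be reordered to the generated order before aggregation.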
The results of employing these two different algorithms were compared.

3.4.5 Classification accuracy

Classification accuracy was evaluated for the correct-fitting model solutions. It was computed as the proportion of respondents who were assigned to their generated class membership based on the magnitudes of the posterior probabilities for the various class memberships. Not only the correct classification rate but also the nature of the misclassifications was closely examined. Misclassified individuals were cross-tabulated for all possible combinations of misclassification to explore whether there was any particular misclassification pattern.

3.4.6 Threshold parameter recovery

The accuracy of threshold parameter recovery was evaluated in terms of Pearson r, root mean square error (RMSE), and standard error of estimates (SE). Correlation and RMSE provide measures of the overall accuracy of parameter estimates. The closer the generated and estimated parameters are to each other, the higher the positive correlation and the smaller the RMSE. For threshold parameter recovery, SE was computed as the standard deviation of the sample estimates around their average value. It indicates the stability of the parameter estimates: large fluctuation of the estimated parameter values from replication to replication increases the SE. For item parameter recovery, the three evaluation criteria were calculated for each of the four thresholds. They are computed as follows:

$$\mathrm{Corr}(\hat{\tau}_k) = \frac{1}{W}\sum_{w=1}^{W} r_{\tau_k \hat{\tau}_{kw}}, \qquad k = 1, \ldots, 4,$$

$$\mathrm{RMSE}(\hat{\tau}_k) = \sqrt{\frac{\sum_{w=1}^{W}\sum_{i=1}^{I} (\hat{\tau}_{ikw} - \tau_{ik})^2}{WI}},$$

$$\mathrm{SE}(\hat{\tau}_k) = \sqrt{\frac{1}{WI}\sum_{w=1}^{W}\sum_{i=1}^{I}\left(\hat{\tau}_{ikw} - \frac{1}{WI}\sum_{w=1}^{W}\sum_{i=1}^{I}\hat{\tau}_{ikw}\right)^2},$$

where $r_{\tau_k \hat{\tau}_{kw}}$ is the Pearson r between the kth true thresholds ($\tau_{ik}$) and their estimates ($\hat{\tau}_{ikw}$) in replication w, i indexes the items (i = 1, ..., I), and w indexes the replications (w = 1, ..., W).
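As an illustration of how the three threshold recovery criteria could be computed, the sketch below uses plain Python. It assumes one particular reading of the definitions: the correlation is computed across items within each replication and then averaged over replications, and the SE is the standard deviation of all estimates around their overall mean. The function names and the toy data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def threshold_recovery(true_tau, est_tau):
    """Return (Corr, RMSE, SE) for one threshold k.

    true_tau: list of I generating values for threshold k.
    est_tau:  list of W replications, each a list of I estimates.
    """
    n_rep, n_items = len(est_tau), len(true_tau)
    total = n_rep * n_items
    # Average of the within-replication correlations.
    corr = sum(pearson_r(true_tau, rep) for rep in est_tau) / n_rep
    # Squared error pooled over replications and items.
    sq_err = sum((rep[i] - true_tau[i]) ** 2
                 for rep in est_tau for i in range(n_items))
    rmse = sqrt(sq_err / total)
    # Spread of all estimates around their overall mean.
    grand_mean = sum(sum(rep) for rep in est_tau) / total
    se = sqrt(sum((v - grand_mean) ** 2
                  for rep in est_tau for v in rep) / total)
    return corr, rmse, se


# Toy data: two replications that recover three item thresholds exactly.
true_values = [-1.0, 0.0, 1.0]
estimates = [[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]]
corr, rmse, se = threshold_recovery(true_values, estimates)
print(corr, rmse, se)
```

Under this reading the SE reflects both the variation across items and the variation across replications, so it is not zero even when every replication reproduces the generating values.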
The mean bias, which measures the discrepancy between generated and estimated parameters, was not considered as an evaluation criterion for threshold parameter recovery in the current study. During parameter estimation, the item constraint method was used for the purpose of model identification. As introduced briefly in Section 2.5.1, either the item parameters or the person trait parameters need to be constrained to resolve the indeterminacy problem in IRT models. The software mdltm allows the user to choose either of the two constraint methods. If item constraints are used, the sum of the estimated thresholds is zero in each latent class, whereas if person constraints are used, the sum of the estimated $\theta$s is zero in each latent class. The current study used the former method. Consequently, the mean bias across thresholds and items was zero for all simulation conditions, so it could not legitimately serve as an evaluation criterion, as had originally been proposed.

3.4.7 Person trait parameter recovery

The accuracy of person trait parameter ($\theta$) recovery was evaluated in terms of Pearson r, bias, and root mean square error (RMSE). For theta recovery, the evaluation criteria were calculated for each class as follows:

$$\mathrm{Corr}(\hat{\theta}) = \frac{1}{W}\sum_{w=1}^{W} r_{\theta \hat{\theta}_{w}},$$

$$\mathrm{Bias}(\hat{\theta}) = \frac{1}{WN}\sum_{w=1}^{W}\sum_{n=1}^{N} (\hat{\theta}_{nw} - \theta_n),$$

$$\mathrm{RMSE}(\hat{\theta}) = \sqrt{\frac{\sum_{w=1}^{W}\sum_{n=1}^{N} (\hat{\theta}_{nw} - \theta_n)^2}{WN}},$$

where $\theta_n$ is the true trait of the nth person, $\hat{\theta}_{nw}$ is its estimate in replication w, and N is the sample size (the total number of respondents).

3.4.8 Model-based correction of score bias due to response styles

The relation between sum scores and the MPCM $\theta$ estimates was investigated. To explore how this relation differs across latent classes when two, three, or four different types of response style are mixed, plots were created in which the MPCM $\theta$ estimates were depicted as a function of the sum scores.
3.4.9 Evaluation of effects of manipulated factors

One of the main interests of the current study was to investigate the influence of four factors on the performance of the MPCM: i) type of mixture at five levels, ii) mixing proportions at two levels, iii) sample size at three levels, and iv) test length at three levels. Using the evaluation criteria measures (i.e., percentages, biases, RMSEs, correlations, and SEs) as the dependent variables, several factorial ANOVAs were conducted. The four main effects of the manipulated factors and all two-way interaction effects were included in the ANOVA model. The higher order interaction effects were folded into the error term. In the current study, many cell means were unavailable because simulation conditions in which problems of estimation and label switching occurred were excluded. Under this incomplete design, in which some estimated cell means were missing, higher order interaction effects would have been difficult to interpret properly and would have provided limited (possibly misleading) information about the manipulated factors.

The influence of a manipulated factor was considered statistically significant if the associated p-value was less than .05. Practical significance was measured by the effect size index $\eta^2 = SS_{\text{effect}} / SS_{\text{total}}$, defined as the proportion of variance accounted for by the manipulated effect. According to Cohen (1988), $\eta^2$ values of 0.06 and 0.14 represent medium and large effect sizes for factorial ANOVA, respectively. In the current study, the importance of the effects of the manipulated factors was evaluated based on the combination of statistical significance and practical significance. Only those manipulated factors for which the p-value was smaller than 0.05 and, at the same time, $\eta^2$ was greater than 0.06 (medium effect) or 0.14 (large effect) were interpreted.
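The combined decision rule can be sketched as follows. This is a minimal illustration of the rule as described, with hypothetical function names and made-up sums of squares; it does not reproduce the study's ANOVA computations.

```python
def eta_squared(ss_effect, ss_total):
    """Proportion of total variance accounted for by one effect."""
    return ss_effect / ss_total


def effect_importance(p_value, eta_sq, alpha=0.05):
    """Classify an effect using both statistical and practical significance.

    An effect is interpreted only if p < alpha and eta squared exceeds
    0.06 (medium) or 0.14 (large), following the cutoffs cited above.
    """
    if p_value >= alpha or eta_sq <= 0.06:
        return "not interpreted"
    return "large" if eta_sq > 0.14 else "medium"


# Hypothetical sums of squares for one manipulated factor.
example = effect_importance(0.01, eta_squared(20.0, 100.0))
print(example)
```

Requiring both criteria guards against interpreting effects that reach statistical significance only because of the large number of simulated observations.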
Chapter 4: Results

Chapter 4 presents the results of the current simulation study in six sections. Before the results that answer the main research questions are presented, Section 4.1 addresses how the current study treated problems related to the convergence of the program to reasonable model parameter estimates, as well as issues surrounding label switching. Section 4.2 provides the results of model selection under the MPCM based on information criterion statistics. The results on model performance in the recovery of latent class membership, item threshold parameters, and person trait parameters are provided in Sections 4.3, 4.4, and 4.5, respectively. Finally, findings regarding the model-based correction of person trait estimates are discussed in Section 4.6.

4.1. Initial Treatment of Estimation Problems and Label Switching Problems

4.1.1 Non-convergence and boundary estimates

The population models used to generate item response data for this simulation study were five different MPCMs: i) three 2-class MPCMs representing mixtures of ORS-ERS, ORS-MRS, and ORS-ARS, ii) a 3-class MPCM representing a mixture of ORS-ERS-MRS, and iii) a 4-class MPCM representing a mixture of ORS-ERS-MRS-ARS. Data from these five generation models were estimated not only under the same MPCM (i.e., correct-fitting) but also under an under-fitting model (i.e., an MPCM with one class fewer than the data generation model) and an over-fitting model (i.e., an MPCM with one more latent class than the data generation model). Two situations that may indicate problems in achieving convergence of parameter estimates were checked for these three estimation solutions. The first situation arose when estimation terminated without convergence.
The second situation arose when maximum likelihood estimates of item thresholds approached the boundary of permissible parameter values. These two problems were reported separately. The software mdltm provides an explicit warning message that indicates the occurrence of the first situation. The percentage of replications in which this warning message appeared is reported in Table 9. For the second situation, threshold estimates greater than 9.0 or smaller than -9.0 were flagged, and the percentage of replications in which one or more boundary estimates were flagged is reported in parentheses in Table 9.

Correct-parameterization. Under the correct-fitting model, neither non-convergence nor boundary estimates occurred at any level of the ORS-ERS mixtures. However, for the other types of mixtures, substantial numbers of boundary estimates appeared when the sample size was relatively small (N = 1200). Specifically, boundary estimates occurred for the MRS or ARS thresholds when the expected response probabilities for the corresponding response categories were essentially zero. When the sample size was N = 1200 and the mixing proportions were $\pi = 0.9$ versus $\pi = 0.1$, there were only 120 respondents in the MRS or ARS class. Recall that the expected category probability for the 1st and 5th response categories for the MRS class was set to approximately 6%, while that for the 1st, 2nd, and 3rd categories for the ARS class was approximately 5%. That means that as few as 72 or 60 responses were assigned to those response categories. This data generation condition resulted in essentially zero expected frequencies in some randomly generated samples and may very well explain why the software converged to such extreme boundary values. It appears that a sample size of N = 1200 was not large enough to provide sufficient information, so that the maximum likelihood estimates often fell at the boundary.

Under-parameterization.
Under the under-fitting model, neither non-convergence nor boundary estimates occurred for any of the 2-response-style mixtures or for the 3-response-style mixtures. However, the 4-response-style mixture with 4 items and a sample size of N = 6000 produced a non-convergence rate of 0.49 when it was fit with an under-fitting model.

Over-parameterization. As expected, estimation problems increased under the over-fitting model, and almost all simulation conditions produced boundary threshold estimates. The average rate of occurrence of boundary estimates was 0.46. Higher rates of boundary estimates were observed when i) the data generation model had three or four latent classes, ii) the sample size was N = 1200, or iii) the mixing proportions were unequal. These findings may have implications for practitioners who use these methods in real data analysis. That is, the occurrence of infeasible extreme threshold values may indicate over-parameterization (estimating a model with too many latent classes), a sample size that is insufficient to estimate the parameters for a given data set, or a combination of the two.

Table 9. Percentages of the Occurrence of Non-convergence and Boundary Threshold Estimates

                      ORS-ERS                       ORS-MRS                       ORS-ARS
Mixing  Items Sample  1-class   2-class   3-class   1-class   2-class   3-class   1-class   2-class   3-class
50:50   4     1200    0 (0)†    0 (0)     0 (6)     0 (0)     0 (9)     0 (67)    0 (0)     0 (0)     1 (6)
50:50   4     3000    0 (0)     0 (0)     0 (0)     0 (0)     0 (3)     0 (10)    0 (0)     0 (0)     0 (1)
50:50   4     6000    0 (0)     0 (0)     0 (1)     0 (0)     0 (0)     0 (16)    0 (0)     9 (0)     8 (1)
50:50   10    1200    0 (0)     0 (0)     5 (51)    0 (0)     0 (0)     0 (96)    0 (0)     0 (0)     0 (56)
50:50   10    3000    0 (0)     0 (0)     1 (35)    0 (0)     0 (0)     0 (82)    0 (0)     0 (0)     0 (92)
50:50   10    6000    0 (0)     0 (0)     2 (27)    0 (0)     0 (0)     0 (84)    0 (0)     0 (0)     0 (87)
50:50   20    1200    0 (0)     0 (0)     2 (44)    0 (0)     0 (0)     2 (99)    0 (0)     0 (0)     2 (33)
50:50   20    3000    0 (0)     0 (0)     2 (41)    0 (0)     0 (0)     1 (87)    0 (0)     0 (0)     0 (26)
50:50   20    6000    0 (0)     0 (0)     2 (32)    0 (0)     0 (0)     16 (50)   0 (0)     0 (0)     0 (60)
90:10   4     1200    0 (0)     0 (0)     0 (19)    0 (0)     3 (1)     7 (24)    0 (0)     3 (25)    13 (42)
90:10   4     3000    0 (0)     0 (0)     0 (6)     0 (0)     1 (3)     20 (13)   0 (0)     0 (2)     15 (8)
90:10   4     6000    0 (0)     0 (0)     0 (0)     0 (0)     0 (0)     0 (21)    0 (0)     8 (0)     7 (5)
90:10   10    1200    0 (0)     0 (0)     1 (48)    0 (0)     0 (65)‡   0 (79)    0 (0)     1 (48)‡   0 (15)
90:10   10    3000    0 (0)     0 (0)     0 (34)    0 (0)     0 (16)    0 (54)    0 (0)     0 (6)     0 (16)
90:10   10    6000    0 (0)     0 (0)     0 (31)    0 (0)     0 (0)     1 (17)    0 (0)     0 (0)     0 (0)
90:10   20    1200    0 (0)     0 (0)     0 (25)    0 (0)     0 (75)‡   0 (93)    0 (0)     0 (58)‡   0 (64)
90:10   20    3000    0 (0)     0 (0)     1 (22)    0 (0)     0 (7)     0 (37)    0 (0)     0 (0)     0 (23)
90:10   20    6000    0 (0)     0 (0)     3 (77)    0 (0)     0 (0)     7 (45)    0 (0)     0 (0)     3 (4)

Table 9 (continued)

                      ORS-ERS-MRS                   ORS-ERS-MRS-ARS
Mixing  Items Sample  2-class   3-class   4-class   3-class   4-class   5-class
50:50   4     1200    0 (0)     0 (8)     0 (41)    0 (5)     0 (48)‡   0 (67)
50:50   4     3000    0 (0)     0 (5)     2 (21)    0 (0)     0 (12)    0 (41)
50:50   4     6000    0 (0)     1 (1)     0 (12)    0 (0)     0 (0)     0 (21)
50:50   10    1200    0 (0)     0 (5)     4 (79)    0 (13)    0 (27)    3 (89)
50:50   10    3000    0 (0)     0 (0)     0 (32)    0 (0)     0 (1)     0 (44)
50:50   10    6000    0 (0)     0 (0)     2 (51)    0 (0)     0 (0)     2 (44)
50:50   20    1200    0 (0)     0 (1)     0 (29)    0 (6)     0 (10)    1 (74)
50:50   20    3000    0 (0)     0 (0)     0 (82)    0 (0)     0 (0)     0 (51)
50:50   20    6000    0 (0)     0 (0)     0 (92)    0 (0)     0 (4)     0 (77)
90:10   4     1200    0 (0)     0 (15)    0 (61)    0 (29)    0 (46)‡   0 (71)
90:10   4     3000    0 (0)     0 (7)     0 (42)    2 (11)    0 (11)    1 (30)
90:10   4     6000    0 (0)     2 (3)     3 (14)    49 (0)    5 (3)     5 (66)
90:10   10    1200    0 (0)     0 (69)‡   1 (93)    1 (51)    0 (96)‡   2 (99)
90:10   10    3000    0 (0)     0 (19)    1 (90)    0 (18)    0 (32)    0 (77)
90:10   10    6000    0 (0)     0 (0)     0 (44)    0 (1)     0 (3)     0 (50)
90:10   20    1200    0 (0)     0 (79)‡   1 (98)    0 (78)    0 (93)‡   0 (99)
90:10   20    3000    0 (0)     0 (8)     0 (79)    0 (8)     0 (19)    0 (83)
90:10   20    6000    0 (0)     1 (9)     0 (32)    0 (0)     0 (9)     0 (43)

Note. † Percentage of the occurrences of boundary estimates is presented in parentheses. ‡ Excluded from simulation summary due to high occurrence rate of boundary estimates.

Exclusion of estimation solutions with estimation problems. Ten of the ninety conditions in the current simulation design presented boundary threshold estimates in approximately half or more of the replications when the generated data sets were parameterized with the correct model. These conditions with a high level of estimation problems were excluded from the simulation summary and are listed in Table 10. For other simulation conditions with a moderate level of estimation problems (i.e., either non-convergence or boundary estimates in between 1% and 30% of replications), the problematic results were discarded and replaced with new replications that did not present these problems.

Table 10. Specifications of Simulation Conditions Excluded from Simulation Summary Due to Estimation Problems

Type of mixture    Mixing proportions  Number of items  Sample size  Occurrence rate of boundary estimates (%)
ORS-MRS            0.9 : 0.1           10               1200         65
ORS-MRS            0.9 : 0.1           20               1200         75
ORS-ARS            0.9 : 0.1           10               1200         48
ORS-ARS            0.9 : 0.1           20               1200         58
ORS-ERS-MRS        0.9 : 0.1           10               1200         69
ORS-ERS-MRS        0.9 : 0.1           20               1200         79
ORS-ERS-MRS-ARS    0.5 : 0.5           4                1200         48
ORS-ERS-MRS-ARS    0.9 : 0.1           4                1200         46
ORS-ERS-MRS-ARS    0.9 : 0.1           10               1200         96
ORS-ERS-MRS-ARS    0.9 : 0.1           20               1200         93

In general, parameter estimation in the MPCM achieved fairly high convergence rates across the various simulation conditions. However, a sample size of N = 1200 appeared to be insufficient to provide well-estimated parameters, especially when a small proportion of the respondents in a sample presented ERS, MRS, or ARS.

4.1.2 Label switching problems

As is usual in any mixture modeling simulation study, label switching occurred.
In the current study, label switching was detected using two different algorithmic approaches. The first algorithm, developed by the author, was based on the information from the threshold estimates, while the second algorithm, developed by Tueller, Drotar, and Lubke (2011), was based on the information from respondent classifications.

Label switching correction algorithm based on thresholds information. As explained in Section 3.4.4, to automate the correction of switched class membership, an algorithm was developed that exploits the distinctive order of the thresholds that characterizes each response style. To demonstrate how the algorithm works, threshold estimates from the 4-response-style mixture with 10 items and a sample size of N = 6000 are used as an illustrative example in the following.

First, the mean thresholds over the ten items were calculated for each replication. The mean values over all items were used instead of individual item threshold estimates because the mean values were more consistent from replication to replication. The following matrix shows the mean thresholds for the first five replications.

       Class 1                    Class 2                    Class 3                    Class 4
       τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4
Rep1   0.83   0.45  -0.47  -0.81  -1.46  -2.69   2.63   1.52  -0.25   0.39  -0.46   0.32  -1.47  -0.73   0.71   1.50
Rep2   0.86   0.39  -0.41  -0.85  -1.40  -2.71   2.66   1.46  -0.39   0.44  -0.42   0.38  -1.48  -0.70   0.78   1.41
Rep3   0.83   0.54  -0.52  -0.86  -1.45  -0.65   0.65   1.44  -0.19   0.45  -0.54   0.29  -1.44  -2.58   2.63   1.40
Rep4  -1.44  -0.68   0.62   1.50   0.83   0.51  -0.41  -0.93  -0.27   0.45  -0.48   0.30  -1.41  -2.59   2.55   1.45
Rep5  -1.48  -0.61   0.65   1.44   0.84   0.42  -0.48  -0.78  -0.38   0.70  -0.54   0.21  -1.47  -2.55   2.61   1.42

The first set of four thresholds from Replication 1 satisfies the condition $\{\bar{\tau}_1 > 0 \text{ and } \bar{\tau}_4 < 0\}$, which characterizes the ERS class. Note that none of the remaining sets meets this condition. The second set satisfies the condition $\{\bar{\tau}_1 < 0 \text{ and } \bar{\tau}_2 < 0 \text{ and } \bar{\tau}_1 > \bar{\tau}_2\}$, which characterizes the MRS class. The third set satisfies the condition $\{\bar{\tau}_1 < 0 \text{ and } \bar{\tau}_2 > 0 \text{ and } \bar{\tau}_3 < 0 \text{ and } \bar{\tau}_4 > 0\}$, which characterizes the ARS class. Finally, the fourth set satisfies the condition $\{\bar{\tau}_1 < \bar{\tau}_2 < \bar{\tau}_3 < \bar{\tau}_4\}$, which characterizes the ORS class. Originally, the generated latent class labels were ORS, ERS, MRS, and ARS for class 1, class 2, class 3, and class 4, respectively. Thus, the estimated class labels for Replication 1 (i.e., ERS, MRS, ARS, and ORS) were identified as switched labels. There are 4! = 24 possible ways in which four class labels can be ordered. Each replication was checked against all twenty-four possible orderings and the proper label was assigned to each latent class. The class labels identified for the five replications in the illustration are as follows:

       Class 1  Class 2  Class 3  Class 4
Rep1   ERS      MRS      ARS      ORS
Rep2   ERS      MRS      ARS      ORS
Rep3   ERS      ORS      ARS      MRS
Rep4   ORS      ERS      ARS      MRS
Rep5   ORS      ERS      ARS      MRS

Based on these identified class labels, the thresholds matrix was reorganized as presented below. Likewise, the matrices of class membership assignment and of person trait estimates (not presented in this document) were rearranged for use in the subsequent analyses.

       ORS                        ERS                        MRS                        ARS
       τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4
Rep1  -1.47  -0.73   0.71   1.50   0.83   0.45  -0.47  -0.81  -1.46  -2.69   2.63   1.52  -0.25   0.39  -0.46   0.32
Rep2  -1.48  -0.70   0.78   1.41   0.86   0.39  -0.41  -0.85  -1.40  -2.71   2.66   1.46  -0.39   0.44  -0.42   0.38
Rep3  -1.45  -0.65   0.65   1.44   0.83   0.54  -0.52  -0.86  -1.44  -2.58   2.63   1.40  -0.19   0.45  -0.54   0.29
Rep4  -1.44  -0.68   0.62   1.50   0.83   0.51  -0.41  -0.93  -1.41  -2.59   2.55   1.45  -0.27   0.45  -0.48   0.30
Rep5  -1.48  -0.61   0.65   1.44   0.84   0.42  -0.48  -0.78  -1.47  -2.55   2.61   1.42  -0.38   0.70  -0.54   0.21

This label switching correction algorithm successfully identified switched labels when the quality of threshold recovery was fairly good.
However, this algorithm seemed to be rather strict, so that some switched labels were not detected automatically even though they were discernible when the complete set of all items' threshold estimates in all classes was inspected individually.

Label switching correction algorithm based on classification information. Tueller, Drotar, and Lubke (2011) developed a switched label detection algorithm that uses the respondent classification results after estimation is completed. Their algorithm assumes that the frequency of correctly classified cases must be greater than the frequencies of misclassified cases. Therefore, each column of the class assignment matrix must have exactly one column maximum. To help in understanding the algorithm developed by Tueller and his colleagues, three example matrices of the frequencies of class membership assignment are presented below. The columns of the matrices represent true class membership and the rows represent assigned class membership. The first matrix shows a case where the labels were not switched. The second matrix shows a case where the labels were switched and can be corrected. The third matrix shows a case where the labels were switched but cannot be corrected via their algorithm because one of its columns has more than one column maximum.

         Labels not switched       Labels switched           Cannot be corrected
         True 1  True 2  True 3    True 1  True 2  True 3    True 1  True 2  True 3
Assign1    96       6       2         9      60       9        38      33      36
Assign2     1      91       5        80       1      14        38      31      35
Assign3     3       7      89        11      39      77        24      36      34

Tueller et al. (2011) pointed out that reliable use of this algorithm requires reasonably high classification accuracy. They provided guidelines to prevent spurious correction by setting a class assignment criterion that allows the researcher to decide how many more respondents must be correctly assigned than would be expected by chance.
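The column-maximum idea can be illustrated with the three example matrices. The sketch below is a simplified reading of the description above, not the published implementation by Tueller et al. (2011): it omits their class assignment criterion, and the function name `resolve_labels` is hypothetical.

```python
def resolve_labels(assign):
    """Infer which true class each estimated class corresponds to.

    assign[r][c] is the number of respondents from true class c who
    were assigned to estimated class r. Returns a list giving the true
    class index for each estimated class, or None when the matrix does
    not identify a unique one-to-one match.
    """
    n = len(assign)
    mapping = {}
    for c in range(n):
        column = [assign[r][c] for r in range(n)]
        top = max(column)
        if column.count(top) != 1:
            return None  # tied column maxima: cannot be corrected
        mapping[column.index(top)] = c
    if len(mapping) != n:
        return None  # two true classes point to the same estimated class
    return [mapping[r] for r in range(n)]


not_switched = [[96, 6, 2], [1, 91, 5], [3, 7, 89]]
switched = [[9, 60, 9], [80, 1, 14], [11, 39, 77]]
uncorrectable = [[38, 33, 36], [38, 31, 35], [24, 36, 34]]

for matrix in (not_switched, switched, uncorrectable):
    print(resolve_labels(matrix))
```

In the second matrix the result shows that estimated classes 1 and 2 correspond to true classes 2 and 1, so swapping those two labels restores the generated order.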
Although drastic improvement was not anticipated from the additional application of Tueller's algorithm, it seemed to be a potential means of maximizing the efficiency of the automatic procedure for resolving label switching. Since Tueller's algorithm uses a different source of information, some replications for which the algorithm based on thresholds was not able to detect switched labels might be resolved via Tueller's algorithm.

Results of detecting and correcting switched labels. When both algorithms were able to resolve switched labels, they yielded identical results. Interestingly, switched labels in some replications were detected by only one of the algorithms. The two algorithms were therefore both incorporated in the analysis and, as a result, switched labels could be resolved automatically in more replications than when either algorithm was used alone.

There were thirteen simulation conditions in which label switching could not be resolved for some of the replications, despite the application of the two algorithms and a more in-depth manual inspection of the individual outputs. The following illustration presents a case of switched labels that could not be resolved by any of the three methods: i) the estimated thresholds did not satisfy the conditions on the order of thresholds, ii) the class assignment matrix presented more than one column maximum, and iii) manual inspection of the thresholds of all four items was not informative enough to separate the three classes.

Mean thresholds

       Class 1                    Class 2                    Class 3
       τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4
      -1.87  -0.97   1.44   1.41  -0.19  -0.82   0.17   0.84  -1.23  -0.53   0.72   1.04

Labels not switched

         True 1  True 2  True 3
Assign1   381       0      95
Assign2   347      56      18
Assign3   232      64       7

Item thresholds

       Class 1                    Class 2                    Class 3
       τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4    τ1     τ2     τ3     τ4
Item1 -2.28  -0.95   1.37   1.39  -0.77  -0.28  -0.21   0.89  -1.39  -0.47  -0.33  -0.57
Item2 -2.48  -0.91   1.57   1.74  -0.10  -0.52   0.04   0.95  -1.17  -0.27   2.22   3.26
Item3 -1.64  -1.05   0.81   1.12  -0.61  -0.85   0.07   0.53  -1.43  -0.90  -0.34  -0.34
Item4 -1.08  -0.99   1.99   1.38   0.71  -1.62   0.80   0.98  -0.94  -0.46   1.32   1.82

As the above example implies, the existence of unsolvable switched labels should not be regarded as an indication of any flaw or ineffectiveness of the algorithms. Instead, it seemed to reflect the nature of the generated data sets and/or the quality of the estimation. These thirteen conditions were also the ones for which model selection based on the information criteria failed to identify the correct data generation model (the related results of model selection are presented in Section 4.2). The specifications of the simulation conditions in which unsolvable switched labels were observed, together with their occurrence rates, are summarized in Table 11. For these thirteen conditions, unsolvable replications were discarded and only the remaining solvable solutions were used to compute the evaluation criteria.

Table 11. Specifications of Simulation Conditions in which Switched Labels are Unsolvable

Type of mixture    Mixing proportions  Number of items  Sample size  Percentage of the occurrences of unsolvable switched labels
ORS-ERS            0.9 : 0.1           4                1200         41
ORS-MRS            0.9 : 0.1           4                1200         41
ORS-MRS            0.9 : 0.1           4                3000         40
ORS-ERS-MRS        0.5 : 0.5           4                1200         14
ORS-ERS-MRS        0.9 : 0.1           4                1200         45
ORS-ERS-MRS        0.9 : 0.1           4                3000         45
ORS-ERS-MRS        0.9 : 0.1           4                6000         42
ORS-ERS-MRS-ARS    0.5 : 0.5           4                3000         46
ORS-ERS-MRS-ARS    0.5 : 0.5           4                6000         56
ORS-ERS-MRS-ARS    0.5 : 0.5           10               1200         43
ORS-ERS-MRS-ARS    0.9 : 0.1           4                3000         67
ORS-ERS-MRS-ARS    0.9 : 0.1           4                6000         63
ORS-ERS-MRS-ARS    0.9 : 0.1           10               3000         54

4.2. Model selection

Once the replications that did not converge had been replaced, the AIC, BIC, and CAIC values were collated from each of the three competing estimation solutions for each replication.
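For readers who want to see how such a comparison works mechanically, the sketch below computes the three indices from a log-likelihood and selects the model with the smallest value. It uses the standard textbook definitions of AIC, BIC, and CAIC, which this section does not spell out, so the exact forms reported by mdltm are an assumption here. The log-likelihoods and parameter counts are invented for illustration only.

```python
from math import log


def information_criteria(loglik, n_params, n_obs):
    """Standard AIC, BIC, and CAIC for a fitted model."""
    deviance = -2.0 * loglik
    return {
        "AIC": deviance + 2 * n_params,
        "BIC": deviance + n_params * log(n_obs),
        "CAIC": deviance + n_params * (log(n_obs) + 1),
    }


def select_models(fits, n_obs):
    """Pick, for each index, the class count with the smallest value.

    fits maps the number of classes to (log-likelihood, n_params).
    """
    values = {k: information_criteria(ll, p, n_obs)
              for k, (ll, p) in fits.items()}
    return {index: min(values, key=lambda k: values[k][index])
            for index in ("AIC", "BIC", "CAIC")}


# Hypothetical fits of 1-, 2-, and 3-class models to one 2-class data set.
fits = {1: (-20000.0, 16), 2: (-19500.0, 34), 3: (-19475.0, 52)}
selected = select_models(fits, n_obs=3000)
print(selected)
```

In this made-up example the small gain in log-likelihood from the third class is enough for the AIC but not for the BIC or CAIC, whose penalties grow with the sample size.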
The percentage of replications in which each of the competing models was identified as the best-fitting model by each information criterion index was recorded. Tables 12-16 present the results. Generally, the BIC and CAIC performed nearly equally well, with a slightly higher accuracy rate for the BIC across many conditions. On the other hand, the AIC showed an over-identification problem (choosing a model with more classes) across all of the simulation conditions. In the current study, the BIC was found to be the most effective information criterion statistic for identifying the correct number of latent classes of the MPCM. The model selection results for each type of mixture are presented in detail in the following sections.

Model selection under the ORS-ERS mixtures. Table 12 presents the selection results for the ORS-ERS mixtures. The ORS-ERS mixtures were well recognized as 2-response-style mixtures based on the BIC and CAIC across all simulation conditions. An exception was the condition with 4 items and a sample size of N = 1200 with unequal mixing proportions, which resulted in under-identification (choosing a model with fewer classes) in 97% of replications. Note that this condition also presented unsolvable label switching in 41% of replications.

Table 13 presents the results of model selection under the ORS-MRS mixtures. Generally, the ORS-MRS mixtures were not identified as correctly as the other types of 2-response-style mixtures. As introduced in Section 2.6.2, the "degree of heterogeneity" is related to the difficulty of detecting component distributions in the MPCM. Rost (1991) predicted that it is easier to "unmix" the mixture distribution when the item parameters and threshold distances differ strongly.
Looking back at the category characteristic curves (CCCs) illustrated in Figures 4 through 8, the differences between the ORS and MRS thresholds may be seen as less distinctive than those between the ORS and ERS thresholds or between the ORS and ARS thresholds. Consequently, the ORS-MRS mixtures were relatively more difficult to identify as a mixture distribution.

Table 12. Model Selection under the ORS-ERS Mixtures

                          Information criterion and number of classes of the estimation model
Mixing                          AIC              BIC              CAIC
proportions  Items  Sample    1    2    3      1    2    3      1    2    3
50:50        4      1200      0   89   11      0  100    0      0  100    0
50:50        4      3000      0   97    3      0  100    0      0  100    0
50:50        4      6000      0   92    8      0  100    0      0  100    0
50:50        10     1200      0   74   26      0  100    0      0  100    0
50:50        10     3000      0   85   15      0  100    0      0  100    0
50:50        10     6000      0   72   28      0  100    0      0  100    0
50:50        20     1200      0   78   22      0  100    0      0  100    0
50:50        20     3000      0   55   45      0  100    0      0  100    0
50:50        20     6000      0   52   48      0  100    0      0  100    0
90:10        4      1200      0   92    8     97    3    0    100    0    0
90:10        4      3000      0   87   13      1   99    0      6   94    0
90:10        4      6000      0   89   11      0  100    0      0  100    0
90:10        10     1200      0   71   29      0  100    0      0  100    0
90:10        10     3000      0   70   30      0  100    0      0  100    0
90:10        10     6000      0   38   62      0  100    0      0  100    0
90:10        20     1200      0   57   43      0  100    0      0  100    0
90:10        20     3000      0   52   48      0  100    0      0  100    0
90:10        20     6000      0   96    4      0  100    0      0  100    0

Under the condition with 4 items and a sample size of N = 1200 with equal mixing proportions, only 48% of the ORS-MRS data sets were correctly identified. When the mixing proportions were unequal, the correct model selection rates based on the BIC or CAIC became even lower, and an increase in the sample size from N = 1200 to N = 6000 did not improve the rates substantially. Even when the number of items was increased to ten, the correct selection rate was still very low (5%) with a sample size of N = 1200.
102 Table 13.Model Selection under the ORS-MRS Mixtures Information Criterion AIC BIC CAIC Number of classes of the estimation model 1 2 3 1 2 3 1 2 3 Type of Mixture Mixing Proportions Item Sample 1200 0 83 17 52 48 0 77 23 0 4 3000 0 34 66 0 100 0 0 100 0 6000 0 11 89 0 100 0 0 100 0 1200 0 85 15 0 100 0 0 100 0 50:50 10 3000 0 86 14 0 100 0 0 100 0 6000 0 88 12 0 100 0 0 100 0 1200 0 59 41 0 100 0 0 100 0 20 3000 0 78 22 0 100 0 0 100 0 ORS 6000 0 81 19 0 100 0 0 100 0 MRS 1200 34 58 8 100 0 0 100 0 0 4 3000 1 94 5 100 0 0 100 0 0 6000 3 64 33 90 10 0 92 8 0 1200 0 65 35 95 5 0 100 0 0 90:10 10 3000 0 40 60 9 91 0 12 88 0 6000 0 5 95 0 100 0 0 100 0 1200 0 38 62 0 100 0 5 95 0 20 3000 0 8 92 0 100 0 0 100 0 6000 0 52 48 0 100 0 0 100 0 Table 14 presents the results of model selection under the ORS-ARS mixtures. All levels of ORS-ARS data sets were identified correctly as a 2-class mixture based on the BIC and the CAIC. It appeared that the highly pronounced thresholds characteristics in the ARS class i.e., all thresholds are positive for half of items and all thresholds are negative for the other half of items, made the identification of this class easier than the identification of either the ERS or MRS class. 103 Table 14. 
Model Selection under the ORS-ARS Mixtures Information Criterion AIC BIC CAIC Number of classes of the estimation model 1 2 3 1 2 3 1 2 3 Type of Mixture Mixing Proportions Item Sample 1200 0 79 21 0 100 0 0 100 0 4 3000 0 56 44 0 100 0 0 100 0 6000 0 24 76 0 100 0 0 100 0 1200 0 50 50 0 100 0 0 100 0 50:50 10 3000 0 27 73 0 100 0 0 100 0 6000 0 11 89 0 100 0 0 100 0 1200 0 46 54 0 100 0 0 100 0 20 3000 0 10 90 0 100 0 0 100 0 ORS 6000 0 0 100 0 100 0 0 100 0 ARS 1200 0 86 14 0 100 0 0 100 0 4 3000 0 86 14 0 100 0 0 100 0 6000 0 77 23 0 100 0 0 100 0 1200 0 43 57 0 100 0 0 100 0 90:10 10 3000 0 20 80 0 100 0 0 100 0 6000 0 1 99 0 100 0 0 100 0 1200 0 24 76 0 100 0 0 100 0 20 3000 0 52 48 0 100 0 0 100 0 6000 0 3 97 0 100 0 0 100 0 Table 15 and Table 16 present the results of the model selection for the 3- response-style and 4-response-style mixtures. Given the results of the 2-response style mixtures, it was foreseen that the data generation model with three or four response styles would have difficulties to be identified under the 4-items conditions. The results showed that if each response style constitutes an equal proportion of population a sample size of N = 1200 with 10-items seemed to be minimum condition in which 3- response-style or 4-response-style mixtures can be correctly identified based on the BIC or the CAIC. When the mixing proportions were unequal, a sample size of N = 3000 with 10-items seemed to be necessary for the correct model selection. 104 Table 15. 
Model Selection under the ORS-ERS-MRS Mixtures
Columns 2, 3, 4 under each information criterion give the number of classes of the estimation model.
Type of       Mixing                       AIC              BIC              CAIC
mixture       proportions  Item  Sample    2    3    4      2    3    4      2    3    4
ORS-ERS-MRS   33:33:33     4     1200     12   74   14     99    1    0    100    0    0
                                 3000      0   93    7     99    1    0     99    1    0
                                 6000      0   87   13     38   62    0     57   43    0
                           10    1200      0   85   15      0  100    0      0  100    0
                                 3000      0   96    4      0  100    0      0  100    0
                                 6000      0   87   13      0  100    0      0  100    0
                           20    1200      0   91    9      0  100    0      0  100    0
                                 3000      0   84   16      0  100    0      0  100    0
                                 6000      0   88   12      0  100    0      0  100    0
              80:10:10     4     1200     67   29    4    100    0    0    100    0    0
                                 3000     14   61   25     84   16    0     93    7    0
                                 6000      3   57   40     96    4    0     97    3    0
                           10    1200      0   75   25     94    6    0    100    0    0
                                 3000      0   75   25      0  100    0      0  100    0
                                 6000      0   69   31      0  100    0      0  100    0
                           20    1200      0   91    9      0  100    0      5   95    0
                                 3000      0   51   49      0  100    0      0  100    0
                                 6000      0   71   29      0  100    0      0  100    0
Table 16. Model Selection under the ORS-ERS-MRS-ARS Mixtures
Columns 3, 4, 5 under each information criterion give the number of classes of the estimation model.
Type of           Mixing                       AIC              BIC              CAIC
mixture           proportions  Item  Sample    3    4    5      3    4    5      3    4    5
ORS-ERS-MRS-ARS   25:25:25:25  4     1200     33   65    2     99    1    0     99    1    0
                                     3000     16   16   68     99    1    0     99    1    0
                                     6000      0   88   12     94    6    0     99    1    0
                               10    1200      0   89   11      4   96    0     23   77    0
                                     3000      0   84   16      0  100    0      0  100    0
                                     6000      0   91    9      0  100    0      0  100    0
                               20    1200      0   96    4      0  100    0      0  100    0
                                     3000      0   86   14      0  100    0      0  100    0
                                     6000      0   71   29      0  100    0      0  100    0
                  70:10:10:10  4     1200     62   26   12    100    0    0    100    0    0
                                     3000     24   41   35     96    4    0     99    1    0
                                     6000      1   67   32     46   54    0     48   52    0
                               10    1200      0   80   20     44   56    0     45   55    0
                                     3000      0   78   22      7   93    0      7   93    0
                                     6000      0   59   41      0  100    0      0  100    0
                               20    1200      0   92    8      0  100    0      6   94    0
                                     3000      0   72   28      0  100    0      0  100    0
                                     6000      0   37   63      0  100    0      0  100    0
4.3 Classification of Respondents
The simulation results regarding the classification of respondents with respect to their response style are presented in two parts: (i) correct classifications and (ii) misclassifications. The mean percentage of respondents who were correctly assigned to their true (generated) class membership was computed over one hundred replications as an index of classification accuracy.
Likewise, the mean percentage of respondents who were incorrectly assigned to a class other than their true class was computed as an index of the misclassification rate. In addition, the standard error (SE) of the classification accuracy and the SE of the misclassification rate were obtained by computing the standard deviation of the one hundred percentage values.
4.3.1. Classification accuracy
Classification accuracy for each response class is presented in Table 17, with the SE of the classification accuracy in parentheses. The blank cells in the table represent conditions in which a high proportion of replications presented estimation problems; the classification accuracy was not computed for them. The cells marked with asterisks are conditions in which a high percentage of unsolvable label-switching problems occurred. For those conditions, the classification accuracy was computed from fewer solutions, excluding the unsolvable replications. The conditions marked with asterisks, however, presented an unexpected trend in the simulation results. Although the simulated testing circumstances were relatively "poor" (e.g., fewer test items and a smaller sample size), the classification accuracy turned out to be better. One explanation for this aberrant trend is that the solutions that achieved relatively more accurate estimates were selectively retained. The classification accuracies were also accompanied by very high SEs under those conditions. Taking all of this information into account, the conditions marked with asterisks were excluded from the ANOVA along with the conditions with estimation problems.
Table 17.
Percentages of Correct Classification and Standard Errors of Classification Accuracy Type of mixture and mixing proportions ORS 0.5 ERS 0.5 ORS 0.9 ERS 0.1 ORS 0.5 MRS 0.5 ORS 0.9 MRS 0.1 ORS 0.5 ARS 0.5 ORS 0.9 ARS 0.1 Assigned class ORS ERS ORS ERS ORS MRS ORS MRS ORS ARS ORS ARS Item Sample size 1200 80.78 (4.1) 86.88 (5.3) 90.14* (6.0) 66.14* (11.5) 80.60* (8.0) 72.60* (7.2) 65.32* (14.0) 71.69* (22.4) 94.07 (2.2) 94.02 (2.1) 98.45 (1.1) 86.27 (4.1) 4 3000 81.46 (2.5) 87.77 (2.3) 93.73 (2.8) 61.43 (10.2) 90.70 (2.4) 58.32 (5.5) 91.72* (3.4) 50.02* (10.3) 94.50 (1.8) 93.77 (1.8) 98.80 (0.7) 86.41 (2.4) 6000 81.35 (2.2) 88.26 (1.9) 95.19 (1.3) 58.04 (5.7) 91.09 (1.8) 57.97 (4.9) 69.27 (3.0) 58.22 (4.8) 95.05 (1.2) 93.41 (1.3) 98.81 (0.6) 86.40 (1.8) 1200 93.15 (1.2) 96.30 (1.0) 97.36 (0.7) 86.58 (4.1) 90.85 (1.9) 87.94 (2.2) 98.00 (0.5) 98.64 (0.7) 10 3000 93.32 (0.9) 96.27 (0.6) 97.39 (0.4) 87.18 (2.7) 91.05 (1.1) 88.24 (1.3) 97.45 (0.9) 69.15 (4.2) 98.91 (0.4) 98.79 (0.6) 99.76 (0.2) 96.27 (1.3) 6000 93.19 (0.6) 96.39 (0.5) 97.48 (0.3) 87.50 (1.5) 91.38 (0.8) 88.06 (0.8) 98.04 (0.4) 67.88 (2.3) 98.93 (0.2) 98.88 (0.4) 99.84 (0.1) 96.18 (0.9) 1200 96.69 (0.6) 98.93 (0.4) 98.29 (0.5) 96.22 (1.8) 96.44 (1.0) 94.19 (1.5) 99.69 (0.3) 99.61 (0.3) 20 3000 96.73 (0.5) 98.88 (0.3) 98.38 (0.3) 96.31 (1.2) 96.49 (0.5) 94.38 (0.7) 99.04 (0.2) 86.17 (2.4) 99.74 (0.2) 99.63 (0.2) 99.90 (0.1) 98.84 (0.7) 6000 96.77 (0.4) 98.94 (0.2) 98.39 (0.2) 96.45 (0.9) 96.76 (0.4) 94.16 (0.5) 99.15 (0.2) 86.14 (1.5) 99.76 (0.1) 99.64 (0.2) 99.94 (0.1) 98.87 (0.5) 108 Table 17_continued. 
Type of mixture and mixing proportions ORS 0.33 ERS0.33 MRS 0.33 ORS 0.8 ERS 0.1 MRS 0.1 Assigned class ORS ERS MRS ORS ERS MRS Item Sample size 1200 55.27* (11.1) 78.45* (13.4) 77.32* (8.4) 59.44* (13.2) 69.38* (13.7) 65.47* (10.6) 4 3000 59.29 (9.5) 83.16 (9.3) 75.61 (8.7) 73.66* (12.7) 60.48* (11.5) 64.32* (11.5) 6000 61.6 (10.0) 84.63 (7.7) 74.51 (9.8) 80.27* (14.0) 61.11* (9.5) 55.03* (15.5) 1200 83.27 (3.4) 95.73 (1.4) 87.60 (2.9) 10 3000 83.98 (1.9) 96.06 (0.9) 88.65 (1.5) 94.76 (0.7) 87.43 (2.7) 70.49 (3.7) 6000 84.26 (1.5) 96.14 (0.6) 88.74 (1.2) 95.32 (0.6) 87.56 (1.8) 69.01 (2.6) 1200 93.06 (1.5) 98.80 (0.7) 94.14 (1.4) 20 3000 93.27 (0.9) 98.84 (0.4) 94.42 (1.0) 97.18 (0.4) 96.38 (1.1) 86.71 (2.3) 6000 93.42 (0.8) 98.83 (0.3) 94.43 (0.7) 97.32 (0.3) 96.82 (0.9) 86.75 (2.1) Type of mixture and mixing proportions ORS 0.25 ERS 0.25 MRS 0.25 ARS 0.25 ORS 0.7 ERS 0.1 MRS 0.1ARS 0.1 Assigned class ORS ERS MRS ARS ORS ERS MRS ARS Item Sample size 1200 4 3000 6000 1200 10 3000 84.02 (1.6) 94.14 (0.8) 88.49 (1.4) 95.07 (0.6) 6000 84.03 (1.2) 94.04 (0.7) 88.81 (1.0) 95.16 (0.5) 94.76 (0.6) 86.44 (2.3) 70.70 (2.8) 94.20 (1.1) 1200 92.66 (1.8) 98.26 (0.9) 94.21 (1.8) 97.37 (1.1) 20 3000 93.24 (1.1) 98.24 (0.5) 94.03 (1.0) 97.53 (0.6) 97.16 (0.4) 95.95 (1.5) 87.94 (2.1) 97.15 (1.0) 6000 93.42 (0.7) 98.37 (0.3) 94.49 (0.6) 97.52 (0.4) 97.25 (0.3) 96.13 (0.8) 87.47 (1.5) 97.19 (0.6) Note.* Calculated excluding some of replications for which switched labels were unsolvable. 109 In the following reports of the factorial ANOVA results, only the effects that were both statistically and practically significant are interpreted for their importance. Overall classification accuracy. The percentages of correct classification obtained for each class were averaged across latent classes within the given mixture as an index of overall classification accuracy and used as a dependent variable of the factorial ANOVA. 
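The classification indices described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the function names and the toy labels are hypothetical, the SE is taken as the sample standard deviation (n - 1 denominator) over replications, and the overall accuracy is assumed to be the unweighted average of the class-level means.

```python
from statistics import mean, stdev


def class_accuracy(true_labels, assigned_labels, cls):
    """Percentage of respondents generated in `cls` that were assigned to `cls`."""
    assigned_for_cls = [a for t, a in zip(true_labels, assigned_labels) if t == cls]
    return 100.0 * sum(a == cls for a in assigned_for_cls) / len(assigned_for_cls)


def summarize(replications, classes):
    """Return {class: (mean accuracy, SD over replications)} and the overall accuracy.

    Assumption: the overall accuracy is the unweighted average of the class means.
    """
    per_class = {}
    for cls in classes:
        values = [class_accuracy(t, a, cls) for t, a in replications]
        per_class[cls] = (mean(values), stdev(values))
    overall = mean([m for m, _ in per_class.values()])
    return per_class, overall


# Hypothetical toy data: two replications, each a (true labels, assigned labels) pair.
replications = [
    (["ORS", "ORS", "ORS", "ORS", "ERS", "ERS"],
     ["ORS", "ORS", "ORS", "ERS", "ERS", "ERS"]),
    (["ORS", "ORS", "ORS", "ORS", "ERS", "ERS"],
     ["ORS", "ORS", "ERS", "ERS", "ERS", "ORS"]),
]
per_class, overall = summarize(replications, ["ORS", "ERS"])
print(per_class, overall)
```

With one hundred replications per condition, `replications` would hold one hundred such pairs, and the misclassification rate for a given true/assigned pair could be computed the same way by counting a different assigned label.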
Table 18 summarizes the results of the factorial ANOVA on the overall classification accuracy.
Table 18. Factorial ANOVA Results on Overall Classification Accuracy
Source                 Type III Sum of Squares   df   F         p      η²
Mixture                1079.91                   4    460.53    0.00   0.28
Proportion             104.35                    1    178.00    0.00   0.03
Item                   1635.07                   2    1394.57   0.00   0.42
Sample                 0.04                      2    0.04      0.97   0.00
mixture * item         422.27                    7    102.90    0.00   0.11
proportion * item      48.93                     2    41.73     0.00   0.01
item * sample          0.34                      4    0.15      0.96   0.00
mixture * proportion   48.69                     4    20.77     0.00   0.01
mixture * sample       0.57                      8    0.12      0.99   0.00
proportion * sample    1.14                      2    0.97      0.39   0.00
Error                  17.00                     29
Corrected total        3908.84
The significant factors for the overall classification accuracy were the main effects of the type of mixture (F(4,29) = 460.53, p < .001; η² = 0.28) and test length (F(2,29) = 1394.57, p < .001; η² = 0.42), as well as the interaction effect between type of mixture and test length (F(7,29) = 102.90, p < .001; η² = 0.11). The effect sizes of the two main effects were large, whereas that of the interaction effect was medium. Table 19 presents the cell means of the classification accuracy at the levels of the test length and the type of mixture.
Table 19. Cell Means of the Overall Classification Accuracy
         Mixture
Item     OE      OM      OA      OEM     OEMA    Total
4        81.49   70.93   93.33   73.13   na      79.72
10       93.51   87.00   98.42   87.27   89.16   91.07
20       97.58   94.29   99.56   94.69   95.28   96.28
Total    91.41   86.10   96.87   88.00   92.98
For the significant main effects, post-hoc comparisons were conducted. The results of the Tukey HSD tests (with α_FW = .05) showed that the overall classification accuracy differed significantly among all five types of mixtures as well as among the three levels of test length.
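As a reading aid for Table 18, the reported effect sizes and F ratios can be reproduced from the sums of squares. The sketch below assumes the effect size is the classical (non-partial) eta squared, SS_effect / SS_corrected total. That formula is an inference from the tabled values rather than something stated in this section, and the variable names are illustrative.

```python
# Sums of squares and degrees of freedom copied from Table 18.
ss_effects = {
    "mixture": (1079.91, 4),
    "proportion": (104.35, 1),
    "item": (1635.07, 2),
    "mixture * item": (422.27, 7),
}
ss_error, df_error = 17.00, 29
ss_corrected_total = 3908.84


def eta_squared(ss_effect):
    """Classical eta squared: share of the corrected total sum of squares (assumed)."""
    return ss_effect / ss_corrected_total


def f_ratio(ss_effect, df_effect):
    """F = mean square of the effect divided by the mean square error."""
    return (ss_effect / df_effect) / (ss_error / df_error)


results = {
    name: (round(eta_squared(ss), 2), f_ratio(ss, df))
    for name, (ss, df) in ss_effects.items()
}
for name, (eta2, f_value) in results.items():
    print(f"{name}: eta squared = {eta2:.2f}, F = {f_value:.2f}")
```

Small differences from the tabled F values are expected because the published sums of squares are rounded to two decimals.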
As anticipated in the earlier sections from the degree of heterogeneity in the threshold plots, the mixtures of ORS and ARS respondents were most accurately classified (96.87%), while the ORS and MRS mixtures were the most difficult to distinguish correctly (86.10%). The 3-response-style mixtures showed a lower level of overall classification accuracy than the 4-response-style mixtures. This appears to reflect the contribution of the low classification accuracy of the MRS class to the overall classification for the 3-response-style mixtures and the contribution of the high classification accuracy of the ARS class for the 4-response-style mixtures. Regardless of the type of mixture, the overall classification accuracy was higher than 94% when the test length was I = 20.
The interaction effect between the type of mixture and test length was further investigated. In the interaction plot presented in Figure 16, it was observed that the increase in the classification accuracy between the test lengths of I = 10 and I = 20 for the ORS-ARS mixture was relatively smaller than that increase for the other types of mixture. The pairwise comparisons of the three levels of test length for each type of mixture showed that the increase between I = 10 and I = 20 for the ORS-ARS mixture was significant at p < .05, whereas that increase for the other four mixtures was significant at p < .001.
Figure 16. Interaction effect between type of mixture and test length on the overall classification accuracy
Classification accuracy for each response style. In addition to the overall classification accuracy, the classification accuracy for each response-style class was also evaluated. Table 20 summarizes the four factorial ANOVA results and presents only the effects that met both the statistical and practical significance criteria.
Table 20. Effect Size (η²) for the Classification Accuracy Conditional on Statistical Significance (p < 0.05)
Source           ORS    ERS    MRS    ARS
mixture          0.21
proportion              0.14
item             0.26   0.35   0.23   0.49
mixture * item   0.14
Test length was the common factor influencing the classification of ORS, ERS, MRS, and ARS respondents. Regardless of the type of response style, as the number of items increased, the correct classification rate increased significantly: M_4 (86.51) < M_10 (93.35) < M_20 (96.93) for ORS; M_4 (78.60) < M_10 (91.98) < M_20 (97.65) for ERS; M_4 (64.93) < M_10 (81.06) < M_20 (91.31) for MRS; and M_4 (90.05) < M_10 (96.65) < M_20 (98.34) for ARS.
The results of the Tukey HSD tests (with α_FW = .05) on the main effect of the type of mixture showed that 98.38% of ORS respondents were correctly classified in the ORS-ARS mixtures, whereas only 86.39% of them were correctly identified in the ORS-ERS-MRS mixtures. In the rest of the mixtures, 92.83% of ORS respondents on average were correctly classified. These classification accuracy rates differed significantly (p < .05). The interaction effect found for the ORS class followed the same pattern as the interaction effect for the overall classification accuracy.
The mixing proportions influenced the classification of ERS respondents. Under the equal proportions conditions, ERS respondents were classified significantly better than under the unequal proportions conditions: M_unequal (82.97) < M_equal (94.12) (p < .001).
A noteworthy result in the classification accuracy analysis was that the sample size was not a significant factor. As may be noticed in Table 17, the differences in the classification accuracy rates at the three sample sizes were negligible in most of the conditions. If this model is used in empirical studies to detect people with different response styles, the number of items of an instrument is the most important factor to be considered.
As long as a sufficient number of items (at least ten) is used, a sample of N = 1200 would provide a level of classification accuracy equivalent to that of a larger sample of N = 6000.
4.3.2. Misclassification
To investigate whether misclassification occurred particularly between certain types of response styles, the 3-response-style and 4-response-style mixtures were examined with respect to all possible mismatches between the true (generated) and assigned classes. Since classification rates did not differ significantly across the levels of sample size, Table 21 summarizes the marginal misclassification rates over the three levels of sample size.
Table 21. Percentages of Misclassified Respondents
Type of mixture: ORS-ERS-MRS (equal mixing proportions)
True class       ORS 0.33  ORS 0.33  ERS 0.33  ERS 0.33  MRS 0.33  MRS 0.33
Assigned class   ERS       MRS       ORS       MRS       ORS       ERS
Item 4           17.50     23.78     16.58     1.33      21.53     2.65
Item 10          6.62      9.53      3.92      0.11      10.38     1.28
Item 20          3.13      3.60      1.17      0.01      4.96      0.72
Total            9.08      12.30     7.22      0.48      12.29     1.55
Type of mixture: ORS-ERS-MRS (unequal mixing proportions)
True class       ORS 0.8   ORS 0.8   ERS 0.1   ERS 0.1   MRS 0.1   MRS 0.1
Assigned class   ERS       MRS       ORS       MRS       ORS       ERS
Item 4           9.86      19.00     31.79     4.56      36.59     2.20
Item 10          2.64      2.33      12.35     0.16      29.68     0.58
Item 20          2.37      2.29      2.38      0.02      8.90      0.55
Total            4.96      7.87      15.50     1.58      25.06     1.11
Type of mixture: ORS-ERS-MRS-ARS (equal mixing proportions)
True class       ORS 0.25  ORS 0.25  ORS 0.25  ERS 0.25  ERS 0.25  ERS 0.25  MRS 0.25  MRS 0.25  MRS 0.25  ARS 0.25  ARS 0.25  ARS 0.25
Assigned class   ERS       MRS       ARS       ORS       MRS       ARS       ORS       ERS       ARS       ORS       ERS       MRS
Item 4
Item 10          6.46      9.19      0.32      3.67      0.07      2.12      10.17     1.30      0.04      0.40      4.52      0.01
Item 20          3.17      3.85      0.04      1.19      0.01      0.56      5.01      0.87      0.01      0.08      2.48      3.17
Total            4.82      6.52      0.18      2.43      0.04      1.34      7.59      1.30      0.87      0.03      0.08      0.40
Type of mixture: ORS-ERS-MRS-ARS (unequal mixing proportions)
True class       ORS 0.7   ORS 0.7   ORS 0.7   ERS 0.1   ERS 0.1   ERS 0.1   MRS 0.1   MRS 0.1   MRS 0.1   ARS 0.1   ARS 0.1   ARS 0.1
Assigned class   ERS       MRS       ARS       ORS       MRS       ARS       ORS       ERS       ARS       ORS       ERS       MRS
Item 4
Item 10          2.75      2.39      0.10      11.47     0.23      1.85      28.71     0.58      0.01      1.65      4.14      0.01
Item 20          1.72      1.07      0.01      3.47      0.00      0.50      12.10     0.19      0.01      0.31      2.52      0.00
Total            2.24      1.73      0.06      7.47      0.12      1.18      20.41     0.39      0.01      0.98      3.33      0.01
For the 3-response-style mixtures, the most common misclassification was that of MRS respondents within the ORS class (MO) under unequal mixing proportions, followed by the misclassification of ERS respondents within the ORS class (EO) under unequal mixing proportions (hereafter, a misclassification of "A" respondents within the "B" class is referred to as AB, while a misclassification of "B" respondents within the "A" class is referred to as BA). On the other hand, EM and ME rarely occurred. In particular, when the test was long and the overall classification accuracy was therefore high, the chance of EM was essentially zero. This rare occurrence of EM was consistent regardless of the mixing proportions.
When the mixing proportions were equal, OE and EO as well as OM and MO did not differ significantly. However, when the mixing proportions were unequal (i.e., 10% of the population was MRS or ERS respondents while the majority was ORS respondents), MO and EO increased significantly (25.06% and 15.50%, respectively).
It seems that respondents with a distorted response style were more easily misclassified into the normal response-style class when the distorted group was small. However, this trend was not observed for the ARS class.
Under the 4-response-style mixture, the chance of MO and EO was also significantly high (20.41% and 7.47%, respectively), and EM and ME again rarely occurred (0.08% and 0.85%, respectively). In addition, several other misclassifications had essentially zero chance of occurrence: OA (0.12%), MA (0.44%), AO (0.5%), and AM (0.2%).
4.4 Threshold Parameter Recovery
Recovery of item thresholds was evaluated with respect to the RMSE, Pearson correlation, and SE. Initially, these three evaluation measures were assessed for each of the four thresholds. The evaluation measures were then averaged across the four thresholds for use in the ANOVA. The averaged evaluation measures are provided in the following sections. Sections 4.4.1, 4.4.2, and 4.4.3 discuss the results for each of the evaluation criteria based on the factorial ANOVA.
4.4.1. Evaluation of the RMSE.
The averaged RMSE is presented in Table 22, followed by the factorial ANOVA results for each latent class in the subsequent section.
Table 22.
RMSE of Threshold Parameter Estimates Type of mixture ORS ERS ORS MRS ORS ARS ORS ERS MRS ORS ERS MRS ARS Class ORS ERS ORS MRS ORS ARS ORS ERS MRS ORS ERS MRS ARS Mixing Sample Item Proportions 1200 4 .194 .238 10 .144 .201 .153 .281 .137 .275 .197 .252 .348 20 .140 .195 .144 .230 .139 .280 .179 .238 .291 .209 .283 .339 .411 4 .117 .162 .207 .536 .095 .222 .305 .231 .438 Equal 3000 10 .092 .125 .102 .166 .090 .193 .123 .155 .196 .146 .185 .245 .270 20 .087 .125 .095 .144 .090 .197 .112 .148 .177 .131 .173 .204 .249 4 .085 .114 .196 .434 .060 .172 .212 .166 .318 6000 10 .064 .090 .078 .115 .066 .156 .089 .113 .141 .102 .130 .166 .190 20 .062 .087 .072 .103 .066 .159 .080 .106 .124 .098 .125 .146 .139 1200 4 10 .104 .518 20 .103 .461 Unequal 4 .091 .511 3000 10 .067 .315 .072 .614 .068 .445 .077 .331 .557 20 .066 .285 .070 .391 .068 .403 .070 .287 .332 .076 .287 .382 .420 4 .057 .358 .196 .455 .050 .411 6000 10 .046 .225 .054 .357 .051 .301 .053 .229 .370 .060 .253 .343 .351 20 .046 .201 .052 .252 .050 .283 .051 .206 .258 .057 .231 .260 .289 117 ORS class. The factorial ANOVA results of the RMSE of threshold parameter estimates for the ORS class (RMSE-threshold-ORS) are presented in Table 23. Table 23. Factorial ANOVA Results on the RMSE of Threshold Estimates for ORS Class Source Type III Sum of Squares Df F P 2? Mixture 0.029 4 102.08 0.00 0.16 proportion 0.005 1 67.53 0.00 0.03 Sample 0.017 2 120.08 0.00 0.09 Item 0.024 2 168.67 0.00 0.13 mixture * item 0.032 7 65.02 0.00 0.18 proportion * item 0.000 2 0.61 0.55 0.00 sample * item 0.001 4 2.89 0.04 0.01 mixture * proportion 0.001 4 4.18 0.01 0.01 mixture * sample 0.002 8 3.28 0.01 0.01 proportion * sample 0.000 2 3.53 0.04 0.00 Error 0.002 27 Corrected total 0.182 63 The significant factors on the RMSE-threshold-ORS were the main effect of the type of mixture (F(4,27) = 102.08, p < .001; 2? = 0.16), sample size (F(2,27) = 120.08, p < .001; 2? = 0.09), test length (F(2,27) = 168.67, p < .001; 2? 
= 0.13), as well as the interaction effect between type of mixture and test length (F(7,27) = 65.02, p < .001; η² = 0.18). Table 24 presents the cell means of the RMSE at the levels of the type of mixture, sample size, and test length.
Table 24. Cell Means of the RMSE of Threshold Estimates for the ORS Class
                 Mixture
Sample   Item    OE      OM      OA      OEM     OEMA
1200     4       0.194   na      na      na      na
         10      0.124   0.153   0.137   0.197   na
         20      0.122   0.144   0.139   0.179   0.209
         total   0.137   0.149   0.138   0.188   0.209
3000     4       0.104   0.207   0.082   0.305   na
         10      0.080   0.087   0.079   0.100   0.146
         20      0.077   0.083   0.079   0.091   0.104
         total   0.087   0.109   0.080   0.137   0.118
6000     4       0.071   0.196   0.055   0.212   na
         10      0.055   0.066   0.059   0.071   0.081
         20      0.054   0.062   0.055   0.066   0.078
         total   0.060   0.108   0.056   0.097   0.079
In general, the RMSE-threshold-ORS decreased consistently as the sample size and test length increased in each type of mixture. For the significant main effects, post-hoc comparisons were conducted. The results of the Tukey HSD test (with α_FW = .05) showed that the RMSE-threshold-ORS differed as follows: M_OA (0.078) < M_OE (0.092) < M_OEMA (0.110) = M_OM (0.115) < M_OEM (0.129), where an inequality sign indicates a significant difference and an equality sign indicates a nonsignificant difference.
Regarding the main effect of the sample size, the decrease in the RMSE-threshold-ORS as sample size increased was significant between all three levels based on the Tukey HSD test (with α_FW = .05): M_1200 (0.154) > M_3000 (0.103) > M_6000 (0.080). Regarding the main effect of the test length, the decrease in the RMSE-threshold-ORS was significant as the test length increased from I = 4 to I = 10 but was not significant as the test length increased from I = 10 to I = 20: M_4 (0.138) > M_10 (0.093) = M_20 (0.093).
The significant interaction effect between the type of mixture and test length is depicted in Figure 17.
In the figure, the superior recovery of the ORS threshold parameters in the ORS-ARS mixture is clearly seen even at the I = 4 level. The pairwise comparisons of the three levels of test length for each type of mixture showed that the change in the RMSE from I = 4 to I = 10, as well as that from I = 10 to I = 20, was not statistically significant for the ORS-ARS mixture, while the decrease in the RMSE from I = 4 to I = 10 was significant for all other types of mixture.
Figure 17. Interaction effect between type of mixture and test length on the RMSE of threshold estimates for the ORS class
ERS class. The factorial ANOVA results of the RMSE of threshold parameter estimates for the ERS class (RMSE-threshold-ERS) are presented in Table 25.
Table 25. Factorial ANOVA Results on the RMSE of Threshold Estimates for the ERS Class
Source                 Type III Sum of Squares   df   F        p      η²
Mixture                0.008                     2    30.06    0.00   0.02
Proportion             0.133                     1    995.61   0.00   0.32
Sample                 0.074                     2    275.62   0.00   0.18
Item                   0.029                     2    107.41   0.00   0.07
mixture * item         0.001                     3    2.46     0.11   0.00
proportion * item      0.016                     2    60.20    0.00   0.04
sample * item          0.002                     4    2.93     0.07   0.00
mixture * proportion   0.001                     2    1.99     0.18   0.00
mixture * sample       0.001                     4    2.06     0.15   0.00
proportion * sample    0.018                     2    66.68    0.00   0.04
Error                  0.002                     12
Corrected total        0.417                     36
The significant factors for the RMSE-threshold-ERS were the main effects of the mixing proportions (F(1,12) = 995.61, p < .001; η² = 0.32), sample size (F(2,12) = 275.62, p < .001; η² = 0.18), and test length (F(2,12) = 107.41, p < .001; η² = 0.07). Table 26 presents the cell means of the RMSE at the levels of the mixing proportions, sample size, and test length.
Table 26.
Cell Means of the RMSE of Threshold Estimates for the ERS Class
                 Mixing Proportions
Sample   Item    Equal   Unequal
1200     4       0.238   na
         10      0.227   0.518
         20      0.239   0.461
         total   0.235   0.490
3000     4       0.197   0.511
         10      0.155   0.323
         20      0.149   0.286
         total   0.163   0.336
6000     4       0.140   0.358
         10      0.111   0.236
         20      0.106   0.213
         total   0.116   0.243
The Tukey HSD test (with α_FW = .05) showed the same patterns of significant differences as those observed for the RMSE-threshold-ORS. Regarding the main effect of the sample size, the decrease in the RMSE-threshold-ERS as sample size increased was significant between all three levels: M_1200 (0.298) > M_3000 (0.237) > M_6000 (0.176). Regarding the main effect of the test length, the decrease in the RMSE-threshold-ERS was significant as the test length increased from I = 4 to I = 10 but was not significant as the test length increased from I = 10 to I = 20: M_4 (0.254) > M_10 (0.223) = M_20 (0.215).
The main effect of the mixing proportions showed a smaller RMSE when the mixing proportions were equal: M_Unequal (0.313) > M_Equal (0.166). The mixing proportion was not a significant factor for the ORS class, whereas it was a significant factor for the ERS class as well as for the other two classes. This makes sense because the ORS class always took on the larger proportion of the generated samples, while the ERS, MRS, and ARS classes each took on only 10% of the respondents under the unequal mixing proportions.
MRS class. The factorial ANOVA results of the RMSE of threshold parameter estimates for the MRS class (RMSE-threshold-MRS) are presented in Table 27.
Table 27. Factorial ANOVA Results on the RMSE of Threshold Estimates for the MRS Class
Source                 Type III Sum of Squares   df   F        p      η²
Mixture                0.005                     2    3.97     0.05   0.01
Proportion             0.077                     1    120.46   0.00   0.13
Sample                 0.128                     2    99.95    0.00   0.22
Item                   0.096                     2    74.69    0.00   0.17
mixture * item         0.012                     3    6.32     0.01   0.02
proportion * item      0.039                     2    30.22    0.00   0.07
sample * item          0.008                     3    3.93     0.04   0.01
mixture * proportion   0.005                     2    4.09     0.05   0.01
mixture * sample       0.002                     4    0.76     0.58   0.00
proportion * sample    0.016                     1    25.66    0.00   0.03
Error                  0.006                     10
Corrected total        0.573                     32
As was found for the ERS class, the significant factors for the RMSE-threshold-MRS were the main effects of the mixing proportions (F(1,10) = 120.46, p < .001; η² = 0.13), sample size (F(2,10) = 99.95, p < .001; η² = 0.22), and test length (F(2,10) = 74.69, p < .001; η² = 0.17). While the most influential factor for the ERS class was the mixing proportions, the sample size was the most important factor for the MRS class. Table 28 presents the cell means of the RMSE at the levels of the mixing proportions, sample size, and test length.
Table 28. Cell Means of the RMSE of Threshold Estimates for the MRS Class
                 Mixing Proportions
Sample   Item    Equal   Unequal
1200     4       na      na
         10      0.31    na
         20      0.29    na
         total   0.30    na
3000     4       0.49    na
         10      0.20    0.59
         20      0.18    0.37
         total   0.26    0.46
6000     4       0.38    0.46
         10      0.14    0.36
         20      0.12    0.26
         total   0.19    0.33
The Tukey HSD test (with α_FW = .05) showed the same patterns of significant differences as those observed for the previous two classes. Regarding the main effect of the sample size, the significant differences were as follows: M_1200 (0.337) > M_3000 (0.298) > M_6000 (0.256). Regarding the main effect of the test length, the significant differences were as follows: M_4 (0.436) > M_10 (0.300) = M_20 (0.242). In addition, the main effect of the mixing proportions showed a significant difference: M_Unequal (0.381) > M_Equal (0.193).
ARS class. The factorial ANOVA results of the RMSE of threshold parameter estimates for the ARS class (RMSE-threshold-ARS) are presented in Table 29.
Table 29.
Factorial ANOVA Results on the RMSE of Threshold Estimates for the ARS Class
Source                 Type III Sum of Squares   df   F        p      η²
Mixture                0.011                     1    43.47    0.00   0.06
Proportion             0.094                     1    360.78   0.00   0.49
Sample                 0.055                     2    105.98   0.00   0.29
Item                   0.008                     2    16.29    0.01   0.04
mixture * item         0.001                     1    5.65     0.08   0.01
proportion * item      0.004                     2    7.65     0.04   0.02
sample * item          0.000                     3    0.02     1.00   0.00
mixture * proportion   0.000                     1    0.01     0.92   0.00
mixture * sample       0.006                     2    10.76    0.03   0.03
proportion * sample    0.004                     1    16.38    0.02   0.02
Error                  0.001                     4
Corrected total        0.190                     20
The significant factors for the RMSE-threshold-ARS were the type of mixture (F(1,4) = 43.47, p < .001; η² = 0.06), mixing proportions (F(1,4) = 360.78, p < .001; η² = 0.49), and sample size (F(2,4) = 105.98, p < .001; η² = 0.29). Unlike for the other classes, test length was not a practically significant factor for the ARS class. Table 30 presents the cell means of the RMSE at the levels of the mixing proportions, sample size, and type of mixture.
The Tukey HSD test (with α_FW = .05) showed that the decrease in the RMSE-threshold-ARS from N = 1200 to N = 3000 was not significant, while the decrease from N = 3000 to N = 6000 was significant: M_1200 (0.336) = M_3000 (0.322) > M_6000 (0.241). Regarding the main effect of the test length, a significant difference was found between I = 4 and I = 10: M_4 (0.358) > M_10 (0.273) = M_20 (0.279). In addition, the main effect of the mixing proportions showed a significant difference: M_Unequal (0.387) > M_Equal (0.163).
Table 30. Cell Means of the RMSE of Threshold Estimates for the ARS Class
                      Type of mixture
Proportion   Sample   OA      OEMA
Equal        1200     0.278   0.411
             3000     0.204   0.260
             6000     0.162   0.165
             Total    0.207   0.252
Unequal      1200     na      na
             3000     0.492   0.420
             6000     0.317   0.320
             Total    0.405   0.353
4.4.2. Evaluation of the correlation
The second criterion used to evaluate the threshold parameter recovery was the Pearson correlation coefficient between generated and estimated thresholds.
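The two recovery criteria discussed so far can be sketched as follows. This is an illustrative sketch only: the helper names and the toy threshold values are hypothetical, the RMSE is assumed to take its usual form (square root of the mean squared difference between estimated and generated values over items), and each measure is computed per threshold and then averaged across the four thresholds, as described at the start of Section 4.4.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def rmse(generated, estimated):
    """Root mean squared error of the estimates around the generated values."""
    squared = [(e - g) ** 2 for g, e in zip(generated, estimated)]
    return sqrt(sum(squared) / len(squared))


def average_over_thresholds(metric, generated, estimated):
    """Apply `metric` to each threshold's item vector, then average the results."""
    values = [metric(g, e) for g, e in zip(generated, estimated)]
    return sum(values) / len(values)


# Hypothetical toy values: 4 thresholds x 3 items; every estimate is biased by +0.1.
generated = [[-2.0, -1.5, -1.0], [-0.5, 0.0, 0.5], [0.4, 0.8, 1.2], [1.5, 2.0, 2.5]]
estimated = [[g + 0.1 for g in row] for row in generated]

mean_r = average_over_thresholds(pearson, generated, estimated)
mean_rmse = average_over_thresholds(rmse, generated, estimated)
print(mean_r, mean_rmse)
```

The toy case shows why the two criteria complement each other: a constant bias leaves the correlation at 1 while the RMSE stays above zero.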
Table 31 reports the correlations that were averaged across the four thresholds. The factorial ANOVA conducted on the correlation measures showed that the significant factors for each response-style class, considering both statistical and practical importance, were the same as those found to be significant for the RMSE measures. The factorial ANOVA results for the correlation measures are presented concisely in a single table, Table 32, instead of in four separate ANOVA tables.
Table 31. Correlations Between Generated and Estimated Threshold Parameters Type of mixture ORS ERS ORS MRS ORS ARS ORS ERS MRS ORS ERS MRS ARS Class ORS ERS ORS MRS ORS ARS ORS ERS MRS ORS ERS MRS ARS Mixing Sample Item Proportions 1200 4 .794 .694 10 .888 .858 .836 .852 .857 .986 .760 .792 .808 20 .885 .862 .839 .880 .853 .986 .786 .802 .831 .736 .748 .790 .966 4 .910 .836 .832 .780 .930 .994 .691 .755 .758 Equal 3000 10 .932 .933 .925 .937 .935 .995 .883 .905 .916 .848 .870 .887 .986 20 .951 .930 .924 .946 .930 .994 .892 .908 .922 .864 .878 .903 .988 4 .949 .908 .882 .836 .967 .997 .840 .855 .837 6000 10 .966 .965 .959 .971 .965 .995 .934 .947 .953 .919 .931 .937 .993 20 .963 .964 .960 .973 .963 .997 .943 .949 .960 .920 .935 .939 .994 1200 4 10 .911 .546 20 .909 .587 Unequal 4 .945 .469 .964 .954 3000 10 .961 .705 .961 .632 .963 .963 .950 .699 .653 20 .959 .738 .959 .767 .960 .968 .954 .741 .751 .947 .746 .763 .965 4 .980 .603 .895 .780 .980 .985 6000 10 .982 .818 .980 .840 .982 .984 .976 .827 .790 .973 .790 .803 .978 20 .979 .843 .979 .862 .980 .985 .976 .848 .853 .975 .802 .820 .978
Table 32. Effect Size (η²) for the Correlation of Threshold Parameters Conditional on Statistical Significance (p < 0.05)
Factor           ORS    ERS    MRS    ARS
Mixture          0.15
Proportion              0.29   0.24   0.67
Sample           0.16   0.18   0.27   0.33
Item             0.09   0.14   0.11
mixture * item   0.08
ORS class.
The factorial ANOVA results of the correlation of threshold parameter estimates for the ORS class (Correlation-threshold-ORS) showed that the main effects of the type of mixture (F(4,27) = 73.33, p < .001; η² = 0.15), sample size (F(2,27) = 154.60, p < .001; η² = 0.16), and test length (F(2,27) = 82.97, p < .001; η² = 0.09), as well as the interaction effect between type of mixture and test length (F(7,27) = 22.98, p < .001; η² = 0.08), were significant. Table 33 presents the cell means of the correlation at the levels of the type of mixture, sample size, and test length.
For the main effect of the type of mixture, the means differed as follows: M_OA (0.945) = M_OE (0.933) > M_OM (0.918) > M_OEMA (0.898) > M_OEM (0.882). Regarding the main effect of the sample size, the increase in the Correlation-threshold-ORS as sample size increased was significant between all three levels: M_1200 (0.838) < M_3000 (0.919) < M_6000 (0.954). Regarding the main effect of the test length, the increase in the Correlation-threshold-ORS was significant as the test length increased from I = 4 to I = 10 but was not significant as the test length increased from I = 10 to I = 20: M_4 (0.897) < M_20 (0.923) = M_10 (0.954).
Table 33. Cell Means of the Correlation for the ORS Class
                 Mixture
Sample   Item    OE      OM      OA      OEM     OEMA
1200     4       0.794   na      na      na      na
         10      0.900   0.836   0.857   0.760   na
         20      0.897   0.839   0.853   0.786   0.736
         total   0.877   0.838   0.855   0.773   0.736
3000     4       0.928   0.832   0.947   0.691   na
         10      0.947   0.943   0.949   0.917   0.848
         20      0.955   0.942   0.945   0.923   0.906
         total   0.943   0.920   0.947   0.874   0.886
6000     4       0.965   0.889   0.974   0.840   na
         10      0.974   0.970   0.974   0.955   0.946
         20      0.971   0.970   0.972   0.960   0.948
         total   0.970   0.943   0.973   0.934   0.947
The significant interaction effect between the type of mixture and test length showed the same pattern as the interaction effect found in the RMSE-threshold-ORS evaluation.
The interaction was basically due to the superior recovery of the ORS thresholds in some types of mixture even when only four items were used.

ERS class. The factorial ANOVA results of the Correlation-threshold-ERS showed that the main effects of the mixing proportions (F(1,12) = 1210.70, p < .001; η² = 0.29), sample size (F(2,12) = 378.92, p < .001; η² = 0.18), and test length (F(2,12) = 286.76, p < .001; η² = 0.14) were significant. Table 34 presents the cell means of the correlation at the levels of the independent variables of mixing proportions, sample size, and test length.

Table 34. Cell Means of the Correlation for the ERS Class

Sample  Item    Equal  Unequal
1200    4       0.694  na
1200    10      0.825  0.546
1200    20      0.804  0.587
1200    total   0.793  0.567
3000    4       0.796  0.469
3000    10      0.903  0.702
3000    20      0.905  0.742
3000    total   0.877  0.683
6000    4       0.882  0.603
6000    10      0.948  0.812
6000    20      0.949  0.831
6000    total   0.932  0.790

The main effect of the mixing proportions showed a higher correlation of threshold parameters when the mixing proportions were equal: MUnequal (0.717) < MEqual (0.874). Regarding the main effect of the sample size, the increase in the Correlation-threshold-ERS as sample size increased was significant between all three levels: M1200 (0.736) < M3000 (0.794) < M6000 (0.866). Regarding the main effect of the test length, the increase in the Correlation-threshold-ERS was significant as the test length increased from I = 4 to I = 10 but was not significant as the test length increased from I = 10 to I = 20: M4 (0.731) < M10 (0.828) = M20 (0.830).

MRS class. The factorial ANOVA on the Correlation-threshold-MRS showed that the main effects of the mixing proportions (F(1,10) = 140.08, p < .001; η² = 0.24), sample size (F(2,10) = 79.03, p < .001; η² = 0.27), and test length (F(2,10) = 31.05, p < .001; η² = 0.11) were significant. Table 35 presents the cell means of the correlation at the levels of the independent variables of mixing proportions, sample size, and test length.
Table 35. Cell Means of the Correlation of Threshold Estimates for the MRS Class

Sample  Item    Equal  Unequal
1200    4       na     na
1200    10      0.830  na
1200    20      0.834  na
1200    total   0.832  na
3000    4       0.769  na
3000    10      0.913  0.643
3000    20      0.924  0.760
3000    total   0.881  0.713
6000    4       0.837  0.780
6000    10      0.954  0.811
6000    20      0.957  0.845
6000    total   0.926  0.821

The main effect of the mixing proportions showed a higher correlation when the mixing proportions were equal: MUnequal (0.776) < MEqual (0.886). Regarding the main effect of the sample size, the increase in the Correlation-threshold-MRS between N = 3000 and N = 6000 was significant, but the difference between N = 1200 and N = 3000 was not: M1200 (0.832) = M3000 (0.817) < M6000 (0.877). Regarding the main effect of the test length, the increase in the Correlation-threshold-MRS was significant as the test length increased from I = 4 to I = 10 but was not significant as the test length increased from I = 10 to I = 20: M4 (0.798) < M10 (0.845) = M20 (0.864).

ARS class. The factorial ANOVA results of the Correlation-threshold-ARS showed that the main effects of the mixing proportions (F(1,5) = 125.59, p < .001; η² = 0.67) and sample size (F(2,5) = 35.56, p < .001; η² = 0.33) were significant. Table 36 presents the cell means of the correlation at the levels of the independent variables of mixing proportions and sample size.

Table 36. Cell Means of the Correlation of Threshold Estimates for the ARS Class

Sample  Equal  Unequal
1200    0.979  na
3000    0.991  0.963
6000    0.995  0.982
total   0.990  0.973

Regarding the main effect of the sample size, the increase in the Correlation-threshold-ARS between N = 3000 and N = 6000 was significant, but the difference between N = 1200 and N = 3000 was not: M1200 (0.979) = M3000 (0.979) < M6000 (0.989).

4.4.3. Evaluation of the standard error

The third criterion used to evaluate the threshold parameter recovery was the standard error of estimates (SE), which was calculated as the standard deviation of the estimated thresholds across all replications.
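The SE definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the input layout (one list of threshold estimates per replication), and the use of the sample standard deviation (n - 1 denominator) are not specified in the study and are chosen here for convenience.

```python
import statistics


def empirical_se(estimates):
    """Empirical SE of one threshold: SD of its estimates over replications.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    return statistics.stdev(estimates)


def mean_se_across_thresholds(replications):
    """Average the per-threshold empirical SEs within one simulation condition.

    replications: list of per-replication threshold vectors, e.g. four
    thresholds [t1, t2, t3, t4] for a 5-category item (hypothetical layout).
    """
    n_thresholds = len(replications[0])
    per_threshold = [
        empirical_se([rep[k] for rep in replications])
        for k in range(n_thresholds)
    ]
    return sum(per_threshold) / n_thresholds
```

Averaging over thresholds mirrors the way the tabled SE values are described as averaged across the four thresholds.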
Table 37 reports the SE averaged across the four thresholds. The factorial ANOVA results for the SE measures are presented concisely in a single table, Table 38, instead of in four separate ANOVA tables.

Table 37. SE of Threshold Parameter Estimates

                      ORS-ERS       ORS-MRS       ORS-ARS       ORS-ERS-MRS         ORS-ERS-MRS-ARS
Mixing   Sample Item  ORS   ERS     ORS   MRS     ORS   ARS     ORS   ERS   MRS     ORS   ERS   MRS   ARS
Equal    1200   4     .118  .133
Equal    1200   10    .053  .071    .056  .104    .047  .080    .078  .090  .135
Equal    1200   20    .036  .046    .037  .060    .034  .062    .047  .057  .078    .054  .067  .093  .088
Equal    3000   4     .069  .091    .060  .274    .055  .102    .268  .152  .327
Equal    3000   10    .032  .044    .035  .063    .030  .053    .047  .051  .077    .057  .065  .097  .089
Equal    3000   20    .027  .029    .025  .036    .021  .040    .029  .035  .047    .033  .039  .054  .055
Equal    6000   4     .052  .064    .042  .199    .038  .073    .179  .114  .246
Equal    6000   10    .024  .031    .024  .043    .021  .038    .035  .040  .057    .041  .046  .064  .064
Equal    6000   20    .015  .021    .016  .027    .017  .027    .021  .024  .034    .023  .026  .044  .045
Unequal  1200   4
Unequal  1200   10    .035  .180
Unequal  1200   20    .024  .113
Unequal  3000   4     .064  .334                  .043  .354
Unequal  3000   10    .024  .106    .029  .240    .024  .150    .047  .051  .077
Unequal  3000   20    .016  .073    .016  .100    .016  .099    .029  .035  .047    .019  .067  .106  .095
Unequal  6000   4     .033  .195    .045  .211    .029  .221
Unequal  6000   10    .016  .079    .017  .138    .017  .097    .035  .040  .057    .021  .076  .137  .105
Unequal  6000   20    .011  .047    .011  .067    .011  .067    .021  .023  .033    .012  .049  .070  .068

Table 38. Effect Size (η²) for the SE Conditional on Statistical Significance (p < 0.05)

Factor              ORS    ERS    MRS    ARS
Mixture             0.22
Proportion                 0.14          0.32
Item                0.23   0.31   0.34   0.36
Sample size                0.09   0.09   0.08
Mixture * item      0.28
Proportion * item          0.10          0.12

ORS class. The factorial ANOVA results of the SE-threshold-ORS showed that the main effects of the type of mixture (F(4,27) = 80.03, p < .001; η² = 0.22) and test length (F(2,27) = 173.73, p < .001; η² = 0.23), as well as the interaction effect between type of mixture and test length (F(7,27) = 59.30, p < .001; η² = 0.28), were significant.
Table 39 presents the cell means of the SE at the levels of the independent variables of type of mixture and test length.

Table 39. Cell Means of the SE of Threshold Estimates for the ORS Class

Item    OE     OM     OA     OEM    OEMA
4       0.067  0.049  0.041  0.224  na
10      0.031  0.032  0.028  0.053  0.040
20      0.022  0.021  0.020  0.032  0.028
total   0.038  0.032  0.029  0.088  0.033

The levels of the type of mixture differed from each other as follows: MOA (0.029) = MOM (0.032) = MOEMA (0.033) = MOE (0.038) < MOEM (0.088). Regarding the main effect of the test length, the decrease in the SE-threshold-ORS was significant at the three levels: M4 = 0.078 > M10 = 0.035 > M20 = 0.024 in MRS class (M1200 = 0.067 < M3000 = 0.116 < M6000 = 0.232), and in ARS class (M1200 = 0.080 < M3000 = 0.119 < M6000 = 0.276). The significant interaction effect between the type of mixture and test length was mainly due to the poor stability of the I = 4 short test in estimating ORS thresholds in mixtures of more than two latent classes. The interaction plot is presented in Figure 18. Pairwise comparisons showed that the difference in the SE between any two mixtures was not significant for the I = 20 conditions.

Figure 18. Interaction effect between type of mixture and test length on the SE of threshold estimates for the ORS class

ERS class. The factorial ANOVA results of the SE-threshold-ERS showed that the main effects of the mixing proportions (F(1,12) = 79.18, p < .001; η² = 0.14), test length (F(2,12) = 88.07, p < .001; η² = 0.31), and sample size (F(1,12) = 25.89, p < .001; η² = 0.09), as well as the interaction effect between mixing proportion and test length (F(2,12) = 29.18, p < .001; η² = 0.10), were significant. Table 40 presents the cell means of the SE at the levels of the independent variables of mixing proportion, test length, and sample size.

Table 40. Cell Means of the SE of Threshold Estimates for the ERS Class

Sample  Item    Equal  Unequal
1200    4       0.133  na
1200    10      0.081  0.180
1200    20      0.057  0.113
1200    total   0.077  0.147
3000    4       0.122  0.334
3000    10      0.053  0.079
3000    20      0.034  0.058
3000    total   0.063  0.111
6000    4       0.089  0.195
6000    10      0.039  0.065
6000    20      0.024  0.040
6000    total   0.046  0.073

The main effect of the mixing proportions showed a larger standard error when the mixing proportions were unequal: MUnequal (0.098) > MEqual (0.061). Regarding the main effect of the sample size, the decrease in the SE was significant between N = 3000 and N = 6000 and was not significant between N = 1200 and N = 3000: M1200 (0.095) = M3000 (0.084) > M6000 (0.058). Regarding the main effect of the test length, the decrease was significant at all three levels: M4 (0.155) > M10 (0.069) > M20 (0.047). The significant interaction effect between the mixing proportion and test length was also because of the disproportionate increase in the SE for the I = 4 condition. The interaction plot is presented in Figure 19.

Figure 19. Interaction effect between type of mixing proportion and test length on the SE of threshold estimates for the ERS class

MRS class. The factorial ANOVA results of the SE-threshold-MRS showed that the main effects of the sample size (F(2,10) = 22.35, p < .001; η² = 0.09) and test length (F(2,10) = 85.26, p < .001; η² = 0.34) were significant. Table 41 presents the cell means of the SE at the levels of the independent variables of mixing proportion, test length, and sample size.

Table 41. Cell Means of the SE of Threshold Estimates for the MRS Class

Sample  Item    SE
1200    4       na
1200    10      0.120
1200    20      0.077
1200    total   0.094
3000    4       0.301
3000    10      0.111
3000    20      0.065
3000    total   0.119
6000    4       0.219
6000    10      0.083
6000    20      0.046
6000    total   0.095

Regarding the main effect of the test length, the decrease was significant at all three levels: M4 (0.251) > M10 (0.099) > M20 (0.060).
Based on the Tukey HSD test (with α_FW = .05), the differences in the SE between the three levels of sample size were as follows: M1200 (0.094) = M3000 (0.119) > M6000 (0.095).

ARS class. The factorial ANOVA results of the SE-threshold-ARS showed that the effects that were significant for the ERS class were also significant for the ARS class. The significant factors were the main effects of the mixing proportions (F(1,5) = 128.61, p < .001; η² = 0.32), test length (F(2,5) = 73.97, p < .001; η² = 0.36), and sample size (F(2,5) = 17.18, p < .001; η² = 0.08), as well as the interaction effect between mixing proportion and test length (F(2,5) = 25.00, p < .001; η² = 0.12). Table 42 presents the cell means of the SE at the levels of the independent variables of mixing proportion, test length, and sample size.

Table 42. Cell Means of the SE of Threshold Estimates for the ARS Class

Sample  Item    Equal  Unequal
1200    4       na     0.354
1200    10      0.080  0.150
1200    20      0.075  0.097
1200    total   0.077  0.175
3000    4       0.102  0.221
3000    10      0.071  0.101
3000    20      0.048  0.068
3000    total   0.068  0.112
6000    4       0.073  0.288
6000    10      0.051  0.117
6000    20      0.036  0.082
6000    total   0.049  0.140

4.5 Person Trait Parameter Recovery

The mdltm program, which uses marginal maximum likelihood estimation, provides for each respondent as many class-specific person trait estimates (θ) as there are classes specified in the model. The assigned estimate is the one associated with the class for which the respondent's posterior probability of class membership is the highest. If a respondent is incorrectly classified, he or she is given an improper estimate, one obtained within a class of respondents who may be qualitatively different from him or her. The current study analyzed the accuracy of θ recovery for the following groups of respondents: i) the whole group of respondents based on their assigned class membership (i.e., all misclassified respondents were included), and ii) the group of correctly classified respondents. In real-world data analysis, a respondent's true latent class membership is unknown, and, hence, how inaccurately his or her θ is assessed due to incorrect classification is never known. These separate analyses of θ recovery provided not only the accuracy of θ recovery but also a quantification of the impact of misclassification on θ recovery. Recovery of person trait parameters was evaluated with respect to bias, RMSE, and Pearson correlation.

4.5.1. Evaluation of the bias

The factorial ANOVA results on the bias of person trait estimates showed that none of the main effects of the four manipulated factors or their two-way interaction effects was statistically or practically significant. Table 43 reports the marginal bias for all respondents (whole group) and for the correctly classified respondents (selected group) in each simulation condition. As can be seen in Table 43, the bias was very small and fluctuated around zero across all simulation conditions.

Table 43. Theta Recovery for All Respondents and Correctly Classified Respondents

Type of mixture and Mixing proportions ORS 0.5 ERS 0.5 ORS 0.9 ERS 0.1 Whole Group Selected Group Whole Group Selected Group ORS ERS ORS ERS ORS ERS ORS ERS Item Sample Bias -.001 .004 .005 .001 1200 RMSE .540 .559 .497 .487 Corr .776 .876 .805 .882 Bias -.001 .003 -.001 .002 -.002 .005 -.001 .009 4 3000 RMSE .507 .560 .492 .487 .503 .658 .495 .530 Corr .785 .878 .807 .884 .839 .899 .846 .887 Bias .002 .003 .003 .002 .013 .009 -.002 -.006 6000 RMSE .503 .559 .490 .486 .516 .573 .496 .521 Corr .786 .879 .807 .885 .859 .861 .849 .888 Bias -.001 -.001 -.001 .000 .000 -.008 .000 -.009 1200 RMSE .368 .450 .343 .368 .358 .607 .347 .382 Corr .911 .916 .924 .932 .926 .908 .932 .933 Bias .003 .001 .002 .000 .001 .008 .001 .005 10 3000 RMSE .366 .449 .340 .367 .355 .609 .345 .372 Corr .912 .917 .925 .933 .926 .909 .932 .933 Bias .000 .000 .001 .000 .000 .000 .000 .003 6000 RMSE .365 .450 .341 .365 .356 .607 .346 .369 Corr .912 .917 .924 .933 .926 .910
.931 .934 Bias -.002 .001 -.002 .000 .001 -.005 .001 -.001 1200 RMSE .270 .370 .252 .293 .262 .551 .255 .288 Corr .957 .941 .962 .957 .962 .917 .964 .958 Bias .001 .001 .001 .000 .001 .000 .001 .002 20 3000 RMSE .270 .368 .253 .289 .262 .555 .255 .290 Corr .956 .941 .962 .958 .962 .917 .964 .958 Bias .000 .000 .000 .000 .001 .001 .001 .001 6000 RMSE .269 .367 .252 .290 .262 .554 .256 .289 Corr .957 .942 .962 .958 .962 .919 .964 .958

Table 43 (continued)

Type of mixture and Mixing proportions ORS 0.5 MRS 0.5 ORS 0.9 MRS 0.1 Whole Group Selected Group Whole Group Selected Group Assigned class ORS MRS ORS MRS ORS MRS ORS MRS Item Sample Bias 1200 RMSE Corr Bias .004 .008 .003 .006 4 3000 RMSE .576 .733 .510 .757 Corr .857 .087 .871 .080 Bias .000 -.010 .000 -.010 0.001 0.002 0.001 0.000 6000 RMSE .575 .731 .508 .755 0.576 0.732 0.566 0.754 Corr .858 .032 .871 .032 0.858 0.025 0.859 0.028 Bias .016 -.007 .000 .003 1200 RMSE .511 .733 .358 .472 Corr .861 .716 .938 .840 Bias -.001 .003 .001 .000 -0.001 -0.005 -0.001 0.000 10 3000 RMSE .418 .484 .358 .468 0.370 0.532 0.354 0.501 Corr .931 .827 .938 .844 0.933 0.779 0.936 0.768 Bias .000 -.001 .000 .002 0.000 0.014 0.000 0.014 6000 RMSE .418 .477 .356 .469 0.373 0.487 0.356 0.482 Corr .932 .830 .939 .844 0.933 0.746 0.936 0.776 Bias -.001 .002 .001 -.003 1200 RMSE .416 .478 .268 .354 Corr .933 .830 .965 .921 Bias -.001 -.004 .001 -.001 -0.001 0.000 -0.001 -0.002 20 3000 RMSE .319 .368 .268 .354 0.281 0.390 0.268 0.360 Corr .959 .913 .965 .922 0.962 0.891 0.965 0.907 Bias .000 -.001 .000 .001 0.000 0.002 0.000 0.001 6000 RMSE .317 .367 .265 .353 0.281 0.383 0.268 0.358 Corr .959 .914 .965 .922 0.962 0.888 0.964 0.905

Table 43 (continued)

Type of mixture and Mixing proportions ORS 0.5 ARS 0.5 ORS 0.9 ARS 0.1 Whole Group Selected Group Whole Group Selected Group Assigned class ORS ARS ORS ARS ORS ARS ORS ARS Item Sample Bias 1200 RMSE Corr Bias -0.005 -0.001 -0.003 -0.001 0.002 0.007 0.002 0.000 4 3000 RMSE 0.508 0.585
0.498 0.585 0.503 0.632 0.505 0.654 Corr 0.876 0.788 0.859 0.771 0.863 0.686 0.865 0.746 Bias -0.003 0.008 0.000 -0.001 0.003 -0.001 0.002 0.000 6000 RMSE 0.505 0.590 0.501 0.586 0.504 0.612 0.506 0.615 Corr 0.864 0.812 0.860 0.765 0.863 0.706 0.866 0.761 Bias 1200 RMSE Corr Bias -0.002 0.025 -0.004 0.026 -0.001 0.037 -0.001 0.036 10 3000 RMSE 0.357 0.428 0.353 0.425 0.354 0.428 0.356 0.431 Corr 0.936 0.905 0.934 0.902 0.935 0.889 0.936 0.896 Bias -0.001 0.024 -0.003 0.024 0.000 0.027 0.000 0.027 6000 RMSE 0.356 0.429 0.352 0.426 0.356 0.423 0.358 0.426 Corr 0.936 0.905 0.934 0.902 0.935 0.889 0.936 0.892 Bias 0.001 0.027 -0.002 0.026 1200 RMSE 0.264 0.879 0.266 0.328 Corr 0.964 0.978 0.964 0.946 Bias 0.001 0.028 0.000 0.028 0.001 0.027 0.001 0.027 20 3000 RMSE 0.268 0.330 0.266 0.327 0.267 0.325 0.268 0.332 Corr 0.964 0.947 0.964 0.947 0.964 0.944 0.964 0.946 Bias 0.002 0.026 0.001 0.026 0.000 0.025 0.000 0.024 6000 RMSE 0.267 0.329 0.264 0.326 0.267 0.321 0.268 0.326 Corr 0.965 0.947 0.965 0.947 0.964 0.945 0.964 0.946

Table 43 (continued)

Type of mixture and mixing proportions ORS 0.33 ERS 0.33 MRS 0.33 ORS 0.8 ERS 0.1 MRS 0.1 Whole Group Selected Group Whole Group Selected Group Assigned class ORS ERS MRS ORS ERS MRS ORS ERS MRS ORS ERS MRS Item Sample Bias 1200 RMSE Corr Bias -.016 .002 -.004 -.002 .001 -.012 4 3000 RMSE .701 .591 .681 .520 .483 .664 Corr .755 .871 .610 .823 .883 .624 Bias .010 .001 .040 -.004 -.002 .014 6000 RMSE .633 .589 .664 .505 .487 .657 Corr .777 .873 .584 .830 .883 .615 Bias .000 -.002 -.006 .003 -.001 -.013 1200 RMSE .429 .482 .496 .345 .366 .475 Corr .900 .908 .824 .927 .932 .842 Bias -.001 .001 .001 .000 .000 .001 .000 .009 .007 .000 .003 .012 10 3000 RMSE .420 .483 .485 .343 .365 .471 .680 .824 .776 .666 .692 .705 Corr .905 .909 .830 .928 .933 .845 .733 .772 .520 .738 .763 .571 Bias .000 -.003 .001 .001 -.001 .001 .000 .001 .006 .000 .003 .005 6000 RMSE .420 .480 .480 .343 .365 .469 .375 .603 .491 .346 .365 .481 Corr .906 .910 .830
.928 .933 .845 .922 .909 .757 .932 .935 .785 Bias .002 -.001 .001 .002 -.001 .002 1200 RMSE .314 .397 .370 .254 .289 .355 Corr .949 .935 .913 .962 .958 .921 Bias .000 .000 .001 .001 -.001 .000 .000 -.006 .011 .000 .002 .002 20 3000 RMSE .310 .396 .369 .251 .290 .353 .276 .552 .398 .255 .291 .356 Corr .950 .935 .914 .963 .958 .922 .959 .916 .885 .964 .958 .910 Bias .000 -.002 .003 .000 -.001 -.002 -.008 -.004 .012 .000 -.001 .000 6000 RMSE .312 .394 .370 .252 .290 .352 .311 .334 .436 .256 .287 .287 Corr .950 .936 .913 .963 .958 .923 .951 .945 .907 .964 .959 .959

Table 43 (continued)

Type of mixture and mixing proportions ORS 0.25 ERS 0.25 MRS 0.25 ARS 0.25 ORS 0.7 ERS 0.1 MRS 0.1 ARS 0.1 Whole Group Selected Group Whole Group Selected Group Assigned class ORS ERS MRS ARS ORS ERS MRS ARS ORS ERS MRS ARS ORS ERS MRS ARS Item Sample Bias 1200 RMSE Corr Bias 4 3000 RMSE Corr Bias 6000 RMSE Corr Bias 1200 RMSE Corr Bias .001 .000 .002 -.002 -.011 -.014 -.002 .006 10 3000 RMSE .334 .370 .465 .418 .408 .541 .481 .440 Corr .908 .901 .801 .865 .915 .903 .815 .875 Bias -.011 -.014 -.002 .006 -.012 -.007 .006 .001 .001 -.001 .000 .001 .001 -.001 .000 .001 6000 RMSE .338 .371 .469 .420 .408 .541 .481 .440 .378 .632 .508 .442 .347 .372 .479 .424 Corr .915 .903 .815 .875 .915 .903 .815 .875 .921 .905 .760 .868 .931 .933 .791 .879 Bias -.001 -.001 .001 -.003 -.001 -.001 .001 .003 1200 RMSE .254 .291 .355 .317 .312 .419 .399 .324 Corr .950 .932 .905 .939 .950 .932 .905 .938 Bias .001 .000 .002 -.002 .000 .001 .002 -.003 -.001 .001 .005 -.005 -.001 .001 .005 -.005 20 3000 RMSE .252 .288 .358 .313 .313 .419 .399 .324 .278 .553 .405 .328 .255 .288 .379 .318 Corr .950 .932 .905 .938 .950 .932 .905 .938 .959 .917 .886 .935 .964 .959 .902 .940 Bias .000 .002 -.002 .029 .000 .001 .002 -.003 .001 .002 .000 .001 .001 .002 .000 .001 6000 RMSE .251 .289 .352 .313 .313 .418 .396 .320 .278 .566 .381 .323 .255 .288 .357 .315 Corr .950 .932 .905 .938 .950 .932 .905 .938 .959 .916 .893 .937 .964
.958 .908 .941

4.5.2. Evaluation of the RMSE

The factorial ANOVA was conducted on the RMSE measures for both the whole group and the selected group. Since the same factors were found to be significant in these two analyses, only the factorial ANOVA results for the whole group are reported in this section. The results showed that test length was the common significant factor across all four response-style classes and was the only significant factor for the ORS, MRS, and ARS classes. The type of mixture was another significant factor for the ERS class. These ANOVA results are presented concisely in a single table, Table 44.

Table 44. Effect Size (η²) for the RMSE of Theta Estimates Conditional on Statistical Significance (p < 0.05)

Factor    ORS    ERS    MRS    ARS
Mixture          0.11
Item      0.39   0.10   0.39   0.53

The test length was the significant factor on the RMSE for person trait parameters in the ORS class (F(2,24) = 99.09, p < .001; η² = 0.39), ERS class (F(2,10) = 8.66, p < .001; η² = 0.10), MRS class (F(2,9) = 33.05, p < .001; η² = 0.39), and ARS class (F(2,4) = 3234.06, p < .001; η² = 0.53). In addition to the main effect of the test length, the type of mixture was significant for the ERS class (F(2,10) = 10.24, p < .001; η² = 0.11).

Based on the Tukey HSD test (α_FW = .05), the RMSE differences between I = 4 and I = 10 as well as between I = 10 and I = 20 were significant in the ORS class: M4 (0.544) > M10 (0.395) > M20 (0.282), ERS class: M4 (0.584) > M10 (0.515) > M20 (0.393), MRS class: M4 (0.708) > M10 (0.532) > M20 (0.384), and ARS class: M4 (0.601) > M10 (0.425) > M20 (0.326). For the ERS class, the levels of the type of mixture differed from each other as follows: MOEMA (0.312) < MOEM (0.510) = MOE (0.510).

4.5.3. Evaluation of the Correlation

As was found in the factorial ANOVA on the RMSE in Section 4.5.2, the test length was the significant factor on the correlation between generated and estimated person trait parameters in the ORS class (F(2,25) = 52.36, p < .001; η² = 0.35), ERS class (F(2,10) = 5.28, p < .001; η² = 0.20), MRS class (F(2,9) = 166.97, p < .001; η² = 0.61), and ARS class (F(2,4) = 3859.53, p < .001; η² = 0.72).

Table 45. Effect Size (η²) for the Correlation of Theta Estimates Conditional on Statistical Significance (p < 0.05)

Factor    ORS    ERS    MRS    ARS
Item      0.35   0.20   0.61   0.72

Based on the Tukey HSD test (α_FW = .05), the correlation differences between I = 4 and I = 10 as well as between I = 10 and I = 20 were significant in the ORS, MRS, and ARS classes, whereas in the ERS class only the difference between I = 10 and I = 20 was significant. ORS class: M4 (0.832) < M10 (0.911) < M20 (0.958); ERS class: M4 (0.877) = M10 (0.901) < M20 (0.935); MRS class: M4 (0.328) < M10 (0.772) < M20 (0.899); and ARS class: M4 (0.757) < M10 (0.890) < M20 (0.946).

4.5.4. Impact of misclassification on person trait estimation

To test the impact of misclassification on person trait parameter recovery, the discrepancies in the RMSE and correlation measures between the whole and selected groups were tested. A paired t-test was conducted for each latent class on the marginal discrepancies over all manipulated factors. Table 46 and Table 47 present the descriptive statistics of the RMSEs and the correlations, respectively, for the whole and selected groups. The results of the paired t-tests are presented in Table 48. The effect size was evaluated using Cohen's d (d = mean difference / standard deviation of the mean difference), which indicates a small effect size if d > 0.2, a medium effect size if d > 0.5, or a large effect size if d > 0.8. In Table 48, Cohen's d is presented when the mean difference is statistically significant at p < .05.

Table 46. Cell Means of the RMSE of Theta Estimates

Type of mixture  Group     N   Mean   SD
ORS              Whole     60  0.386  0.119
ORS              Selected  60  0.361  0.105
ERS              Whole     34  0.478  0.126
ERS              Selected  34  0.406  0.113
MRS              Whole     31  0.495  0.136
MRS              Selected  31  0.472  0.135
ARS              Whole     20  0.425  0.114
ARS              Selected  20  0.428  0.116

Table 47. Cell Means of the Correlation of Theta Estimates

Type of mixture  Group     N   Mean   SD
ORS              Whole     61  0.911  0.060
ORS              Selected  61  0.919  0.054
ERS              Whole     34  0.911  0.035
ERS              Selected  34  0.923  0.039
MRS              Whole     30  0.772  0.219
MRS              Selected  30  0.790  0.222
ARS              Whole     20  0.882  0.083
ARS              Selected  20  0.882  0.074

Table 48. Paired t-test Results on the Impact of Misclassification on Theta Recovery

Type of mixture  Evaluation measure  t       df  p     Cohen's d
ORS              RMSE                4.204   59  .000  0.54
ORS              Correlation         -4.079  60  .000  0.49
ERS              RMSE                2.838   33  .008  0.38
ERS              Correlation         -3.522  33  .001  0.17
MRS              RMSE                2.107   30  .044  0.52
MRS              Correlation         -3.108  29  .004  0.60
ARS              RMSE                -0.755  19  .460
ARS              Correlation         0.131   19  .897

As can be seen in Table 48, the impact of misclassification was statistically significant for all response-style classes except the ARS class. Theta recovery was not impacted for the ARS class because its classification accuracy was high. The effect size was generally at a medium level except for the correlation for the ERS class.

4.6. Model-based Correction of Score Bias

Figure 20 depicts the relation between the sum score and the estimated θ for each class of the 3-response-style mixture. The data for this figure were obtained from the condition with equal mixing proportions, 10 items, and a sample size of N = 6000. This figure shows how the θ estimates of the MPCM provide a tool to correct the sum score bias due to response styles. For example, if a respondent's θ level is above the mean (i.e., θ > 0) and he or she belongs to the ERS class, his or her estimated θ is lower than when he or she belongs to the ORS class. Since the sum score is likely to be inflated by his or her endorsement of a higher extreme category, his or her θ
should be adjusted downward to correct the inflated sum score. Conversely, if a respondent's θ level is below the mean (i.e., θ < 0) and he or she belongs to the ERS class, the estimated θ is higher than when he or she belongs to the ORS class. Because he or she would have selected a lower extreme response category more often, his or her estimated θ should be adjusted upward to compensate for the deflated sum score. If a respondent with a θ level higher than the mean belongs to the MRS class, his or her estimated θ is higher than when he or she belongs to the ORS class. His or her tendency to select middle categories would have deflated the sum score, and, therefore, the correction is made to compensate for the score lost due to the response tendency. Conversely, if a respondent's θ level is below the mean and he or she belongs to the MRS class, the estimated θ is lower than when he or she belongs to the ORS class. Because he or she would have selected the middle category despite his or her lower θ level, his or her estimated θ should be adjusted downward to correct the inflated sum score.

Figure 20. Theta estimates as a function of sum score for ORS, ERS, and MRS Class

Figure 21 represents the relation between sum scores and estimated θ under the 4-response-style mixture. In this figure, a function for the ARS class was added. It appears that the correction for the ARS class is very similar to the correction for the MRS class. This is understandable because the ARS responses were generated assuming a balanced scale that was intended to cancel out an ARS respondent's directional choice of response categories. Although ARS respondents endorse higher extreme categories only, the use of a balanced scale causes the sum scores to regress toward the mean score.

Figure 21. Theta estimates as a function of sum score for ORS, ERS, MRS, and ARS Class

The plots of the relations for the ORS-ERS, ORS-MRS, and ORS-ARS classes in the 2-response-style mixtures are not presented because the relations appeared the same as those depicted in Figures 20 and 21.

Chapter 5: Discussion

The primary goal of the current study was to investigate the performance of the mixture distribution polytomous Rasch model in accurately recovering model parameters under heterogeneous population conditions in which people differed in their response styles, or individual tendencies in responding to the formal aspects of rating scales. The current study examined the mixture polytomous Rasch model with two, three, and up to four latent classes, within each of which a different response style was manifested. One of the latent classes was simulated to represent ordinary response style (ORS), which did not manifest a distorted use of the response categories of a rating scale. The rest of the latent classes were characterized by one of the following distorted response styles: extreme response style (ERS), middle-category response style (MRS), and acquiescent response style (ARS).

Response styles have been recognized as a source of systematic measurement bias. Ignoring or failing to adequately account for the impact of response styles in latent trait measurement leads to various psychometric problems such as invalidating test score differences at both individual and group levels, inflating test reliability, obscuring structural relations among psychological constructs of interest, and confounding the interpretation of the findings in comparative studies.

As a model-based approach to control for these adverse effects, mixture polytomous Rasch models, particularly the mixture partial credit model (MPCM), have been increasingly applied in empirical research where ordered polytomous item responses were analyzed. The MPCM was suggested as a method for classifying people according to their response styles when the model was proposed by Rost (1991).
Cumulative results from previous studies have shown that respondents who share the ERS constitute a latent class while the other class(es) is often composed of respondents with a non-extreme response style. Once different response styles are detected within different latent classes, the subsequent analysis of a psychological construct of interest can be conducted under the control of response styles. It is promising that the application of the MPCM has potential for a better estimation of person trait as well as a better prediction of relevant criteria.

In addition, the MPCM is a flexible modeling framework in that the nature of the latent classes does not need to be known a priori. Questions such as "What are the types of response styles manifested in this data set?" and "Which response style do most people present in this group?" are explored and answered during the course of the MPCM analysis. Although previous empirical studies have detected relatively simple structures of the mixture of response styles, i.e., mostly a combination of ERS and another style characterized as a rather moderate response style (perhaps MRS or ORS), this flexibility of the MPCM extends the potential for identifying more diverse response styles that might exist in a data set.

There is a need for a simulation study to evaluate the performance of the MPCM, including accurate recovery of the model parameters, thereby assessing the soundness of applying the MPCM to account for the various types of response style effects that may be present in real-world testing situations. Little is known thus far, however, regarding the accuracy of model parameter recovery in the MPCM and the testing conditions that can exert an influence on the model parameter recovery.
The current simulation study, therefore, focused on the evaluation of: i) the accuracy of recovering class membership, threshold parameters, and person trait parameters in various testing conditions, and ii) the model-based correction of score bias due to response styles. Of particular importance, the current study included more complex and realistic mixture structures in which multiple classes of ORS, ERS, MRS, and ARS coexist. The following sections include a summary of the findings, a discussion of the important issues surrounding interpretation of the MPCM results, recommendations for applied researchers, as well as limitations of the current study and implications for future research.

5.1 Summary of Findings

Non-convergence and boundary threshold estimates. Estimation problems were examined as a preliminary analysis of the simulation results. First, the rate of non-convergence, which may very well be indicative of problems in model identifiability and instability of parameter estimates, was obtained. This non-convergence rate was 0 % for 80 out of 90 simulation conditions when the data sets were estimated with the correct model, i.e., the data generation model. The other 10 simulation conditions showed non-convergence rates ranging between 1 % and 9 %. The ORS-ERS mixture conditions never encountered non-convergence, while the highest rate of non-convergence occurred under the ORS-ARS mixture conditions. When the generated data set was under-parameterized, non-convergence problems occurred only for two conditions of the 4-response-style mixtures. Conversely, when the generated data were over-parameterized, thirty-five out of ninety simulation conditions showed non-convergence rates ranging between 1 % and 20 %. A high percentage of these problems occurred with the ORS-MRS mixtures.

Boundary threshold estimates were also monitored and screened. Extreme thresholds above 9.0 or below -9.0 in the mdltm outputs were filtered out.
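The screening rule just described can be sketched as follows. The cutoff of 9.0 in absolute value comes from the text; everything else is an assumption for illustration, including the function names, the data layout, and the choice to drop a whole replication when any of its thresholds is at the boundary (the study does not state whether screening was done per threshold or per replication).

```python
BOUNDARY = 9.0  # cutoff reported in the text; thresholds beyond +/- 9.0 are flagged


def has_boundary_estimate(thresholds, bound=BOUNDARY):
    """True if any threshold estimate exceeds the bound in absolute value."""
    return any(abs(t) > bound for t in thresholds)


def screen_replications(replications, bound=BOUNDARY):
    """Drop replications with boundary estimates and report the flagged rate.

    Assumption: a replication is removed as a whole if any of its threshold
    estimates is at the boundary.
    """
    kept = [rep for rep in replications if not has_boundary_estimate(rep, bound)]
    rate = 1 - len(kept) / len(replications) if replications else 0.0
    return kept, rate
```

A call such as `screen_replications(all_thresholds)` would return the retained replications together with the proportion flagged, which is the kind of percentage reported for the boundary estimate problems.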
Boundary estimates never occurred when the 2-response-style data sets were under-parameterized. When the generated data sets were correctly parameterized, the occurrence of boundary estimates was closely related to the sample size and, more specifically, to the expected category frequencies. High rates of boundary estimates, ranging between 48% and 96%, occurred mostly for response categories in the MRS and ARS classes for which the expected response frequency was essentially zero. Nearly all of the simulation conditions produced boundary threshold estimates when the data sets were over-parameterized.
As a result of checking for non-convergence and boundary threshold estimates, ten simulation conditions were removed from the design. These excluded conditions were associated with a sample size of N = 1200 and unequal mixing proportions (except for one condition with four response styles and equal mixing proportions). These results suggest that implausible threshold values in an empirical data analysis may indicate over-parameterization (i.e., estimating a model with too many latent classes), a sample size that is insufficient for estimating the parameters of a given model, or a combination of the two.
Label switching. The current study tackled the label switching problem by jointly applying two algorithmic approaches, each of which uses a different source of information. The first algorithm, developed by the author, used the characteristic ordering of the four thresholds in each response-style class. The second approach, developed by Tueller et al. (2011), used the respondent classification results. Combining these two algorithms improved the efficiency of the automated process of detecting and correcting switched labels. Thirteen simulation conditions turned out to have a large proportion of replications in which switched labels remained unresolved.
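To make the idea of correcting switched labels from classification results concrete, the following Python sketch relabels the estimated classes with the permutation that agrees best with the generating class membership. It is a generic illustration only: it is neither the author's threshold-order algorithm nor the Tueller et al. (2011) procedure, and the function name and the integer coding of the classes are assumptions made for this example.

```python
# Hypothetical sketch: resolve label switching in one simulated replication by
# searching all permutations of the class labels. Classes are assumed to be
# coded 0..n_classes-1 in both the generating and the estimated memberships.
from itertools import permutations


def best_relabeling(true_labels, est_labels, n_classes):
    """Return (mapping, agreement), where mapping[e] is the generating-class
    label assigned to estimated class e and agreement is the proportion of
    respondents whose relabeled class matches the generating class."""
    best_perm, best_hits = None, -1
    for perm in permutations(range(n_classes)):
        hits = sum(perm[e] == t for t, e in zip(true_labels, est_labels))
        if hits > best_hits:  # ties keep the first permutation found
            best_perm, best_hits = perm, hits
    return best_perm, best_hits / len(true_labels)


if __name__ == "__main__":
    true_labels = [0, 0, 1, 1, 2, 2]
    est_labels = [1, 1, 2, 2, 0, 0]  # same partition, labels rotated
    print(best_relabeling(true_labels, est_labels, 3))
```

With at most four latent classes there are only 24 permutations, so an exhaustive search is cheap. A replication in which even the best permutation yields low agreement would correspond to the unresolved cases described above, although the criterion the study used to declare a case unresolved is not reproduced here.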
There was a great deal of overlap between the cases in which switched labels were not corrected and the cases in which the BIC and CAIC were unable to identify the data generation model. A close investigation of this overlap provided a better understanding of the hidden structures of the subpopulation distributions as well as the capabilities and limitations of the MPCM in modeling such population heterogeneity.
Model selection. Among the three information criteria, AIC, BIC, and CAIC, the BIC and CAIC performed nearly equally well in identifying the data generation model, with slightly higher accuracy for the BIC. Across all simulation conditions, the AIC showed high rates of over-identification of the number of latent classes. Based on the current simulation results, the AIC is not recommended for model selection under the MPCM.
In general, the BIC most accurately identified the correct number of latent classes in the MPCM. Under the simulation conditions in which neither estimation problems nor unresolved label switching occurred, the data generation model was identified 100% of the time based on the BIC. The simulation conditions in which the BIC did not perform perfectly were associated with at least one of the following: i) a test length of I = 4, ii) a sample size of N = 1200, and iii) unequal mixing proportions.
Classification accuracy. Generally, the ORS-ARS mixtures allowed for accurate classification under all simulation conditions, while the ORS-MRS mixtures yielded the least accurate classification of respondents, followed by the ORS-ERS mixtures. Misclassification of ERS respondents into the MRS class (EM) and misclassification of MRS respondents into the ERS class (ME) rarely occurred. In addition to EM and ME, the rates of OA, MA, AO, and AM were also essentially zero.
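The two-letter codes used above (e.g., EM for ERS respondents placed in the MRS class) can be tabulated from generating and estimated class memberships. The Python sketch below shows one way this might be done; the function names and the single-letter coding O, E, M, A for the four response styles are assumptions for illustration and do not reproduce the study's own scripts.

```python
# Hypothetical sketch: classification accuracy and pairwise misclassification
# rates. Each respondent carries a generating style and an estimated style,
# coded 'O' (ORS), 'E' (ERS), 'M' (MRS), or 'A' (ARS).
from collections import Counter


def classification_rates(true_styles, est_styles):
    """Return {code: rate}, where code is generating style + estimated style
    (e.g. 'EM') and rate is the share of that generating class assigned to
    the estimated class. Pairs that never occur are simply absent."""
    pair_counts = Counter(zip(true_styles, est_styles))
    class_sizes = Counter(true_styles)
    return {t + e: n / class_sizes[t] for (t, e), n in pair_counts.items()}


def overall_accuracy(true_styles, est_styles):
    """Proportion of respondents assigned to their generating class."""
    hits = sum(t == e for t, e in zip(true_styles, est_styles))
    return hits / len(true_styles)


if __name__ == "__main__":
    true_styles = list("OOOOEEMM")
    est_styles = list("OOOMEEMO")  # made-up assignments for illustration
    print(classification_rates(true_styles, est_styles))
    print(overall_accuracy(true_styles, est_styles))
```

In this made-up example one of four ORS respondents is placed in the MRS class (OM = 0.25) and one of two MRS respondents in the ORS class (MO = 0.5), giving an overall accuracy of 0.75.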
The most important factor influencing respondent classification accuracy was test length. The effect size of test length was extraordinarily large (η² = 0.42). Under the least complex, 2-response-style mixtures, when the test length was I = 4, the ORS-ERS, ORS-MRS, and ORS-ARS mixtures yielded average classification accuracy rates of 81%, 70%, and 93%, respectively. As the number of items increased to I = 10, the average classification accuracy increased to 94%, 87%, and 98%, respectively, and for a test length of I = 20 it reached 98%, 94%, and almost 100%, respectively. Under the 3-response-style mixtures, as the test length increased from I = 4 to I = 10 and then from I = 10 to I = 20, the average classification accuracy improved from 73% to 87% and then to 95%. Under the most complex, 4-response-style mixtures, classification accuracy was 89% and 95% when I = 10 and I = 20, respectively. Significant interaction effects were mainly due to the outstanding classification accuracy for the ARS class even under the I = 4 condition.
Threshold recovery. Generally, as the sample size increased from N = 1200 to N = 3000 and then to N = 6000, threshold recovery tended to become more accurate. While increasing the test length from I = 4 to I = 10 improved threshold recovery significantly, increasing it from I = 10 to I = 20 did not result in a significant difference. Threshold recovery for the ARS class was quite accurate even under the I = 4 condition and, consequently, test length was not found to be an influential factor for this class. When the distorted response styles (i.e., ERS, MRS, and ARS) were present in a small proportion of the sample, threshold recovery was significantly less accurate for those small latent classes. ORS thresholds were most accurately recovered under the ORS-ARS mixtures and least accurately recovered under the ORS-ERS-MRS mixtures.
Therefore, it is not necessarily true that the thresholds of a more complex model are recovered less accurately. The standard errors of the threshold estimates increased dramatically for the models with three response-style classes when the test length was I = 4.
Person trait recovery. The factor that most affected the accuracy of θ recovery was test length. The accuracy of θ recovery in each response-style class increased as the test length increased. Overall, the person trait θ was well recovered when the test length was I = 10 or I = 20. A sample size of N = 1200 yielded relatively lower correlations between generated and estimated θ parameters. Across the three levels of test length, the mean RMSE ranged from 0.28 to 0.54 for the ORS class, 0.39 to 0.58 for the ERS class, 0.38 to 0.53 for the MRS class, and 0.33 to 0.43 for the ARS class. The mean correlations ranged from 0.83 to 0.96 for the ORS class, 0.88 to 0.94 for the ERS class, 0.77 to 0.90 for the MRS class, and 0.76 to 0.95 for the ARS class.
When the accuracy of θ recovery was computed only for correctly classified respondents, accuracy was always higher than when it was computed for all respondents, including misclassified cases. The discrepancies in accuracy between the all-respondent group and the correctly classified group were tested. The results of the paired t-test showed a statistically significant impact of misclassification on person trait estimation.
Correction of score bias. In empirical rating scale data, a respondent's sum score may be biased if his or her particular response style operates while responding to the response categories. The most practical benefit of employing the MPCM is that sum scores that might have been biased by the confounding effects of response styles can be corrected through the class-specific estimates of θ. The current study showed that the MPCM provides θ estimates that are corrected for the sum score bias caused by the different response styles. In general, the inflated scores that occurred for ERS respondents with a higher θ level and for MRS and ARS respondents with a lower θ level were adjusted downward, whereas the deflated scores that occurred for ERS respondents with a lower θ level and for MRS and ARS respondents with a higher θ level were adjusted upward.
5.2 Discussion
The current study showed that the model parameters of the MPCM were recovered well and that classification accuracy was reasonably high. Of particular importance, rather complex mixture structures in which up to four different response-style subpopulations were mixed appeared to be reasonably well modeled by the MPCM under the simulated testing conditions considered in this study. This observed model performance supports the potential utility of the model in real-world data analysis situations where hidden subpopulations that differ from each other with respect to response styles may exist.
Previous empirical studies have shown the utility of this mixture modeling approach in various fields of study, including personality, organizational, and clinical psychology. The latent groupings identified in those studies could be attributed to social desirability, faking, structural differences, and different response styles. By examining the threshold plots for each estimated latent class and analyzing the content of the items on which the latent classes differ, there seems to be potential for new findings and insights into psychological constructs beyond the presence of response styles.
Testing conditions and MPCM performance.
The preliminary examinations of the estimation issues and label switching solutions, together with the model selection analysis, provided coherent information about the structures of the response-style mixture distributions and the testing conditions that allowed the MPCM to deal adequately with response style problems. The more pronounced the differences in response styles across latent classes, the easier it was for the MPCM to detect them. Thus, the structural differences in the thresholds between the ORS and ARS classes appeared to be more easily identified than those between ORS and ERS, while the differences between ORS and MRS were the most difficult to distinguish. As the structural differences became harder to detect, higher rates of boundary estimates and unresolved label switching, as well as lower rates of correct model selection based on the BIC, were observed.
When the nature of the response-style mixture distribution posed a challenge for parameter estimation, a larger sample size and/or a larger number of test items was required for reasonable parameter estimation. The current simulation study showed that when the test length was I = 10 and the sample size was N = 3000, the MPCM performed fairly well in recovering model parameters for the most complex 4-response-style mixtures with equal proportions. The MPCM performance under this mixture distribution and these testing conditions was as follows: i) the correct model selection rate based on the BIC was 100%, ii) classification accuracies were 84%, 94%, 88%, and 95%, iii) the mean RMSEs of the four thresholds were 0.15, 0.19, 0.25, and 0.25, iv) the mean correlations for the four thresholds were 0.85, 0.87, 0.89, and 0.99, v) the mean SEs of the four thresholds were 0.06, 0.07, 0.10, and 0.09, vi) the biases of the θ estimates were -0.01, -0.01, 0.00, and 0.00, vii) the RMSEs of θ were 0.41, 0.54, 0.48, and 0.44, and viii) the correlations of θ were 0.92, 0.90, 0.82, and 0.88, for the ORS, ERS, MRS, and ARS classes, respectively.
Based on the findings of the current study, some recommendations are offered for applied researchers. Regarding the common measurement questions "How large should the sample size be?" and "How many items should be asked?", 10 items with a 5-category Likert scale and 3000 respondents were found to warrant reasonably accurate parameter estimation and respondent classification when up to four different response styles among ORS, ERS, ARS, and MRS were present in a data set in equal proportions. If the data being analyzed include fewer types of response styles, the same level of parameter estimation and respondent classification could be achieved with fewer than 3000 respondents. If the relative sizes of the response-style groups are unequal, more than 3000 respondents may be needed to achieve the same level of accuracy.
Comparisons of person trait across latent classes. One of the arguments raised in the mixture IRT domain is whether person trait θ estimates obtained from different classes can be legitimately compared based on their magnitudes. This argument revolves around the notion that the continuous variable measured within each class is qualitatively different in mixture IRT models. As discussed by Rost et al. (1997), such comparisons could certainly be problematic if the profiles of the item locations (i.e., the means of the thresholds) were substantially different across latent classes. This would indicate that people in different classes present different cognitive structures or psychological constructs. In these cases, since the questionnaire could not claim to measure the same trait in different populations, trait estimates obtained through the questionnaire could not be used to compare respondents across the latent response-style classes.
When the item difficulties were very much the same across latent classes, however, what distinguished the latent classes was the dispersion of item responses, not the difficulty of an item (Rost et al., 1997). When this condition holds, the comparison of person traits across classes can be justified because the class-specific θ values only adjust for the effects of the class-specific dispersion of responses. In practice, item location profiles should be checked across latent classes before attempting any interpretation of latent class differences. If the item location profiles of the classes differ significantly, the differences across latent classes may be better characterized in terms of certain latent traits rather than response styles.
Correction of score bias and predictability. The current study showed that the MPCM provided corrected θ estimates that clearly differentiated the effects of different response styles. Given that the model provides this alternative, "purified" score for each response style, an important issue is whether using the "purified" θ improves the prediction of relevant criterion variables. This idea was addressed by Maij-de Meij et al. (2008). Improved predictability is a question that awaits an answer from empirical research in various fields. The current simulation provides results that help build a foundation upon which this practical utility of the MPCM can be further investigated by applied researchers.
5.3 Limitations of the current study and implications for future research
The current study included extreme simulation conditions with the intention of exploring possible limitations of MPCM performance.
The combination of a test length of I = 4, a sample size of N = 1200, and unequal mixing proportions that allowed only 10% of the respondents to be members of the smaller classes made for highly challenging conditions for achieving good parameter estimation in the context of mixture distribution polytomous IRT modeling. While setting up these extreme conditions helped reveal some limitations in the application of the MPCM, it caused several cell means to be unavailable, which limited the interpretation of the factorial ANOVA results regarding the effects of the manipulated factors.
The interpretation of the current results involving acquiescent responses should be limited to testing situations in which a well-constructed balanced scale is used. From a methodological perspective, the current simulation results are meaningful in that this aberrant response behavior could possibly be controlled through the use of a balanced scale and the MPCM. The results showed that the ARS respondents were almost perfectly differentiated from other types of respondents and received a corrected θ similar to what MRS respondents would receive. However, whether the corrected θ carries the same psychological meaning for this group of respondents is a question that calls for informed judgment by experts in the content area in which the psychological test results would be scrutinized.
The generated item locations within each class had small variability in the current study. In the MPCM, between-class variability not only in the order of thresholds and threshold distances but also in the item locations among test items may contribute to the recovery accuracy of the parameters (e.g., Rost, 1991). This small between-group variability in item locations might have contributed positively or negatively to the parameter recovery results of this study.
In this study, polytomous item responses obtained with 5-category Likert scale items were used.
Previous research has shown that parameter recovery in the partial credit model differs depending on the number of categories on the rating scale. The simulation results could therefore be different if a different number of response categories were used. The effects of the variability in item locations within latent classes and the effects of different numbers of response categories warrant further study.
Future studies can also explore mixture distribution IRT models other than the Rasch family of models. Researchers have pointed out that the equal discrimination assumption of Rasch models can easily be violated in real data analysis situations. Extending other polytomous IRT models to mixture distributions has the potential to give researchers a more complete view of hidden structural differences, including personality or cognitive constructs, faking and social desirability tendencies, non-invariant items, and response styles. Empirical studies need to be conducted to evaluate whether trait estimates from mixture IRT models, corrected for the confounding effects of different response styles, can improve the prediction of criterion variables in various areas of social and behavioral research.
Appendix A
Table A.1. Category probabilities for individual items for ERS class
Item    Category 1   Category 2   Category 3   Category 4   Category 5
1       0.3734       0.1065       0.0402       0.1065       0.3734
2       0.3829       0.0880       0.0582       0.0880       0.3829
3       0.4120       0.0614       0.0532       0.0614       0.4120
4       0.4173       0.0675       0.0306       0.0675       0.4173
5       0.3956       0.0800       0.0488       0.0800       0.3956
6       0.3958       0.0914       0.0257       0.0914       0.3958
7       0.4363       0.0514       0.0247       0.0514       0.4363
8       0.4037       0.0777       0.0370       0.0777       0.4037
9       0.3727       0.1020       0.0506       0.1020       0.3727
10      0.4069       0.0785       0.0293       0.0785       0.4069
Mean    0.3997       0.0804       0.0410       0.0804       0.3997
Table A.1. Category probabilities for individual items for MRS class
Item    Category 1   Category 2   Category 3   Category 4   Category 5
1       0.0254       0.0889       0.7713       0.0889       0.0254
2       0.0343       0.1118       0.7079       0.1118       0.0343
3       0.0492       0.0614       0.7788       0.0614       0.0492
4       0.0852       0.0976       0.6345       0.0976       0.0852
5       0.0519       0.0661       0.7640       0.0661       0.0519
6       0.0546       0.0752       0.7405       0.0752       0.0546
7       0.0368       0.1064       0.7136       0.1064       0.0368
8       0.0502       0.1052       0.6892       0.1052       0.0502
9       0.0602       0.1441       0.5913       0.1441       0.0602
10      0.0591       0.1088       0.6642       0.1088       0.0591
Mean    0.0507       0.0966       0.7055       0.0966       0.0507
Table A.1. Category probabilities for individual items for ARS class
Item    Category 1   Category 2   Category 3   Category 4   Category 5
1       0.7136       0.1424       0.1270       0.0124       0.0046
2       0.0046       0.0124       0.1270       0.1424       0.7136
3       0.7669       0.1065       0.0758       0.0421       0.0087
4       0.0087       0.0421       0.0758       0.1065       0.7669
5       0.8269       0.1045       0.0313       0.0205       0.0170
6       0.0170       0.0205       0.0313       0.1045       0.8269
7       0.7102       0.1535       0.0906       0.0420       0.0036
8       0.0036       0.0420       0.0906       0.1535       0.7102
9       0.7338       0.2144       0.0356       0.0096       0.0066
10      0.0066       0.0096       0.0356       0.2144       0.7338
Mean    0.7503       0.1443       0.0721       0.0253       0.0081
References
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716-723.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 42, 69-81.
Andrich, D. (1982). An extension of the Rasch model for ratings providing both location and dispersion parameters. Psychometrika, 47, 105-113.
Austin, E. J., Deary, I. J., & Egan, V. (2006). Individual differences in response scale use: Mixed Rasch modeling of responses to NEO-FFI items. Personality and Individual Differences, 40, 1235-1245.
Austin, E. J., Deary, I. J., Gibson, G. J., McGregor, M. J., & Dent, J. B. (1998). Individual response spread in self-report scales: Personality correlations and consequences. Personality and Individual Differences, 24, 421-438.
Bachman, J. G., & O'Malley, P. M.
(1984). Yes-saying, nay-saying, and going to extremes: Black-white differences in response styles. Public Opinion Quarterly, 48, 491-509.
Baker, F. B., & Kim, S. H. (2004). Item Response Theory: Parameter Estimation Techniques (2nd ed.). New York: Dekker.
Baumgartner, H., & Steenkamp, J. E. M. (2001). Response styles in marketing research: A cross-national investigation. Journal of Marketing Research, 38, 143-156.
Berg, I. A., & Collier, J. S. (1953). Personality and group differences in extreme response sets. Educational and Psychological Measurement, 13, 164-169.
Billiet, J. B., & McClendon, M. J. (2009). Modeling acquiescence in measurement models for two balanced sets of items. Structural Equation Modeling: A Multidisciplinary Journal, 7, 608-628.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39, 331-348.
Bolt, D. M., & Johnson, T. R. (2009). Addressing score bias and differential item functioning due to individual differences in response style. Applied Psychological Measurement, 33, 335-352.
Bolt, D. M., & Newton, J. R. (2011). Multiscale measurement of extreme response style. Educational and Psychological Measurement, 7, 814-833.
Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-370.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.
Brown, S. W., Garry, M., Silver, B., & Loftus, E. (1997). Conceptions and misconceptions of what and how we remember: Survey results. Paper presented at the annual conference of the American Psychological Society, Washington, DC.
Buckley, J.
(2009). Cross-national response styles in international educational assessment: Evidence from PISA 2006. NCES Conference on the Program for International Student Assessment: What we can learn from PISA. (Downloaded from http://edsurvey.rti.org/PISA/ on January 12, 2012).
Chen, C., Lee, S.-Y., & Stevenson, H. W. (1995). Response style and cross-cultural comparisons of rating scales among East Asian and North American students. Psychological Science, 6, 170-175.
Cheung, G. W., & Rensvold, R. B. (2000). Assessing extreme and acquiescence response sets in cross-cultural research using structural equations modeling. Journal of Cross-Cultural Psychology, 31, 187-212.
Cho, Y., Jiao, H., & Macready, G. (2012a). Assessing the effects of different item parameter profiles in mixture Rasch models. Paper presented at the annual meeting of the American Educational Research Association, Vancouver, Canada.
Cho, Y., Jiao, H., & Macready, G. (2012b). Simultaneous effects of different item discrimination profiles and item difficulty profiles in mixture 2PL models. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, Canada.
Cronbach, L. J. (1946). Response set and test validity. Educational and Psychological Measurement, 6, 475-494.
Clarke III, I. (2000). Extreme response style in cross-cultural research: An empirical investigation. Journal of Social Behavior and Personality, 15, 137-152.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Routledge.
Costa, P. T., & McCrae, R. R. (1992). Revised NEO Personality Inventory and NEO Five Factor Inventory. Professional Manual. Odessa, Florida: Psychological Assessment Resources Inc.
Couch, A., & Keniston, K. (1960). Yeasayers and naysayers: Agreeing response set as a personality variable. Journal of Abnormal and Social Psychology, 60, 151-174.
De Ayala, R. J. (2009). The theory and practice of item response theory. New York: Guilford Press.
De Jong, M.
G., Steenkamp, J.-B. E. M., Fox, J.-P., & Baumgartner, H. (2008). Using item response theory to measure extreme response style in marketing research: A global investigation. Journal of Marketing Research, 45, 104-115.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Diener, E., Emmons, R. A., Larsen, R. J., & Griffin, S. (1985). The satisfaction with life scale. Journal of Personality Assessment, 49, 71-75.
Eid, M., & Rauber, M. (2000). Detecting measurement invariance in organizational surveys. European Journal of Psychological Assessment, 16, 20-30.
Egberink, I. J. L., Meijer, R. R., & Veldkamp, B. P. (2010). Conscientiousness in the workplace: Applying mixture IRT to investigate scalability and predictive validity. Journal of Research in Personality, 44, 232-244.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
Fahrenberg, J., Hampel, R., & Selg, H. (1989). Das Freiburger Persönlichkeitsinventar FPI [Freiburg Personality Inventory FPI] (5th ed.). Göttingen, Germany: Hogrefe.
Gollwitzer, M., Eid, M., & Jürgensen, R. (2005). Response styles in the assessment of anger expression. Psychological Assessment, 17, 56-69.
Greenleaf, E. A. (1992). Measuring extreme response style. Public Opinion Quarterly, 56, 328-351.
Hofstede, G. (1980). Culture's Consequences: International Differences in Work-Related Values. London: SAGE Publications.
Hui, C. H., & Triandis, H. C. (1989). Effects of culture and response format on extreme response style. Journal of Cross-Cultural Psychology, 20, 296-309.
Johnson, T., Kulesa, P., Cho, Y. I., & Shavitt, S. (2005). The relation between culture and response styles: Evidence from 19 countries. Journal of Cross-Cultural Psychology, 36, 264-277.
Kulas, J. T., & Stachowski, A. A. (2008).
Middle category endorsement in odd-numbered Likert response scales: Associated item characteristics, cognitive demands, and preferred meanings. Journal of Research in Personality, 43, 489-493.
Harzing, A.-W. (2006). Response styles in cross-national survey research: A 26-country study. International Journal of Cross Cultural Psychology, 20, 296-309.
Lau, A. (2009). Using a mixture IRT model to improve parameter estimates when some examinees are amotivated (Unpublished doctoral dissertation). James Madison University, Harrisonburg, VA.
Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston, MA: Houghton Mifflin.
Lee, C., & Green, R. T. (1991). Cross-cultural examination of the Fishbein behavioral intentions model. Journal of International Business Studies, 25, 289-305.
Lee, J. W., Jones, P. S., Meneyama, Y., & Zhang, X. E. (2002). Cultural difference in responses to a Likert scale. Research in Nursing & Health, 25, 295-306.
Lewis, N. A., & Taylor, J. A. (1955). Anxiety and extreme response preference. Educational and Psychological Measurement, 15, 111-116.
Li, F., Cohen, A. S., Kim, S.-H., & Cho, S.-J. (2009). Model selection methods for mixture dichotomous IRT models. Applied Psychological Measurement, 33(5), 353-373.
Likert, R. (1932). A technique for the measurement of attitude. Archives of Psychology, 140, 1-55.
Liu, Y., Wu, A. D., & Zumbo, B. D. (2010). The impact of outliers on Cronbach's coefficient alpha estimate of reliability: Ordinal/rating scale item responses. Educational and Psychological Measurement, 70, 5-21.
Maij-de Meij, A. M., Kelderman, H., & van der Flier, H. (2005). Latent-trait latent-class analysis of self-disclosure in the work environment. Multivariate Behavioral Research, 40, 435-459.
Maij-de Meij, A. M., Kelderman, H., & van der Flier, H. (2008). Fitting a mixture item response theory model to personality questionnaire data: Characterizing latent classes and investigating possibilities for improving prediction.
Applied Psychological Measurement, 32, 611-631.
Marin, G., Gamba, R. J., & Marin, B. V. (1992). Extreme response style and acquiescence among Hispanics: The role of acculturation and education. Journal of Cross-Cultural Psychology, 23, 498-509.
Masters, G. N. (1984). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.
Meiser, T., & Machunski, M. (2008). The personal structure of personal need for structure: A mixture-distribution Rasch analysis. European Journal of Psychological Assessment, 24, 27-34.
Messick, S. (1991). Psychology and methodology of response styles. In Snow & Wiley (Eds.), Improving Inquiry in Social Science: A Volume in Honor of Lee J. Cronbach (pp. 161-200). Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Mirowsky, J., & Ross, C. E. (1991). Eliminating defense and agreement bias from measures of the sense of control: A 2 × 2 index. Social Psychology Quarterly, 54, 127-145.
Mislevy, R. J., & Verhelst, N. D. (1990). Modeling item responses when different subjects employ different solution strategies. Psychometrika, 55, 195-215.
Molenaar, I. W. (1995). Some background for item response theory and the Rasch model. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent development, and application (pp. 3-14). New York: Springer.
Moors, G. (2003). Diagnosing response style behavior by means of a latent-class factor approach. Socio-demographic correlates of ethnic discrimination reexamined. Quality & Quantity, 37, 277-302.
Moors, G. (2004). Facts and artefacts in the comparison of attitudes among ethnic minorities. A multigroup latent class structure model with adjustment for response style behavior. European Sociological Review, 20, 303-320.
Moors, G. (2008). Exploring the effect of a middle response category on response style in attitude measurement. Quality & Quantity, 42, 779-794.
Norman, R. P. (1969).
Extreme response tendency as function of emotional adjustment and stimulus ambiguity. Journal of Consulting and Clinical Psychology, 33, 406-410.
Nunnally, J. C. (1978). Psychometric Theory (2nd ed.). New York: McGraw-Hill.
Paulhus, D. L. (1991). Measurement and control of response bias. In Robinson, Shaver, & Wright (Eds.), Measures of Personality and Social Psychological Attitudes (pp. 17-59). San Diego, CA: Academic Press.
Preinerstorfer, D., & Formann, A. K. (2012). Parameter recovery and model selection in mixed Rasch models. British Journal of Mathematical and Statistical Psychology, 65, 251-262.
Rasch, G. (1960/80). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B. D. Wright. Chicago: The University of Chicago Press.
Reise, S. P., & Gomel, J. N. (1995). Modeling qualitative variation within latent trait dimensions: Application of mixed-measurement to personality assessment. Multivariate Behavioral Research, 30, 341-358.
Ross, C. E., & Mirowsky, J. (1984). Socially desirable responses and acquiescence in a cross cultural survey of mental health. Journal of Health and Social Behavior, 25, 189-197.
Rost, J. (1988). Measuring attitudes with a threshold model drawing on a traditional scaling concept. Applied Psychological Measurement, 12, 397-409.
Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
Rost, J. (1991). A logistic mixture distribution model for polytomous item responses. British Journal of Mathematical and Statistical Psychology, 44, 75-92.
Rost, J. (1997). Logistic mixture models. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 449-463). New York: Springer.
Rost, J., Carstensen, C., & von Davier, M. (1997). Applying the mixed Rasch model to personality questionnaires. In J. Rost & R.
Langeheine (Eds). Application of latent trait and latent class models in the social sciences (pp. 324-332). M?nster, Germany: Waxmann. Schmitt, M. J., & Ryan, A. M. (1993). The big five in personnel selection: Factor structure in applicant and non-applicant populations. Journal of Applied Psychology, 78, 966-974. Schwarz, G. (1978). Estimating the dimension of a model. Analysis of Statistics, 6, 461-464. Smith, E. V., Ying, Y., & Brown, S. W. (2012). Using the mixed Rasch model to analyze data from the beliefs and attitudes about memory survey. Journal of Applied Measurement, 13, 23-40. Subedi, D. R. (2010). Investigating unobserved heterogeneity using item response theory mixture models (Unpublished doctoral dissertation). Michigan State University, Lansing, MI. Lau, A. (2009). Using a mixture IRT model to improve parameter estimates when some examinees are amotivated (Unpublished doctoral dissertaion). James Madison Univeristy, Harrisonburg, VA. 179 Spielberger, C. D. (1988). STAXI. State-Trait Anger Expression Inventory.Tempa, FL: Psychological Assessment Resources. Temple, D. E., & Geisinger, K. F. (1990). Response latency to computer-administered inventory items as an indicatory of emotional arousal. Journal of Personality Assessment, 54, 289-297. Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from 6 EU countries. Journal of Cross-Cultural Psychology 35, 346-360. von Davier, M. (2000). WINMIRA 2001 [Computer software]. St. Paul, MN: Assessment Systems. von Davier, M. (2005a). mdltm: Software for the general diagnostic model and for estimating mixture of multidimensional discrete latent traits models [Computer software]. Princeton, NJ: Educational Testing Service. von Davier, M. (2005b). A general diagnostic model applied to language testing data (ETS Research Report No. PR-05-16). Princeton, NJ: Educational Testing Service. von Davier, M., & Rost, J. (1995). 
Polytomous mixed Rasch models. In G.H. Fischer & I.W. Molenaar (Eds). Rasch models: Foundations, recent development, and application (pp.371-379). New York: Springer. Watson, D. (1992). Correcting for acquiescent response bias in the absence of a balanced scale: An application to class consciousness. Sociological Methods and Research, 21, 52-88. 180 Wu, P-C , & Huang, T-W. (2010). Person heterogeneity of the BDI-II-C and its effects on dimensionality and construct validity: Using mixture item response models. Measurement and Evaluation in Counseling and Development, 43, 155-167. Yamamoto, K. Y. (1987). A model that combines IRT and latent class models. Unpublished doctoral dissertation, University of Illinois Urbana-Champaign. Yang, Y., Harkness, J. A., Chin, T-Y., & Villar, A. (2010). Response styles and culture. In J. A., Harkness et al. (Eds). Survey Methods in Multinational, Multiregional, and Multicultural Contexts. John Wiley & Sons, Inc. Zickar, M. J., Gibby, R. E., & Robie, C. (2004). Uncovering faking samples in applicant, incumbent, and experimental data sets: An application of mixed- model item response theory. Organizational Research Methods, 7, 168-190. Zickar, M. J., & Robie, C. (1999). Modeling faking food on personality items: An item-level analysis. Journal of Applied Psychology, 84, 551-563.