ABSTRACT

Title of dissertation: METHODS OF INTEGRATING MULTI-MODAL DATA FOR DETECTING ABERRANT TEST-TAKING BEHAVIORS IN LARGE-SCALE ASSESSMENTS

Kaiwen Man, Doctor of Philosophy, 2020

Dissertation directed by: Professor Jeffrey R. Harring, Department of Human Development and Quantitative Methodology

Many schools, states, and countries use scores from large-scale assessments in making important high-stakes decisions in areas such as college admissions, academic performance evaluations, and job promotions, among others. These decisions rely on accurate, reliable scores from which valid inferences about examinees can be drawn. However, aberrant test-taking behaviors, including copying from other test-takers and practicing with real items ahead of time, undermine the effectiveness of such assessments in yielding accurate, precise information on an examinee's performance. Moreover, with the wide adoption of technology-enhanced online learning and testing systems, especially as this thesis is being written during the outbreak of COVID-19, it is critical to address questions such as "How can online-delivered tests be made more secure?" As a result, investigating ways to identify potential cheaters after these assessments or batteries have been taken and data collected is an important endeavor for the numerous administrators of such assessments. The purpose of this line of research is to create, develop, investigate, and test new approaches that incorporate bio-information technology, such as eye-tracking, into current machine-learning methods for the detection of cheating and other aberrant testing behaviors in computer-based testing scenarios. In other words, cheating detection for innovative large-scale assessments with big data techniques augmented by bio-information technologies will be explored.
Eye-tracking systems, in particular, have the potential to capture cheating and other aberrant test-taking behaviors through visual information gathered from the analysis of eye movement patterns (saccades, fixations, pupil size). This type of data can be subtly gathered in real time on test-takers as they attempt to answer each assessment item. To assess the visual attention nuances across test-takers, three negative binomial distribution-based visual fixation counts models will be presented. Moreover, a joint-modeling approach integrating product data (e.g., item responses), process data (e.g., response times), and biometric information (visual fixation counts) will be demonstrated. By jointly modeling the three types of information, we can assess test-takers' performance in a comprehensive way. Finally, selected supervised and unsupervised statistical learning methods will be explored for detecting different types of responding behaviors.

METHODS OF INTEGRATING MULTI-MODAL DATA FOR DETECTING ABERRANT TEST-TAKING BEHAVIORS IN LARGE-SCALE ASSESSMENTS

by Kaiwen Man

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2020

Advisory Committee:
Professor Jeffrey R. Harring, Chair/Advisor
Professor Hong Jiao
Professor Yang Liu
Professor Donald J. Bolger
Professor Colleen O'Neal

© Copyright by Kaiwen Man 2020

Dedication

I dedicate this dissertation to God Almighty my creator, my solid pillar, my source of wisdom, inspiration, knowledge, love and understanding. "The fear of the Lord is the beginning of wisdom." (Proverbs 1:7)

Acknowledgments

I owe my gratitude to all the people who have made this thesis possible, and because of whom my graduate experience has been one that I will cherish forever.
First and foremost, I would like to thank my academic advisor, Professor Jeffrey Harring, for carrying me along this journey with patience and trust; he has always been supportive through my good and bad times. Since I joined the EDMS family, he has provided me significant opportunities to develop my professional career. My current academic accomplishments, if any, are mainly due to Dr. Harring's tremendous endorsement and supervision. He is not only my mentor but also my role model. I would also like to thank our program director, Dr. Gregory Hancock. Because of his kindness, I had this valuable opportunity to chase my dream of being a professor in America. Because of his unshaken faith in me, I was able to continue my studies after a bumpy ride during my first year in the program. Thank you for being such a wonderful person in my life! Moreover, I would also like to thank Dr. Huahua Chang, Dr. Hong Jiao, Dr. Kadriye Ercikan, Dr. Sandip Sinharay, Dr. Kun Yuan, Dr. Donald Bolger, Dr. Colleen O'Neal, and Dr. Aimo Hinkkanen for their support and guidance. I would also like to acknowledge the help and support of my friends. I would like to thank Dongbo Guo for his encouragement over the years, especially during the difficult times. Thanks to Sarah Thomas for linking me with other professionals working in the test security field, which helped my research tremendously. Thanks to Peida Zhan for his insights and support on many research projects. Thanks to my church friends Xiaofang Wang and Dan for their kind invitations to Thanksgiving and Christmas parties over the years, which gave me a strong sense of belonging during the holidays. Thanks to my church elder Der-Chen Chang for his teachings about God, which bring me true happiness and peace. Thanks to Joseph Feser for his valuable insights and help on many things, such as job interview preparation, thesis proofreading, and career decision-making.
Also, I would like to thank my other friends: Yewon Lee, Yi Wei, Feng Yi, Monica Morell, Daniel Lee, and Tessa Johnson for their help along the journey. Additionally, I owe my deepest thanks to my family - my mother and father who have always stood by me, and have pulled me through against impossible odds at times. Words cannot express the gratitude I owe them. Furthermore, I would like to acknowledge the financial support of the Educational Testing Service, which awarded me the Harold Gulliksen Psychometric Research Fellowship to conduct my dissertation research. It is impossible to remember all, and I apologize to those I've inadvertently left out. Lastly, thank God for His unconditional love, blessings, and infinite mercy upon me.

Table of Contents

Dedication ii
Acknowledgements iii
List of Tables viii
List of Figures ix

1 Introduction 1
  1.1 Limitations of Previous Works 3
  1.2 Conceptual Framework for Current Study 5
  1.3 Research Significance 7

2 Literature Review 9
  2.1 Test Security 9
  2.2 Statistical Methods for Identifying Aberrant Testing Behavior 12
    2.2.1 Similarity Analysis (Collusion, Answer-Copy) 12
      2.2.1.1 Identical Errors Analysis 13
      2.2.1.2 Error Similarity Analysis 14
      2.2.1.3 Generalized Binomial Similarity 16
    2.2.2 Gain Score Analysis 17
    2.2.3 Erasure (Answer Changing) Analysis 18
    2.2.4 Person Fit Analysis 20
      2.2.4.1 Representative Parametric Indices 21
      2.2.4.2 Representative Non-Parametric Indices 25
      2.2.4.3 Representative Response Time Based Index 28
    2.2.5 Use of Data Mining Methods to Detect Test Fraud 30
      2.2.5.1 Unsupervised Machine Learning Methods 31
      2.2.5.2 Supervised Machine Learning Methods 36
  2.3 Incorporating Biometrics to Detect Aberrant Testing Behaviors 45
    2.3.1 Insights into Problem-Solving Using Eye Tracking 45
      2.3.1.1 Fixation 46
      2.3.1.2 Pupil Diameter 47
      2.3.1.3 Blinking 48
      2.3.1.4 Saccades 49
      2.3.1.5 Regression 49
    2.3.2 Representative Ways to Integrate Process Data into Psychometric Methods to Identify Aberrant Test Takers 50
      2.3.2.1 Incorporating RT as a Variable Into the Item Response Model 51
      2.3.2.2 Joint Modeling of Item Responses and Response Times 53
  2.4 Future Directions and Challenges 59
  2.5 Conclusion 61

3 Methodology 62
  3.1 Experimental Design 62
    3.1.1 Data Collection 62
    3.1.2 Experimental Conditions 63
    3.1.3 Data Recording 64
  3.2 New Test Engagement Model Based on Visual Fixation Counts 64
    3.2.1 The Negative Binomial Fixation Model 65
    3.2.2 Negative Binomial Fixation Model with Linear Trend 68
    3.2.3 Negative Binomial Fixation Model with Quadratic Trend 71
  3.3 A Three-Way Joint Modeling Approach of Item Response, Response Time and Fixation Counts 72
    3.3.1 Measurement Models at Level 1 72
    3.3.2 Modeling Item Domain and Person Domain Models at Level 2 73
      3.3.2.1 Modeling Person Domain Parameters 73
      3.3.2.2 Modeling Item Domain Parameters 74
    3.3.3 Model Parameter Estimation 76
    3.3.4 Evaluating Model-Data Fit: Posterior Predictive Model Checking 78
  3.4 Integration of Bio- and Psychometric Information into Machine Learning Methods for Detecting Aberrant Behaviors 83
    3.4.1 Data Normalization 84
    3.4.2 Feature Selection 85
    3.4.3 Outcome Measures and Expected Results 86
  3.5 Research Significance 87

4 Results 89
  4.1 Summary Statistics of the Collected Data Across Conditions 89
  4.2 Data Visualization Across Different Experimental Conditions 90
  4.3 Negative Binomial Visual Fixation Models 94
    4.3.1 Item Parameter Estimates 94
    4.3.2 Person Parameter Estimates 95
  4.4 Three-Way Factor Model Parameter Estimates 98
    4.4.1 Item Parameter Estimates 99
    4.4.2 Variance-Covariance Estimates 100
    4.4.3 Item-Side Variance-Covariance Structure 100
    4.4.4 Person-Side Variance-Covariance Structure 102
    4.4.5 Assessing the Item-Wise Data Model Fit 102
  4.5 Assessing Test-Taking Behaviors Across Different Experimental Conditions 104
    4.5.1 Impact of Having Pre-knowledge of Test Items on Item Characteristics 105
    4.5.2 Impact of Having Pre-knowledge of Test Items on Test-Takers' Behavior 108
  4.6 Use of Person-Fit Statistics to Classify Different Responding Behaviors 111
  4.7 Use of Data Mining Methods to Classify Different Responding Behaviors 113
    4.7.1 Representative Unsupervised Learning Methods 116
    4.7.2 Representative Supervised Learning Methods 119

5 Discussion 127
  5.0.1 Limitations for the Current Work 130
  5.0.2 Recommendations for Future Directions 131

A List of Variable Names 134
B Summary Statistics of All the Variables 136

List of Tables

1.1 General guidelines of data forensics analysis 2
2.1 Representative person fit indices 21
2.2 Guttman scale and index calculation 27
2.3 Advantages and disadvantages of various data mining methods for detecting aberrant test-taking behaviors 44
3.1 Input psychological and biological variables for data mining methods 86
4.1 Number of subjects in each condition 90
4.2 Item parameter estimates 95
4.3 Variance-covariance estimates 96
4.4 Item parameter estimates of three-way factor model 99
4.5 Variance-covariance estimates of three-way model 101
4.6 Item parameter estimates across different experimental conditions 107
4.7 Person-side correlation matrix estimates 109
4.8 Sensitivity and specificity for PFS IRT- and RT-based methods 112
4.9 Classification accuracy for K-means methods with three groups 116
4.10 Sensitivity and specificity for K-means methods with two groups 117
4.11 Classification accuracy for KNN methods with three groups 121
4.12 Sensitivity and specificity for KNN methods with two groups 121
4.13 Classification accuracy for RF methods with three groups 124
4.14 Sensitivity and specificity for RF methods with two groups 125
A.1 Variable names 135
B.1 Summary statistics of all the variables 137

List of Figures

2.1 Relationship between several representative indicators 50
2.2 Conditional independence of item responses and response times given latent ability and speediness 56
2.3 A mixture modeling approach to investigate the intraindividual variation in responses and response times 57
2.4 A conditional joint modeling approach for locally dependent item responses and response times 58
3.1 A graphical representation of the negative binomial fixation model 68
3.2 Two items with fitted fixation counts 69
3.3 A graphical representation of the negative binomial fixation model 70
3.4 Trivariate joint model approach of item response, response time, and visual fixation counts 76
3.5 Graphical demonstration of the posterior predictive model checking (PPMC) method 81
4.1 Scatterplots of essential variables under condition 1 91
4.2 Scatterplots of essential variables under condition 2 92
4.3 Scatterplots of essential variables under condition 3 93
4.4 Individual test engagement estimates based on the NBFM 97
4.5 Individual test engagement estimates based on the NBFM-LT 97
4.6 Individual test engagement estimates based on the NBFM-QT 98
4.7 Scatterplots for item parameter estimates 101
4.8 Scatterplots for person parameter estimates 102
4.9 Posterior predictive p-values for the 1-PL IRT model, log-normal response time model, and negative binomial visual fixation counts model 104
4.10 Item parameter estimates across distinct experimental conditions 108
4.11 Scatterplots for person-side parameter estimates, with a loess non-parametric smoothed curve plotted in each panel 110
4.12 PFS performance in classifying different types of responding behaviors 112
4.13 Pair-wise correlations between features 114
4.14 Feature importance 115
4.15 Number of optimal clusters based on the K-means method 117
4.16 Segregations among the three groups based on the K-means method 118
4.17 The optimal number of neighbors based on KNN 120
4.18 Segregations among the three groups based on the KNN method 123
4.19 Number of trees in the random forest 124
4.20 Classification tree as a demonstration of classifying different types of responding behaviors 125

Chapter 1: Introduction

Over the last decade, the number of cheating-related test security events has grown (Wollack & Fremer, 2013), especially on tests that aim to assess student achievement.
These incidents have become more discernible with the use of computer-based testing (CBT), probably due to the high number of tests given, which can lead to higher rates of item exposure. Cheating behavior on educational and psychological tests has been known to compromise the accuracy of results on assessments of student achievement (Cizek & Wollack, 2017; Meijer & Sijtsma, 1995; W. J. van der Linden & Guo, 2008; van Krimpen-Stoop & Meijer, 2001), and thus to influence the inferences drawn from these scores. These undesirable outcomes are exacerbated in high-stakes, competitive assessment scenarios, in which fraudulent test-taking behavior not only influences the scoring of the deviant test-taker but also harms the other test-takers against whose scores the questionable scores are directly compared (Sinharay, 2017). Meijer (1997) suggested that the existence of misfitting responses in test data could negatively impact the reliability of test scores and the validity of the inferences drawn from the scores on these tests. Hendrawan, Glas, and Meijer (2005) also indicated that model-data misfit due to aberrant response patterns negatively impacts item parameter estimation, which could result in inaccurate latent ability estimates of examinees.

In order to secure exams from unethical test-taking behaviors like cheating, a number of statistical methods have been proposed in recent years for detecting different types of test-taking behaviors. The Council of Chief State School Officers (CCSSO) and the Association of Test Publishers (ATP) have suggested the following guidelines for searching for misconduct (see Table 1.1).

Table 1.1: General guidelines of data forensics analysis

  Data Forensics Analysis                | Types of Testing Irregularities
  ---------------------------------------|-----------------------------------------------
  Unusual score gains or losses          | Coaching on actual test content,
  (test-retake)                          | "helping" during an examination
  Eraser (answer changing) analysis      | Changing answers by educators,
                                         | inappropriate assistance during testing
  Similarity analysis (collusion)        | Sharing answers during testing, teachers
                                         | helping before or during testing, illicit
                                         | use of stolen test questions
  Person fit analysis (aberrant wrong    | Inconsistent response patterns, such as
  and right answer patterns)             | answering difficult questions correctly
                                         | while missing easy questions
  (Olson & Fremer, 2013)

Based on these guidelines, many statistical indices have been created for detecting aberrant test-taking behaviors. Response similarity indices are used to evaluate the agreement between two response vectors from two test takers, mainly focusing on flagging answer-copy cheating or collusion among individual test-takers. Representative indices are the K index (Holland, 1996a); the K1, K2, and S1 indices (Sijtsma & Meijer, 1992; Sotaridona, van der Linden, & Meijer, 2006a); the Generalized Binomial index (W. J. van der Linden & Sotaridona, 2006); and the ω index (Wollack, 1997). Person-fit indices are computed to assess different response patterns of test-takers, and can also be used for detecting copy cheating and other types of behaviors such as pre-knowledge cheating and item stealing (e.g., Belov & Armstrong, 2010; Fox & Marianti, 2016; Sinharay, 2017). Erasure detection indices (EDIs) are utilized to detect suspicious answer-changing behaviors (e.g., Sinharay & Johnson, 2017; Wollack & Cizek, 2016). Gain score analysis (GSA) is more focused on flagging teacher cheating; in other words, GSA is mainly used to catch unexpected test score fluctuations at the group level. Representative GSA methods include those proposed by Bishop and Egan (2016) and Skorupski and Egan (2011).

1.1 Limitations of Previous Works

The use and effectiveness of many of the indices used for aberrant behavior detection have been compared and reported in many studies (e.g., Karabatsos, 2003; Reise, 1990; Sinharay, 2017).
Many of these methods have been found to be sensitive in detecting different types of behaviors. However, despite the nuanced successes of these methods in detecting aberrant testing behavior, they have a number of limitations. The detection power of the previously introduced indices and statistics is constrained by their corresponding inputs, such as item responses or response times, without further consideration of other information about the test taker. Also, in recent years many testing programs have moved from conventional paper-and-pencil tests to computer-based tests and testing environments. A wealth of test-taker-related multimodal behavioral data (item responses, response times, and process information) is collected in real time during the administration of tests. Yet most of the previously mentioned indices have limited power to incorporate this high-dimensional behavioral information into their functional forms. Furthermore, current detection methods are isolated from one another in detecting different types of aberrant behaviors due to the nature of their designs. As a result, a unified platform for aggregating the detection power of the most effective methods is lacking.

Some recent methodological investigations have attempted to address the aforementioned limitations (e.g., Dai, 2013; Fox & Marianti, 2017a). Dai (2013) introduced a mixture Rasch model to explore underlying latent groups by incorporating covariates (collateral information), which could positively impact classification accuracy. Fox and Marianti (2017a) proposed a person-fit test that accounts for relations between item and person characteristics by jointly modeling item responses and response times. Though these methods are good examples of integrating auxiliary information to more accurately separate aberrantly behaving test takers from the normally behaving group, they are still deficient in a number of ways.
First, incorporating high-dimensional input variables with complex structures is challenging under current modeling frameworks. For example, it becomes increasingly difficult for models to converge as more covariates are added to the mixture model. Also, many of these variables could be nonlinearly related to an outcome of interest. Ignoring such nonlinearities among the covariates may mean that the fitted linear model fails to accurately capture a systematic pattern between the outcome variable and a set of predictors (Huber, 1991). Second, the power to detect aberrant behaviors based on these methods using item responses or response times alone has been quite modest (e.g., Fox & Marianti, 2017a). Inclusion of essential behavioral indicators has the potential to improve the power to detect aberrant test-taking behaviors.

1.2 Conceptual Framework for Current Study

In order to address the previous challenges, this study aims to create, develop, investigate, and test new approaches that incorporate bio-information technology, such as eye-tracking, into traditional psychometric models (e.g., IRT and RT models). In addition, the bio-information measures and other psychometric measures can be used as inputs for various machine-learning methods in the detection of cheating and other aberrant testing behaviors in computer-based testing scenarios. In other words, cheating detection on innovative large-scale assessments with big data techniques augmented by bio-information technologies will be explored. Data mining algorithms, a class of methods for clustering cases, can be a convenient methodological platform for detecting fraudulent test-taking behaviors that overcomes the noted limitations of traditional methods. Sensitivity to detect aberrant behavior can potentially be increased by incorporating not only process and biometric data as inputs into these algorithms, but also indices based on traditional approaches.
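The nonlinearity concern raised above can be made concrete with a small simulation. The sketch below is purely illustrative (the quadratic relation and variable names are hypothetical, not taken from this study): a purely nonlinear relation between a covariate and an outcome is nearly invisible to a straight-line fit, but is captured as soon as the model is allowed to bend.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, 5000)           # a centered covariate (hypothetical)
y = x**2 + rng.normal(0, 0.1, 5000)    # outcome depends on x only nonlinearly

# Straight-line fit: the slope is near zero, so almost no variance is explained.
slope, intercept = np.polyfit(x, y, 1)
resid_lin = y - (slope * x + intercept)
r2_lin = 1 - resid_lin.var() / y.var()

# Quadratic fit: the same data, one extra term, nearly all variance explained.
coefs = np.polyfit(x, y, 2)
resid_quad = y - np.polyval(coefs, x)
r2_quad = 1 - resid_quad.var() / y.var()

print(r2_lin, r2_quad)   # r2_lin is close to 0; r2_quad is close to 1
```

A linear model applied to such a covariate would report it as uninformative, which is exactly the failure mode flexible learners such as tree ensembles are meant to avoid.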
Additionally, in contrast to applications involving traditional IRT-based and RT-based methods, data mining algorithms have the facility to examine both linear and nonlinear relations among variables. Therefore, complex interactions between background, psychometric, and biometric variables would be better approximated than by simply assuming linear relationships among them.

Eye-gaze-pattern-related variables recorded by an eye-tracking system, in particular, will be incorporated into the data-mining platform as essential biometric indicators. The eye-tracking system has the potential to capture cheating and other aberrant test-taking behaviors with bio-information gathered through the analysis of eye movement patterns (e.g., saccades, fixations, pupil size). Such information in the context of large-scale assessment testing scenarios may help address and answer some interesting questions, such as: (a) Where does an examinee look, and what does this information tell us about aberrant testing behavior? (b) What information can be ignored? (c) When does blinking occur, and what information does that convey about the examinee's behavior? (d) How does the pupil react to different stimuli? This type of bio-information can be subtly gathered in real time on test-takers as they answer assessment items in a CBT environment. Thus, many supervised and unsupervised statistical learning methods incorporating biometric information will be explored in this study. Also, a new eye-gaze-pattern-based model will be proposed, which will be further modeled jointly with item response theory (IRT) and/or response time (RT) models. Given the purpose of this research, several research questions to be addressed are:

1. Do differences exist in classification accuracy, measured by sensitivity and specificity rates, across the different data mining methods?

2.
Which data source - item responses, response times, or eye-tracking data - is most predictive for detecting aberrant test-taking behavior?

3. Do differences exist in the classification accuracy of aberrant testing behaviors between conventional standalone approaches and data mining methods? If differences exist, can they be partially explained by incorporating both biometric and psychological data?

1.3 Research Significance

This study explores ways of incorporating biological information into traditional psychological methods by utilizing data-mining algorithms to better understand test takers' behaviors. First, this methodological work not only has the potential to aid administrators of large-scale assessments in ferreting out aberrantly behaving examinees, but can also lead to future research in the area of test security. Second, this study has the potential to create new eye-tracking-measure-oriented models and to develop methods that may flag aberrances with increased accuracy by incorporating biometric measures into the current framework. An advantage of the newly proposed statistical methods is that they would not be based solely on one source of information, such as item responses, but rather on multiple sources of information about test-takers, including data stemming from bio-information technologies and integrated log file information. Ideally, all of this information could serve as inputs to be aggregated using highly efficient computational methods, such as cloud computing, in the data-mining framework. Third, through the methodological investigations and analyses using empirical data, the signal-to-noise ratio (SNR) could be increased, meaning greater and more accurate classification with high sensitivity to different aberrant testing behaviors.
Fourth, this study would further demonstrate the performance, in terms of classification accuracy, of different methods (e.g., item-response-based person-fit analysis, response-time-based fraud detection methods, K-means clustering, support vector machines, random forests, finite mixture modeling, neural networks) in detecting different types of aberrant behaviors, such as pre-knowledge cheating and copy cheating. This study could also edify the research community by comparing the cheating detection accuracy achieved by incorporating both biometric and psychological information with results from traditional aberrant test-taking detection methods. Moreover, new models, such as gaze-fixation-counts-based models or mixed-effects models for jointly modeling eye-tracking data with traditional psychometric item response models, would be proposed. Additionally, some machine learning methods yield results that can be interpreted in a straightforward manner, which could then be communicated to and perhaps more easily understood by stakeholders. With high-quality measures such as eye-tracking data and indices computed from the psychometric model, the rate of false negatives could be better controlled.

Given these potential impacts on the enhancement of the security of high-stakes assessments, this dissertation begins with an overview of the various methods for aberrant behavior detection, including traditional item response and response time based methods as well as a set of data mining methods, followed by an introduction to eye-tracking technologies. Chapter 3 describes the detailed research design and the proposed methods. Results from the current study are presented in Chapter 4. Implications, limitations, and future directions of the present study are discussed in Chapter 5.

Chapter 2: Literature Review

Chapter 2 begins with a brief overview of recent developments in test security.
Many item response and response time based detection methods, along with their unique features, will be discussed. Next, this chapter introduces various data mining methods that could be utilized for test security purposes. In addition, different types of eye-tracking measures will be introduced. Finally, approaches to jointly modeling distinct latent constructs and their potential applications for incorporating biometric information will be reviewed.

2.1 Test Security

Maintaining the security and confidentiality of student tests in any assessment program is critical for ensuring valid test scores and providing standard and equal testing opportunities for all students. Cizek (1999) made 18 recommendations for ensuring test security in large-scale educational achievement testing programs. His recommendations centered on how to establish rigorous standards for test safety and administration procedures (e.g., sealing each test booklet, preventing test administrators from accessing the test booklets prior to the test). These recommendations provide a concrete foundation for further discussion of the problem of test security.

According to Wollack and Fremer (2013), test security generally requires policies, procedures, and guidelines with the following characteristics:

1. Test safety should be guaranteed. Steps to ensure this might include the sealing of individual test booklets prior to delivery and pre-inspecting examination locations.

2. Test administrations must be strict, meaning examination instructions and policies must be made clear to all test takers.

3. Potential cheating behavior before, during, and after test administrations should be carefully monitored and prevented.

4. Test administrators should be provided training to appropriately proctor and report potential cheating behaviors.

The ultimate purpose of establishing test security is to ensure test results that are valid and accurate for assessing the performance of examinees.
Test security is vital to maintaining the fairness of the test for all test takers. Without test security, the validity and reliability of the test scores would be questioned for all test takers, regardless of whether they engaged in cheating behaviors (e.g., Hendrawan et al., 2005; Nering & Meijer, 1998). Insecure test environments also raise ethical issues: test takers are more likely to cheat, especially on high-stakes tests, when the test is not secured (Wollack & Fremer, 2013). An insecure test can also negatively impact measurement classification validity (Hendrawan et al., 2005). Hendrawan et al. (2005) indicated that misfitting response patterns caused by cheating behaviors, such as pre-knowledge cheating and cheating by copying answers, could result in biased estimates of item parameters. Sotaridona, van der Linden, and Meijer (2006b), who conducted a similar study focusing on this issue, found that item difficulty and discrimination parameter estimates were consistently larger, and that the standard errors of the estimates were larger, when cheating behavior occurred. Hendrawan et al. (2005) also showed that misfitting response patterns caused by cheating behavior lead to inaccurate mastery classification decisions. These studies have shown that the validity and reliability of test results are sabotaged if test security is not maintained. Test security is not only about pre-administration prevention and the detection of aberrant testing behaviors that occur during the test itself; the term also encompasses training professional and ethical administrators and exam supervisors to implement the test. According to Caveon's webinar (Schoenig, Geraets, & Mulkey, 2016), test security has a conceptual framework. Before test administration, exam booklets should be acquired and distributed in a safe and confidential way.
During the test administration, supervisors should proctor examinees closely and report aberrant behaviors honestly and promptly. At the completion of the assessment, test materials should be managed appropriately and kept undisclosed to prevent changes to item responses. Wollack and Fremer (2013) described unethical behaviors by overseers of test administration (teachers or exam supervisors), such as failing to report violations and giving extra help during test administrations. Such actions could undermine ethical standards, which could have a corrupting influence on professional conduct and damage institutional confidence in using test results to make high-stakes decisions. Therefore, maintaining test fairness is absolutely necessary for correctly interpreting test scores and for preserving the integrity of the testing regimen that organizations rely upon to make critical decisions.

2.2 Statistical Methods for Identifying Aberrant Testing Behavior

In recent years, a number of statistical methods have been developed for detecting different types of aberrant test-taking behaviors. The Council of Chief State School Officers (CCSSO) and the Association of Test Publishers (ATP) have suggested the following guidelines for searching for misconduct (see Table 1.1). Based on this guidance, specific indices are suggested for detecting aberrant testing behavior aligned to the different testing environments.

2.2.1 Similarity Analysis (Collusion, Answer-Copy)

Detecting collusion and answer copying has become an essential issue in high-stakes testing. In order to maintain the fairness of the test, many scholars are developing a variety of methods to detect and prevent answer copying and collusion behaviors among test takers.
Wollack (2011) indicated the following types of test collusion: (1) illegal coaching by a teacher or test-prep school, (2) examinees accessing stolen test content posted on a study forum, (3) examinees copying answers from each other during an exam, (4) examinees harvesting and sharing exam content via e-mail or the internet, and (5) teachers or administrators changing answers after the test has been administered. Based on these behavioral features, many methods have been proposed to identify examinees who engaged in collusion (Bay, 1995; Belleza & Belleza, 1989; Cody, 1985; Frary, Tideman, & Watts, 1977; Hanson, Harris, & Brennan, 1987; Holland, 1996b; Sotaridona & Meijer, 2002a, 2003; Wollack, 1997). In general, response similarity analysis attempts to calculate the likelihood of agreement between two response vectors (Zopluoglu, 2016). Some of these methods focus on matching incorrect responses between two response vectors; some use both matched incorrect and matched correct responses as evidence of collusion. The following section provides a historical and technical overview of the methods used for detecting unusual response similarities.

2.2.1.1 Identical Errors Analysis

Bird (1927) derived an empirical null distribution of the number of identical errors by randomly pairing test takers across different locations. The distribution of the number of identical errors for each pair was used as a norm for a specific test. To determine who the cheaters were, a cut-off value of the mean plus one standard deviation of the distribution was used. If the number of identical errors within any pair taking the specific exam was larger than the cut-off value, the pair was flagged for an unusual degree of agreement. This method laid the foundation for many similarity analyses used in educational testing. It can be easily implemented on a variety of tests in different formats.
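Bird's empirical-null procedure can be sketched in a few lines of code. This is a didactic sketch, not Bird's original computation: the function names, the list-of-answer-vectors data layout, and the number of random pairings are illustrative assumptions.

```python
import random
from statistics import mean, stdev

def identical_errors(resp_a, resp_b, key):
    """Count items on which two examinees chose the same wrong answer."""
    return sum(1 for a, b, k in zip(resp_a, resp_b, key) if a == b != k)

def empirical_cutoff(responses, key, n_pairs=1000, seed=1):
    """Bird-style empirical null: randomly pair examinees, record their
    identical-error counts, and return the mean plus one standard
    deviation of that distribution as the flagging cut-off."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_pairs):
        a, b = rng.sample(responses, 2)
        counts.append(identical_errors(a, b, key))
    return mean(counts) + stdev(counts)
```

A pair whose identical-error count exceeds the cut-off would then be flagged for an unusual degree of agreement, mirroring the rule described above.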
However, its weakness is that it does not provide a general index for flagging cheaters, and the cut-off value varies across different tests. Many other scholars have tried different methods to improve this work (Angoff, 1974; Crawford, 1930; Dickenson, 1945; Saupe, 1960). These methods were based on the idea of using empirical distributions and have been relatively under-researched. Zopluoglu (2016) provided three reasons for this: (1) researchers lacked access to large-scale datasets, (2) computational power was limited, and (3) null distributions were exam-specific, and thus could not be generalized to other tests in a straightforward manner.

2.2.1.2 Error Similarity Analysis

The error-similarity analysis index (Belleza & Belleza, 1989), based on tests using multiple-choice items, was proposed for detecting test collusion by analyzing the probability of choosing the same series of incorrect alternative choices for every possible pair of students. The index is defined as follows:

\frac{N!}{k!(N-k)!} P^k (1-P)^{N-k},    (2.1)

where P stands for the probability of any two students choosing the same wrong distractor on a multiple-choice question. This value was assumed to be 0.4, since it is not reasonable to expect that all incorrect alternatives will have the same likelihood of being selected. N signifies the total number of items on the test, while k is the number of items that received the same incorrect answers. The probability of choosing the same incorrect distractor by chance can be calculated for each pair of students. If there are S students, then the number of comparisons is S(S-1)/2. The probability for each pair of students is used to determine the probability of collusion behavior occurring by chance. When the sample size is large, the number of identical errors approximately follows a normal distribution with mean NP and standard deviation \sqrt{NP(1-P)}.
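The binomial index in Equation 2.1 and its normal-approximation flagging rule can be sketched directly; the function names here are my own, and p = 0.4 is the assumed distractor-match probability from the text.

```python
from math import comb, sqrt

def error_similarity(k, n, p=0.4):
    """Equation 2.1: binomial probability of k identical wrong answers
    among n commonly missed items, where p is the assumed chance that two
    examinees pick the same distractor."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def flag_pair(k, n, p=0.4):
    """Normal approximation: flag counts exceeding the mean NP by more
    than two standard deviations sqrt(NP(1 - P))."""
    return k > n * p + 2 * sqrt(n * p * (1 - p))
```

For instance, error_similarity(5, 10) returns roughly 0.2007, the worked value discussed in the next paragraph, while flag_pair(8, 10) flags a pair with eight identical errors on ten items.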
Test takers whose error-similarity scores lie more than two standard deviations above the mean of the distribution of error-similarity scores would be considered to have colluded. For example, suppose the probability of two students choosing the same wrong answer is 0.4. The probability of choosing the same five answers out of 10 wrongly answered items is 10!/(5!\,5!)\,0.4^5\,(0.6)^5 \approx 0.2007. By calculating all possible pairs, those who have error-similarity scores more than two standard deviations above the mean would be flagged. This line of research has several limitations. First, the value of P is assumed and fixed, because it is not reasonable to expect all incorrect alternatives to be selected with equal likelihood. Second, this procedure requires the use of the binomial distribution in order to obtain valid results; the sample size, therefore, needs to be sufficiently large. Third, the error-similarity scores are test-length dependent, making it difficult to compare across tests with different numbers of items. All of these methods were specifically designed for paper-pencil tests due to the limited computational power at the time the research was carried out. However, the advantage of this method is that it is easy to understand and calculate, especially when a computer is available for analyzing the data. Based on this work and that of Saupe (1960), a series of indices were derived: K1, K2, and S (Holland, 1996b; Sotaridona & Meijer, 2002b). Instead of using a fixed value of P (the probability of any two students choosing the same wrong distractor on a multiple-choice question), these indices used linear or quadratic regression equations, or a log-linear model, respectively, to predict P.

2.2.1.3 Generalized Binomial Similarity

The previously reviewed studies focused only on identical incorrect answers. W. J. van der Linden and Sotaridona (2006) proposed a generalized binomial test method that used a compound binomial distribution for the number of identical incorrect and correct responses between any pair of examinees. The formula is as follows:

P_{M_c} = \sum_{o=1}^{O} P_{ico} P_{jco},    (2.2)

where P_{M_c} denotes the probability of a match between the ith and jth examinees on the cth item. The probabilities of selecting the oth response alternative of the cth item for examinees i and j are P_{ico} and P_{jco}, respectively. The probability of observing m matches on C items between two response vectors is computed as

f(m; C) = \sum \prod_{c=1}^{C} P_{M_c}^{\delta_c} (1 - P_{M_c})^{1 - \delta_c},    (2.3)

where \delta_c is an indicator of whether or not a pair of examinees has the same response to item c. The summation is across all possible combinations of m matches on C items. The upper tail of this compound binomial distribution is used as the cut-off region to flag pairs with an unusual degree of agreement. For the methods previously described, the binomial distribution is the key component utilized to flag people who are potentially copying from each other in the similarity analysis. Usually, the Type I error rate and power are utilized to evaluate the performance of these proposed indices. The Type I error rate indicates the probability of an honest test taker being incorrectly detected as a cheater, while power represents the probability of accurately detecting pairs who colluded on the test. There are other variations, such as the \omega statistic (Wollack, 1997) and the K statistics, based on the generalized binomial test method (W. J. van der Linden & Sotaridona, 2006). Based on a simulation study (Zopluoglu & Davenport, 2012), the K index yielded high power in detecting copy cheating, and it is used by some large testing companies such as Educational Testing Service (ETS) and the College Board. All of the proposed similarity analysis methods could be utilized in paper-pencil, computer-based, and internet-based tests.
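The generalized binomial computation of Equations 2.2 and 2.3 can be sketched as follows. The recursion over items (a compound, or Poisson, binomial) is my implementation choice for evaluating the sum over all match combinations; the function names and input layout are illustrative.

```python
def match_probability(p_i, p_j):
    """Equation 2.2: chance that examinees i and j independently select
    the same alternative on an item, given their option-selection
    probability vectors."""
    return sum(a * b for a, b in zip(p_i, p_j))

def match_count_distribution(p_match):
    """Equation 2.3 by recursion: distribution of the number of matched
    responses across items with per-item match probabilities p_match."""
    dist = [1.0]
    for p in p_match:
        new = [0.0] * (len(dist) + 1)
        for m, d in enumerate(dist):
            new[m] += d * (1 - p)      # item does not match
            new[m + 1] += d * p        # item matches
        dist = new
    return dist

def upper_tail_p(p_match, observed_matches):
    """Upper-tail probability used to flag unusually high agreement."""
    return sum(match_count_distribution(p_match)[observed_matches:])
```

A pair would be flagged when upper_tail_p falls below a chosen significance level, mirroring the upper-tail cut-off described above.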
For computer adaptive testing, there is a lack of literature discussing similarity analysis methods for copy-cheating detection, which leaves opportunities for future research. In contrast to comparing full item response vectors from examinees to detect cheating behavior, some of the analyses above have relied only on error similarity.

2.2.2 Gain Score Analysis

Cannell (1988) questioned the integrity of achievement gains being made across all states on norm-referenced tests. He pointed out that achievement gains often occurred as a result of unethical teaching practices at the local level, which consequently obscured test security measures at the higher level (Cannell, 1988). Some security-related research that followed Cannell's report focused on causal factors analysis (e.g., Shepard, 1990; Stonehill, 1988). An example of causal factors analysis demonstrated how teaching to the test could result in increased gains in standardized test scores (Shepard, 1990). However, it took more than a decade before statistical methods were considered for detecting unexpected score gains in large-scale assessment (Bishop, Liassou, Bulut, & Seo, 2011). Jacob and Levitt (2003, 2004) proposed a method for using unexpected gain scores to detect teacher cheating. Their method combined two indicators: (1) unexpected test score fluctuation and (2) unusual student response patterns. The authors applied a simple method of ranking the classroom-level gain scores and comparing the rankings of all the classes across two time points. If the scores of some classes increased unexpectedly, the answer sheets of the students were checked. The presence of both indicators suggested aberrant testing behavior, such as students receiving help during the test. Consequently, both the students and the teacher would be flagged as cheaters.
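The ranking step of Jacob and Levitt's screen can be sketched as follows. The function name, the dictionary input format, and the flagging fraction are illustrative assumptions, not details from their papers.

```python
def flag_unusual_gains(gains_by_class, top_frac=0.05):
    """Rank classroom-level gain scores between two time points and return
    the classes with the most extreme gains, whose students' answer sheets
    would then be checked for unusual response patterns."""
    ranked = sorted(gains_by_class, key=gains_by_class.get, reverse=True)
    n_flag = max(1, int(len(ranked) * top_frac))
    return ranked[:n_flag]
```

A class appearing in this list would only be treated as suspect if the second indicator, unusual student response patterns, is also present.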
Other analyses follow a similar idea of comparing the cumulative distribution functions (CDFs) of total scores or scaled scores between two yearly assessment periods (Ho, 2008). If the difference between the two years' CDFs is large, if the high score range shows a negative difference, or if the percentage of high scores in the second year decreased relative to the first year, then unexpected gains between the two time points are indicated (Ho, 2008). However, no universal cut-off value currently exists for these methods, so cut-offs are applied rather subjectively.

2.2.3 Erasure (Answer Changing) Analysis

Qualls (2001) reported that students rarely erase their original responses. Primoli, Liassou, Bishop, and Nhouyvanisvong (2011) found that erasures occur on roughly one out of every 50 items. Additionally, other studies indicated that erasures occurred with increased frequency on more difficult items and that test takers with middle to high abilities were more likely to change their answers (Mroch, Lu, Huang, & Harris, 2014; Primoli et al., 2011). Based on these studies, several methods were proposed for uncovering suspicious answer changing. However, erasure analysis tends to focus on the group level, such as classrooms and schools, instead of the individual level (Primoli et al., 2011). Individual-level behavior is usually checked by other kinds of analyses, such as similarity analysis and person-fit analysis. In erasure analysis, the wrong-to-right (WR) change is frequently used for statistical modeling. The most straightforward analysis is the group-level Z-test (Bishop et al., 2011). For example, the WR counts for a specific class or school are tested against the state population mean. The group is flagged if the test statistic is larger than a critical value at a certain significance level. Bishop et al. (2011) also proposed a simple linear regression method that used the mean of the total class erasures (TE) as the predictor and the mean of the class WR counts as the outcome variable. The authors found this method could account for a substantial proportion of variance at the group level. However, this method suffers from heteroscedasticity, as the conditional WR variance increases as the TE sum increases (Bishop et al., 2011). In order to overcome this problem, Poisson regression was used and resulted in a better fitting model. Groups with either linear regression or Poisson residuals greater than 1.96 were considered suspicious answer-changing groups (Bishop et al., 2011). Later, this method was extended to a hierarchical linear modeling framework that used a two-level random-intercepts model within schools and a two-level random-slope model that regressed the number of WR changes on individual-level TE counts. They concluded that both models were more appropriate and fit significantly better than the simple linear regression method. Ninety-five percent confidence intervals were constructed for each school's slope, and a school was flagged if its interval did not cover the overall mean slope. Like gain score analysis, erasure analysis is a budding research area. Erasure analysis, however, is more focused on identifying aberrant testing behavior of groups rather than of individuals.

2.2.4 Person Fit Analysis

Many person-fit indices have been created for detecting aberrant test-taking behaviors. Person-fit indices are computed to assess different response patterns of test takers, focusing on flagging copy-cheating but also on detecting other types of behaviors such as pre-knowledge cheating and item stealing. The use and effectiveness of person-fit indices for copy-cheating detection is understudied compared to other analytic methods (Cizek & Wollack, 2017).
However, some proposed person-fit indices have shown a high degree of power to detect certain aberrant testing behaviors based on examinees' response patterns. For example, the H^T index (Sijtsma & Meijer, 1992; Sinharay, 2018) has been shown to be effective in detecting pre-knowledge cheating with high power. The central premise behind most person-fit indices is to check whether or not a vector of item responses is aligned with the person's latent ability as estimated from a specific item response theory (IRT) model. Simply speaking, the probability of observing a specific item response vector, given the person's latent-ability estimate from the IRT model used, is evaluated. Representative person-fit statistics used in different testing environments are presented in Table 2.1. There are two primary classifications of these indices: parametric and nonparametric. Some of these methods will be subsequently introduced.

Table 2.1: Representative person-fit indices

  Rasch model:          U (Wright & Stone, 1979); W (Wright & Masters, 1982)
  2PL and 3PL:          l_z (Drasgow, Levine, & Williams, 1985); l*_z (Snijders, 2001); l_s (Sinharay, 2017)
  CAT:                  K (Bradlow, Weiss, & Cho, 1998); T (van Krimpen-Stoop & Meijer, 2000)
  RT based:             residual analysis (van der Linden & Guo, 2008)
  PPMC/Bayesian:        l^t (Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014)
  Guttman-based:        G (Guttman, 1944); H^T index (Sijtsma, 1988; Sijtsma & Meijer, 1992)
  Agreement index:      A (Kane & Brennan, 1980)
  Group-based index:    r_pbis (Donlan & Fischer, 1968)

2.2.4.1 Representative Parametric Indices

U Index. The U index is computed by performing a residual analysis after applying the Rasch model (Rasch, 1961) to a set of examinees' item responses (Wright & Stone, 1979). The Rasch model is the simplest of IRT models in that it is parameterized by a single difficulty parameter.
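The residual logic behind the U index (its formula appears below as Equation 2.4) can be sketched under a Rasch item response function; the function names are mine.

```python
from math import exp

def rasch_p(theta, b):
    """Rasch probability of a correct response, given ability theta and
    item difficulty b."""
    return 1 / (1 + exp(-(theta - b)))

def u_index(responses, theta, difficulties):
    """U statistic: sum of squared standardized Rasch residuals, referred
    to a chi-square distribution with I degrees of freedom."""
    total = 0.0
    for x, b in zip(responses, difficulties):
        p = rasch_p(theta, b)
        total += (x - p) ** 2 / (p * (1 - p))
    return total
```

An examinee who misses easy items but answers hard ones correctly produces a much larger U than a Guttman-consistent pattern of the same total score.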
As a consequence of this parsimonious parameterization, analyses require relatively small sample sizes (i.e., numbers of examinees) to produce reasonable data-model fit (Linacre, 1994). Since residuals are at the heart of the U index computation, the U statistic could alternatively be calculated using other IRT models, such as the 2PL or 3PL models; however, the performance of U under these extensions has not been thoroughly studied. The computation of the U person-fit index follows

U = \sum_{i=1}^{I} \frac{[X_i - P_i(\theta)]^2}{P_i(\theta)[1 - P_i(\theta)]},    (2.4)

where P_i(\theta) is the probability of correctly answering item i given the ability estimate \theta, and X_i is the dichotomous response (0, 1) to item i for a specific person. This index follows a chi-square distribution with I degrees of freedom. Test-taking aberrance would be flagged by a critical value of the distribution at a certain significance level (\alpha). There are other indices based on U that have been proposed in the literature, such as the Z_U, Z_W, and U_B indices. These indices are either normalized versions of the original index (Z_U and Z_W) or apply weighting strategies to different items rather than treating them equally. Results from a comprehensive simulation study (Karabatsos, 2003) showed that U performed better than Z_U, Z_W, and U_B in detecting aberrant test-taking behavior.

Likelihood-Based Indices. The likelihood-based analysis of examinees' response patterns results in an index called l*_z (Snijders, 2001). The l*_z statistic (Snijders, 2001) is the asymptotically correct standardized version of the l_z statistic (Drasgow, Levine, & Williams, 1985). Both of these statistics are based on the l statistic, defined as the loglikelihood of the item-level scores for a test taker:

l = \sum_{i=1}^{I} [Y_{ij} \log P_i(\theta_j) + (1 - Y_{ij}) \log(1 - P_i(\theta_j))],    (2.5)

where Y_{ij}, a random variable, denotes test taker j's response (0 or 1) to item i, and P_i(\theta_j) = P(Y_{ij} = 1) is the probability of a correct answer from test taker j on item i. Then the expected value of the loglikelihood and its variance can be computed as

E(l \mid \theta_j) = \sum_{i=1}^{I} \left\{ P_i(\theta_j) \log \frac{P_i(\theta_j)}{1 - P_i(\theta_j)} + \log(1 - P_i(\theta_j)) \right\},    (2.6)

and

Var(l \mid \theta_j) = \sum_{i=1}^{I} P_i(\theta_j)(1 - P_i(\theta_j)) \left[ \log \frac{P_i(\theta_j)}{1 - P_i(\theta_j)} \right]^2.    (2.7)

In general, the l statistic indicates how likely it is to observe person j's response pattern under a fitted IRT model. Using the l index as a starting point, Drasgow and colleagues (1985) provided a standardized version of the l index in the usual manner, subtracting the expected value in Equation 2.6 and dividing by the standard deviation:

l_z = \frac{l - E(l \mid \theta)}{\sigma(l \mid \theta)}.    (2.8)

The statistic in Equation 2.8 follows a standard normal distribution. In practice, the unknown \theta is replaced by an estimate, \hat{\theta}. With this substitution, the value of l_z decreases with increasing degree of person misfit, and large negative values of the index (i.e., those smaller than -1.645 at the 5% significance level) are indicative of aberrant response behavior. Although the sampling distribution is standard normal asymptotically, one limitation of l_z is that the index is not valid when true abilities are replaced by sample ability estimates (W. Molenaar & Hoijtink, 1990; Reise, 1990). To correct this issue, Snijders (2001) proposed a slight modification of the index to obtain the desired asymptotic distribution with sample estimates of ability instead of the true (unknown) values; the modification was shown to work for different ability estimates and under different IRT models (Magis, Raîche, & Béland, 2012). In essence, the revised index l*_z (Snijders, 2001) modifies both the expectation function and the variance function in Equation 2.8 by taking into account the sampling variability of \hat{\theta}. The l*_z index based on a 3PL model is computed as

l*_z = \frac{l - E(l \mid \theta_j)}{\sqrt{ \sum_{i=1}^{I} \left[ \ln \frac{P_i(\theta_j)}{1 - P_i(\theta_j)} - c_I(\theta_j)\, r_i(\theta_j) \right]^2 P_i(\theta_j)(1 - P_i(\theta_j)) }},    (2.9)

where

c_I(\theta_j) = \frac{ \sum_{i=1}^{I} P'_i(\theta_j) \ln \frac{P_i(\theta_j)}{1 - P_i(\theta_j)} }{ \sum_{i=1}^{I} P'_i(\theta_j)\, r_i(\theta_j) },    (2.10)

and

r_i(\theta_j) = \frac{a_i \exp[a_i(\theta_j - b_i)]}{c_i + \exp[a_i(\theta_j - b_i)]},    (2.11)

and where P'_i(\theta_j) is the first derivative of P_i(\theta_j) with respect to \theta_j, and a_i, b_i, and c_i represent the item discrimination, difficulty, and pseudo-guessing parameters, respectively. The l*_z statistic simplifies when a 1PL or 2PL IRT model is used instead. Additionally, the l*_z statistic is still compared to a standard normal distribution to evaluate whether the test taker is aberrant or not (see, e.g., Magis et al., 2012, for a useful flow chart for the practical implementation of l*_z).

Person Response Function Analysis. Trabin and Weiss (1983) proposed the D index, which utilizes the person response function (PRF; Weiss, 1973) to identify misfitting item-score patterns. The person response function is a non-increasing function of the item difficulty parameter. The D index was intended to compare the difference between the expected PRF, based on a certain IRT model, and the observed PRF. A significant difference between the expected and observed PRFs would be indicative of a misfitting response pattern for that examinee (Trabin & Weiss, 1983). The k items are ordered according to their difficulty parameters. Then, the k items are assigned to S ordered subsets. Each subset contains m items: A_1 = {1, 2, ..., m}, A_2 = {m+1, ..., 2m}, ..., A_S = {k-m+1, ..., k}. The expected PRF is constructed as an estimate of the expected proportion of correct responses in each subset under a certain IRT model, and is calculated as m^{-1} \sum_{g \in A_s} P_g(\hat{\theta}), s = 1, 2, ..., S. The observed proportion is computed as m^{-1} \sum_{g \in A_s} X_g, s = 1, 2, ..., S. The difference between the expected and observed PRFs in subset s is then computed as D_s(\hat{\theta}) = m^{-1} \sum_{g \in A_s} [X_g - P_g(\hat{\theta})], s = 1, 2, ..., S. By taking the summation across all subsets, the D index is computed as

D(\hat{\theta}) = \sum_{s=1}^{S} D_s(\hat{\theta}).    (2.12)

Based on the simulation study conducted by Karabatsos (2003), the detection rate is maximized when the cut-off value of the D index is set to 0.55. Karabatsos also indicated that the D index provides the best performance on pre-knowledge cheating among all of the parametric IRT-based indices.

2.2.4.2 Representative Non-Parametric Indices

Non-parametric person-fit statistics are defined as those that do not depend on any IRT model (implicit in this definition is the fact that an IRT model stems from a particular distribution). There are several reasons why non-parametric person-fit statistics are popular alternatives to the more conventional parametric ones. First, non-parametric indices are relatively easy to compute, since they do not rely on any parametric IRT model. Computation of parametric person-fit indices involves maximum likelihood or Bayesian estimation, which can be computationally demanding compared to the non-parametric counterparts, given the characteristics of the testing situation. Second, non-parametric person-fit indices yield relatively consistent results, whereas parametric person-fit indices often give different results depending on the kind of IRT model used. Several non-parametric indices, namely the G index, the norm conformity index, and the transposed scalability index, will now be presented.

Guttman Scale Index. The Guttman-based index (Guttman, 1944), or G index, measures the degree of reasonableness of an examinee's answers to a set of test items (Karabatsos, 2003; Meijer, 1994). It was the first nonparametric person-fit index developed for detecting aberrant test-taking behavior. The way to flag test-taking aberrance is to count the number of Guttman errors. Let P_m (m = 1, ..., I) denote the proportion of persons who respond correctly to item m. Assume that the I items are ordered according to P_m, from hardest to easiest, such that P_m \le P_n (m = 1, ..., I-1; n = m+1, ..., I). Then, the G index is calculated as

G = \sum_{m=1}^{I-1} \sum_{n=m+1}^{I} I_{mn},    (2.13)

where I_{mn} is an indicator taking on the value of 1 if a person has a Guttman error on items m and n; otherwise, I_{mn} = 0. The G index counts all item-score pairs of the form (1, 0), which are called Guttman errors. A Guttman error means the examinee answered a relatively difficult item m correctly but an easier item n incorrectly according to the Guttman scale, which here orders the items from hardest to easiest. The permitted item-score pairs are (0, 1), (0, 0), and (1, 1). Table 2.2 demonstrates the calculation process for six hypothetical examinees.

Table 2.2: Guttman scale and index calculation

  Examinee   Item 1 (Hardest)   Item 2 (Moderate)   Item 3 (Easiest)   Response pairs      G
  1          1                  1                   1                  (1,1)(1,1)(1,1)     0+0+0
  2          0                  1                   1                  (0,1)(0,1)(1,1)     0+0+0
  3          0                  0                   1                  (0,0)(0,1)(1,1)     0+0+0
  4          0                  0                   0                  (0,0)(0,0)(0,0)     0+0+0
  5          1                  0                   1                  (1,0)(1,1)(0,1)     1+0+0
  6          1                  1                   0                  (1,1)(1,0)(1,0)     0+1+1

Guttman Scale Based Norm Conformity Index. As its name suggests, the Norm Conformity Index (NCI; Tatsuoka & Tatsuoka, 1983) measures the extent of conformity, or consistency, of an individual test taker's response pattern on a set of items, and is defined as

NCI = 1 - \frac{2 \sum_{i=1}^{I-1} \sum_{s=i+1}^{I} y_i (1 - y_s)}{r(I - r)},    (2.14)

where y_i denotes the realization of the test taker's response (0 or 1) to item i (i = 1, ..., I), and the items are ranked according to their difficulty levels, with item s more difficult than item i. The double sum in Equation 2.14 counts the pairs in which the relatively easy item i is answered correctly and the more difficult item s is answered incorrectly, denoted (1, 0); together with pairs in which both items are answered correctly, denoted (1, 1), these are the pair types a Guttman-conforming test taker can produce. The variable r is the unweighted total score for the test taker (r < I).
More specifically, the NCI measures the proximity of the response pattern to a baseline pattern in which all 0's precede all 1's when the items are arranged in a pre-designated order (e.g., conforming to a Guttman scale).

Non-Parametric Transposed Scalability Index. The H^T index (Sijtsma, 1986; Sijtsma & Meijer, 1992) is a nonparametric statistic that is the transposed formulation of the scalability coefficient H (Loevinger, 1948) for items. The H^T index for person n is defined for a complete rectangular dataset of dichotomously scored items, where the rows represent the I test takers and the columns denote the J items, as

H^T(n) = \frac{ \sum_{m=1,\, m \neq n}^{I} \left( J^{-1} \sum_{i=1}^{J} y_{ni}\, y_{mi} - p_n p_m \right) }{ \sum_{m=1,\, m \neq n}^{I} \min\{ p_n (1 - p_m),\; p_m (1 - p_n) \} },    (2.15)

where y_{ni} (either 0 or 1) is test taker n's score on item i, and p_n and p_m denote the proportions of items answered correctly by test takers n and m, respectively. The H^T index is the sum of the covariances between test taker n and the other test takers divided by the maximum possible sum of those covariances. This constrains the range of allowable values to be between -1 and 1. In scenarios in which the responses of test taker n are random, the value of H^T(n) will be close to zero. In a similar manner, responses that are positively correlated with those of all other test takers result in H^T(n) taking on a positive value, while H^T(n) takes on negative values when the test taker's responses are negatively correlated with those of the other test takers. When the data fit the Rasch model, H^T(n) is expected to be somewhat positive (Sijtsma & Meijer, 1992).

2.2.4.3 Representative Response Time Based Index

Although indices based on item responses have been shown to be somewhat effective in uncovering particular types of fraudulent testing behaviors, they do have some limitations.
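The H^T computation in Equation 2.15 above can be sketched directly; the list-of-rows data layout (one 0/1 response vector per test taker) and the function name are assumptions of this sketch.

```python
def ht_index(data, n):
    """H^T index for test taker n: sum of the covariances between person
    n's responses and every other person's, divided by the maximum
    possible sum of those covariances."""
    J = len(data[0])                        # number of items (columns)
    p = [sum(row) / J for row in data]      # proportion correct per person
    num = den = 0.0
    for m, row in enumerate(data):
        if m == n:
            continue
        cov = sum(a * b for a, b in zip(data[n], row)) / J - p[n] * p[m]
        num += cov
        den += min(p[n] * (1 - p[m]), p[m] * (1 - p[n]))
    return num / den
```

A person whose pattern aligns with everyone else's yields a value near +1, while a reversed (negatively correlated) pattern yields a value near -1, matching the range discussed above.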
Due to the simple structure of the item response data, often comprised of 0s and 1s, test takers can imitate normal testing behavior, reducing the ability of these IRT-based methods to identify actual cheaters. This may be especially true for the detection of minor anomalies (e.g., cheating occurring only on certain items), for which methods possessing greater sensitivity are needed. Indices based on examinees' response times represent one such class of methods, using behavioral information above and beyond the examinees' response profiles. Models for response times (see, e.g., Thissen, 1983; van der Linden & Sotaridona, 2006) were introduced to examine the identification of cheating behaviors differently than IRT-based methods. One such method, devised by van der Linden and Guo (2008), assesses aberrant behaviors by examining response time based residuals, which the authors defined as the differences between the actual and predicted RTs of a test taker's answers, using a method of cross-validation. More recently, Marianti and colleagues (2014) suggested

l^t = \sum_{j=1}^{J} \sum_{i=1}^{I} \frac{\log(T_{ij}) - (\lambda_i - \zeta_j)}{\sigma_{e_i}},    (2.16)

where T_{ij} is the response time of test taker j on item i, \lambda_i is the time-intensity parameter representing the average population time required to answer that item, \zeta_j is the speediness parameter for each test taker, and e_{ij} is the residual term of the log response times. All the parameters are estimated based on the RT model proposed by van der Linden (2006), which is

\log(T_{ij}) = \lambda_i - \zeta_j + e_{ij},  e_{ij} \sim N(0, \sigma^2).    (2.17)

Both classes of methods, those using IRT modeling of item responses and those using response time data, have been useful in detecting various aberrant testing behaviors with varying degrees of success. However, these conventional methods of aberrant test-taking behavior detection have several limitations.
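Under the lognormal model in Equation 2.17, the residual aggregation in Equation 2.16 can be sketched for a single test taker. Equation 2.16 aggregates over persons and items; the sketch below computes only the inner, per-person sum, and the constant residual standard deviation and the parameter values in the usage lines are illustrative assumptions.

```python
from math import exp, log

def rt_residual_stat(times, lambdas, zeta, sigma=1.0):
    """Sum of standardized log response time residuals for one test taker
    under log T_ij = lambda_i - zeta_j + e_ij, with e_ij ~ N(0, sigma^2)."""
    return sum((log(t) - (lam - zeta)) / sigma
               for t, lam in zip(times, lambdas))

# A test taker responding exactly at the model-implied times has a residual
# sum of zero; systematically faster responses (as with pre-knowledge of the
# items) push the sum negative.
on_pace = rt_residual_stat([exp(2 - 1), exp(3 - 1)], [2.0, 3.0], 1.0)
too_fast = rt_residual_stat([exp(2 - 1) / 2, exp(3 - 1) / 2], [2.0, 3.0], 1.0)
```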
One limitation of traditional methods is that they are not well-equipped to integrate the vast amount of process data collected as a natural byproduct of computer-based or computer-adaptive testing environments. Data coming from log files as well as other test-taker characteristics are continually generated and recorded at regular intervals during an assessment administration and may very well communicate useful and diagnostic evidentiary information to help uncover patterns of aberrant behaviors. Also, each method is used in isolation from the others. Treating aberrant test-taking behavior detection in this manner does not exploit the potential benefit that aggregating such information across methods might reveal.

2.2.5 Use of Data Mining Methods to Detect Test Fraud

Mining response data to identify clusters of respondents (e.g., such as those who exhibit fraudulent test-taking behavior) is not a new idea in assessment research. In their paper detailing the facets of data mining, Romero, González, Ventura, Del Jesus, and Herrera (2009) explained that one must use a data mining strategy that is appropriate for the type of data one wishes to identify, such as data mining to identify patterns of behaviors. Their explanation indicates that data mining can facilitate the identification of cognitive and behavioral processes (Berkhin, 2006) and, pertinent to the current study, aberrant test-taking behaviors. According to Kerr and Chung (2012), identification of processes within response patterns is typically done with clustering algorithms, which can be classified for the purposes of the current study as either (1) unsupervised machine learning algorithms or (2) supervised machine learning algorithms. Clustering algorithms represent a particular class of unsupervised learning algorithms that will be the primary focus in this dissertation.
Clustering algorithms are processes that use observed similarities or densities in data to identify patterns and group similar observations (Berkhin, 2006). Three unsupervised learning methods will be investigated: (1) K-Means clustering, (2) multivariate normal mixture models, and (3) self-organization mapping. A category of supervised learning methods whose primary function is accurate classification will also be investigated. The approaches to be explored are: (1) K-nearest neighbor (KNN), (2) random forests (RFs), and (3) support vector machines (SVM). A description of each method is presented, followed by some advantages and disadvantages of each algorithm as they relate to aberrant test-taking behavior detection.

2.2.5.1 Unsupervised Machine Learning Methods

K-Means clustering. Although there are several versions of the K-Means algorithm, the current research advocates the version defined by Hartigan and Wong (1979), which is generally accepted as the preferred K-Means algorithm (Berkhin, 2006; R Core Team, 2014). K-Means clustering attempts to partition n observations into K clusters in which each observation belongs to the cluster with the nearest mean. The algorithm begins with a set of K potential centers, which can be defined by the researcher or randomly selected from the data. The choice of initial cluster centers leads to a deterministic partitioning of the space. In other words, K-Means will always return the same clustering solution given the same initial cluster centers (e.g., Steinley, 1985). Since the clustering solution relies heavily on where the algorithm launches from, especially for small datasets (Lattin, Carroll, & Green, 2003), some have argued that the algorithm should be run multiple times from different starting values to ensure the efficacy of the classification (e.g., Celebi, Kingravi, & Vela, 2013; Khan & Ahmad, 2004).
Once the centers are selected, the algorithm assigns all the test takers to their closest centers and recalculates the new centers defined by these clusters. Distance is determined by a user-specified similarity measure, often Euclidean distance or Manhattan distance (Fossey, 2017). The algorithm goes through multiple iterations, checking each test taker (e.g., response pattern) to see if it should be moved to a different cluster based on the centers' updated coordinates. If so, it changes the test taker's cluster membership, updates the centers' coordinates, and continues to the next iteration until it converges on a solution in which no points are switched between clusters. The K-Means clustering algorithm offers several main advantages and disadvantages for aberrant behavior detection. These include:

1. The K-Means algorithm is easy to implement. It only requires practitioners to specify the number of clusters to initiate the algorithm. Usually, in test security investigations, we expect to separate aberrantly behaved test takers from the normal population. Thus, two underlying clusters could reasonably be assumed as the number of initial clusters. However, this could also be a disadvantage if users have limited information with which to determine the number of clusters underlying the data.

2. The K-Means algorithm can be computationally efficient with a high-dimensional dataset. The algorithm relies on a nonparametric distance measure to classify observations, consuming less computational memory than parametric methods, which require estimation of model parameters (Hastie, Tibshirani, & Friedman, 2009). Recently, due to the large volumes of process data generated during computer-based testing, K-Means could potentially be useful for analyzing high-dimensional data to flag aberrant test takers in real time. However, due to its nonparametric nature, the K-Means algorithm is sensitive to the initial cluster centers.
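The multiple-random-starts strategy mentioned above can be sketched in Python with scikit-learn's KMeans (which implements Lloyd's variant rather than Hartigan and Wong's, but the multiple-starts logic is the same; the simulated two-group data are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical feature vectors: a large normal group and a small aberrant group
normal = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
aberrant = rng.normal(loc=3.0, scale=0.5, size=(10, 2))
X = np.vstack([normal, aberrant])

# n_init=25 reruns the algorithm from 25 random sets of initial centers
# and keeps the best solution, mitigating the sensitivity to the start
km = KMeans(n_clusters=2, n_init=25, random_state=0).fit(X)
print(sorted(np.bincount(km.labels_).tolist()))  # -> [10, 50]
```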
Many solutions have been proposed for dealing with this issue (Li, 2011), but these extensions could potentially sacrifice a certain degree of computational efficiency.

Finite mixture modeling (FMM). A model-based clustering method that might be useful in identifying aberrant test-taking behaviors is the finite mixture model, specifically mixtures of multivariate distributions. Mixtures of multivariate distributions (Everitt, 1981; Titterington & Makov, 1985) have been applied to a wide range of statistical methodology and take the general form

f(s_j \mid \pi, \theta) = \sum_{k=1}^{K} \pi_k f_k(s_j \mid \theta_k), \quad (2.18)

where the distribution f is a mixture of K component densities f_1, ..., f_K, and s_j is a p-dimensional vector containing scores for individual j (j = 1, ..., n) on a set of p observed continuous random variables. The vector \pi = (\pi_1, ..., \pi_{K-1})' contains the mixing proportions, with the caveats that 0 \le \pi_k \le 1 for all k = 1, ..., K and \sum_{k=1}^{K} \pi_k = 1. The vector \theta' = (\theta_1', ..., \theta_K') contains all unknown parameters of the K subpopulations, where \theta_k = (\mu_k', vech(\Sigma_k)')'. The operator vech(\Sigma_k) denotes a half-vectorization of the symmetric matrix \Sigma_k formed by stacking only the lower triangular part of \Sigma_k. Following McLachlan, Peel, and Bean (2003), the kth component density of a mixture of multivariate normal distributions is given by

f_k(s_j \mid \theta_k) = (2\pi)^{-p/2} \lvert \Sigma_k \rvert^{-1/2} \exp\left\{ -\tfrac{1}{2} (s_j - \mu_k)' \Sigma_k^{-1} (s_j - \mu_k) \right\}. \quad (2.19)

One of the main advantages of using a finite mixture model is that an FMM can manifest hidden clusters embedded in the streams of data, with the number of components selected by a likelihood ratio test (LRT; Cox & Hinkley, 1974) or by information-based model selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC; Burnham & Anderson, 2002).
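This model-selection step can be sketched in Python with scikit-learn's GaussianMixture, a stand-in for the multivariate normal mixture of Equations 2.18 and 2.19 (the two simulated subpopulations are hypothetical):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two hypothetical multivariate-normal subpopulations of test takers
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 1.0, size=(80, 2))])

# Fit mixtures with K = 1..4 components and compare BIC; each fitted
# component's covariance describes the cluster's volume/shape/orientation
bics = []
for k in range(1, 5):
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         random_state=0).fit(X)
    bics.append(gm.bic(X))
best_k = int(np.argmin(bics)) + 1
print(best_k)  # -> 2
```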
Thus, it would be useful to explore subcategories of aberrant testing behaviors rather than simply focusing on aberrant and normally behaved groups, which could provide more insight for practitioners seeking to understand and investigate specific behavioral groupings. Also, by assuming a multivariate Gaussian density, the FMM can reflect the volume, shape, and orientation of each cluster through the estimated variance-covariance structures. This information could be utilized to understand the characteristics of each identified cluster. For instance, if the fitted Gaussian density contours of the clusters are relatively small and well separated, this would be strong evidence of the existence of distinct clusters. Just the opposite holds if the fitted density contours overlap substantially; the final classification would then be too doubtful to support a decision about the number of clusters. Yet FMM is sensitive to violations of distributional assumptions and is completely exploratory. If the observations do not follow Gaussian distributions, the power to identify the underlying clusters can decrease.

Self-Organization mapping (SOM). The SOM algorithm (also known as a Kohonen Map) is an artificial neural network algorithm in which multidimensional data are mapped to a set of k clusters (or nodes). One of the primary reasons SOM is popular is that the clusters can be mapped to a two-dimensional grid that shows which clusters are similar to each other. This is a valuable tool for visualizing data and validating clusters (Berkhin, 2006).
The SOM algorithm starts with a large learning rate coefficient, which is used to shift the cluster centers for all clusters in a large neighborhood surrounding the winning cluster's center. As time (iterations) progresses, the neighborhood around each cluster shrinks to zero so that nearby clusters are not modified when a cluster is updated, and the clusters themselves are not changed as much by the presentation of new test takers, because the effect of these new test takers is weighted by a decreasing learning algorithm. This is useful in situations where the researcher presents the same cases to the SOM network over and over again to achieve a more stable estimate of the cluster centers. The initial cluster changes are large, with cluster centers being moved substantially by new test takers and by changes in neighboring clusters. As the algorithm runs through its iterations, the learning rate coefficient and the size of the neighborhood shrink until eventually there are only minute, fine-tuning changes to the winning cluster's center (Bullinaria, 2004). The size of the neighborhood and the rate of decrease can be set by the researcher. The rate of decrease may be linear or nonlinear, and the neighborhood may exist for all of the SOM iterations, or it may be defined so that the neighborhood radius shrinks to zero after a set number of iterations have been completed. For example, in the default settings of the som package (Yan, 2016) in the R (R Core Team, 2017) statistical software, the neighborhood's radius is chosen to be larger than 2/3 of the unit-to-unit distances for all of the starting cluster centers. The som package then linearly decreases the radius of the neighborhood over 1/3 of the iterations chosen by the researcher (Wehrens & Buydens, 2007).
If 2/3 of the starting cluster centers are 100 Euclidean distance units away from each other, and the researcher specifies 300 iterations, then the radius of the neighborhood will decrease by one unit at each of the first 100 iterations, after which only the winning cluster's center will be updated. Once the neighborhood radius diminishes to zero, clusters near the winning cluster are no longer updated when cases are reassigned, and the SOM algorithm's solution is then identical to the logic used by the K-Means algorithm (Kohonen, 1982). SOM has several benefits for fraudulent testing behavior detection. First, it displays complex high-dimensional topological relations of the cluster centers in a two-dimensional grid, which can be easily visualized and interpreted for test security. Second, SOM does not rely on any assumptions about the distributions of the data, and the solutions are not heavily influenced by outliers (Wehrens & Buydens, 2007). This is because, unlike K-Means, SOM never calculates a cluster center's coordinates by taking the mean coordinates of all the test takers assigned to the cluster. Instead, the cluster centers are moved incrementally depending on the case considered at each iteration.

2.2.5.2 Supervised Machine Learning Methods

K-Nearest neighbor. K-nearest neighbor (KNN) is a nonparametric classification approach representative of supervised learning algorithms and was first proposed by Fix and Hodges (1951). KNN is a straightforward algorithm that attempts to classify new samples (unlabeled observations) by allocating them to the class of the most similar labeled cases, training the machine to learn a function that captures the relation between the labeled outcome variable and the independent variables. The algorithm starts by specifying the size of the neighborhood (K) of a data point and a distance measure such as Euclidean distance, Manhattan distance, Minkowski distance, or Hamming distance.
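Stepping back to the SOM algorithm, its incremental update (a decreasing learning rate and a neighborhood radius that shrinks to zero after 1/3 of the iterations) can be sketched in plain NumPy; the grid size and the schedules below are illustrative choices, not the som package defaults:

```python
import numpy as np

def train_som(X, grid=(3, 3), iters=300, lr0=0.5):
    """Minimal SOM sketch: decreasing learning rate; neighborhood radius
    shrinking linearly to zero over the first 1/3 of the iterations."""
    rng = np.random.default_rng(0)
    rows, cols = grid
    # coordinates of each unit on the 2-D output grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    W = X[rng.choice(len(X), rows * cols, replace=False)].astype(float)
    r0 = coords.max()                              # initial neighborhood radius
    for t in range(iters):
        lr = lr0 * (1.0 - t / iters)               # decreasing learning rate
        radius = max(r0 * (1.0 - 3.0 * t / iters), 0.0)
        x = X[rng.integers(len(X))]                # present one case
        win = np.argmin(((W - x) ** 2).sum(axis=1))      # winning unit
        d = np.abs(coords - coords[win]).sum(axis=1)     # grid distance to winner
        hit = d <= radius                          # winner and its neighbors
        W[hit] += lr * (x - W[hit])                # move them toward the case
    return W

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(6, 1, (60, 2))])
W = train_som(X)
print(W.shape)  # -> (9, 2)
```

Note that the winner's center is never set to a cluster mean; it is nudged case by case, which is the property that makes SOM comparatively robust to outliers.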
The choice of K has a significant effect on the KNN results. When K is small, the classification decision is less stable, and the boundary separating the different groups is less linear (James, Witten, Hastie, & Tibshirani, 2013). As K increases, the classification results become more stable and the classification boundary more linear, which leads to low within-group variance but high classification bias (James et al., 2013). This parameter can therefore be tuned to optimize the classification results; also, K is usually chosen to be an odd number so that ties can be avoided. Once K is specified, the KNN classifier identifies the K points that are closest to a test observation (a new data point) in the training dataset by computing the defined distance between them, looping through the entire dataset. The conditional probability of the test observation belonging to each class is then estimated. Finally, the new data point is allocated to the class with the largest probability. The process continues until the last test observation has been assigned. Many R packages have been created for running KNN analyses, such as KernelKnn (Mouselimis, 2018), caret (Kuhn, 2017), and class (Ripley, 2018). In the current study, the knn function from the class package was selected because (1) this package is one of the most well-accepted and tested packages for KNN algorithms, and (2) it is also very user-friendly, with detailed instructions and documentation that appear on many data mining training websites. KNN shares many advantages with the K-Means algorithm, such as simplicity and flexibility. In addition, many studies have shown that the KNN method is robust to noisy training data if the training dataset is large enough (e.g., Imandoust & Bolandraftar, 2013; Weinberger & Saul, 2009).
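The classification procedure described above can be sketched in Python with scikit-learn's KNeighborsClassifier, an analogue of the class package's knn function (the labeled training data are hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
# Hypothetical labeled training data: 0 = normal, 1 = aberrant
X_train = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(4, 1, (40, 2))])
y_train = np.array([0] * 40 + [1] * 40)

# K = 5 is odd to avoid ties; in practice K would be tuned, e.g. by
# cross-validation, as discussed above
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean").fit(X_train, y_train)
print(knn.predict([[4.0, 4.0], [0.0, 0.0]]))  # -> [1 0]
```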
Therefore, KNN has the potential to generate a stable mapping function, which can be utilized to make accurate and stable classifications by limiting the influence of potential outliers. However, it suffers from its own limitations. KNN is sensitive to redundant and similar features, which can reduce the classification accuracy (Qian, Yao, & Jia, 2009). In addition, the algorithm has a high computational cost if the training dataset is large, due to calculating the distance from each query to all other inputs in the training dataset (Imandoust & Bolandraftar, 2013).

Random forests. A random forest (RF), a representative ensemble method proposed by Breiman (2001), builds a set of classification and regression trees (CART) and makes predictions by aggregating the predicted results from the individual classification trees. CART, a nonparametric method, recursively segregates the feature space (an n-dimensional vector space associated with all the predictors) into many small rectangular areas. The CART algorithm splits predictors in a binary manner, meaning each split in the tree-building process only generates two sub-nodes from a parent node. In each sub-node, subjects sharing more homogeneous properties are grouped together. This partitioning process, also called impurity reduction, maximizes the difference between the impurity in the parent node and the averaged impurity in the sub-nodes. Several impurity measures, such as the Gini index, are used to quantify the impurity in each sub-node. Each node is continually split until some stopping conditions are achieved. Commonly used stopping rules include (1) a minimum number of subjects left in a node, (2) a minimum change in the impurity measure after a split, and (3) an information criterion such as AIC or BIC. After a tree is built, a finalized classification of all the subjects is predicted in each terminal node.
For the RF method, a set of CARTs is built instead of using a single tree to make a prediction. The rationale is that a classification prediction based on a single tree would be unstable. For example, if the first splitting variable were chosen differently, the predicted results could be altered, especially with a large number of predictors. Moreover, in the RF algorithm, the predictor at each node is randomly selected from a subset of the entire feature space for splitting the trees. In each step of the RF algorithm, either a bootstrap sample or a randomly selected subset of the entire dataset is used. Thus, building a diverse set of trees to serve as a voting committee yields a more stable and less biased classification prediction than using a single tree. Voting here means the final prediction is achieved by averaging (weighted or unweighted) the predicted results from the individual trees. Many other aggregation methods have been developed, such as the Behavior Knowledge Space (BKS) method (Y. S. Huang & Suen, 1995), the Naive Bayes (NB) combination (Domingos & Pazzani, 1997), and Decision Templates (Kuncheva, Bezdek, & Duin, 2001). Choice of aggregation method notwithstanding, the ensemble voting method produces more accurate predictions than a single tree (e.g., Bauer & Kohavi, 1999; Breiman, 1998; Dietterich, 2000). The prediction accuracy can also be checked with an index known as the out-of-bag error rate (Breiman, 1996). Since each tree is built on either a bootstrapped sample or a randomly formed subset of the original dataset, the samples not used in building a given tree are retained and can be utilized for checking its prediction accuracy.
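This out-of-bag logic can be sketched in Python with scikit-learn's RandomForestClassifier, an analogue of the randomForest package used later in this chapter (the two-group data and the noise feature are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
# Hypothetical features: column 0 separates the classes; column 1 is noise
X = np.vstack([np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1, 100)]),
               np.column_stack([rng.normal(4, 1, 100), rng.normal(0, 1, 100)])])
y = np.array([0] * 100 + [1] * 100)

# Each tree is grown on a bootstrap sample; oob_score evaluates every case
# only on the trees that did not see it, giving the out-of-bag accuracy
rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                            random_state=0).fit(X, y)
print(rf.oob_score_ > 0.9)                                       # -> True
print(rf.feature_importances_[0] > rf.feature_importances_[1])   # -> True
```

The feature-importance ranking printed last is the property exploited below for understanding which behavioral features drive a classification.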
The advantage of using the out-of-bag error rate is that it is a relatively more conservative and precise estimate of the error rate, closer to the true classification error in the population than the overly optimistic result obtained by predicting the original dataset (e.g., Boulesteix, Strobl, Augustin, & Daumer, 2008; Breiman, 1996). Many R packages have been created for implementing RF algorithms, such as rpart (Therneau, 2018), tree (Ripley, 2018), and randomForest (Breiman, 1996). In this study, randomForest is used for conducting the analysis. The RF algorithm has many advantages. Unlike other supervised learning methods, it provides a tree-based data representation, which can facilitate a visual understanding of the underlying characteristics of the classified observations. In test security investigations, this graphical representation could be further utilized to understand the behavioral features of aberrantly behaved test takers. Moreover, it also runs efficiently on large datasets and provides a ranking of the importance of all the features. This information could be helpful in investigating, with high efficiency, the key factors for separating aberrant test takers from the normally behaved population. For instance, by applying the RF algorithm, we could examine the momentary response time to each question, which may indicate suspicious problem-solving behavior that reflects a certain degree of pre-knowledge of the items. Furthermore, the RF algorithm can handle higher-order variable interactions, reflecting the more realistic, complex relations among the variables embedded in the dataset. Though RF is one of the more efficient supervised learning algorithms, some studies have shown that RF can overfit its dataset if the stopping rules are not properly set (e.g., Díaz-Uriarte & De Andres, 2006; Segal, 2004).

Support vector machine.
Support Vector Machine (SVM; Vapnik & Lerner, 1963) has gained popularity as a supervised, kernel-function-based classification method used in diverse scientific fields (e.g., Furey, Cristianini, Duffy, Bednarski, & Haussler, 2000; Z. Huang, Chen, Hsu, Chen, & Wu, 2004; Meyer, Leisch, & Hornik, 2003). The SVM algorithm attempts to create an optimal separating boundary (a line, plane, or hyperplane) by using a kernel function (linear or nonlinear) that divides the feature space (an n-dimensional space for the predictors) such that the margins are maximal. In this regard, this boundary is the best solution out of an infinite number of possible segregating boundaries. The optimal separating boundary, also known as the maximal margin hyperplane, is formed by maximizing the distance between it and the training subjects. The maximal margin hyperplane is found by computing the perpendicular distance from each subject to a given separating boundary. The smallest such distance is called the margin. As its name suggests, the maximal margin boundary is the separating hyperplane for which the margin is largest. The maximal margin here is also known as the hard margin, which means all the training subjects lie perfectly on either side of the hyperplane without any misclassification. Once the maximal margin hyperplane is constructed from a training dataset, a new test subject can be classified later based on which side of the hyperplane it is located. The hard margin hyperplane, however, is quite sensitive to a change in a subject's data, which may be due to over-fitting the training dataset. Thus, a hyperplane that does not perfectly separate all the cases is worthy of attention. This kind of classification hyperplane is also referred to as the soft margin hyperplane, which is more robust to changes in an individual subject's data. The general support vector classifier can be represented as

f(x) = b + \sum_{i \in S} \alpha_i K(x, x_i), \quad (2.20)

where K(x_i, x_{i'}) is a kernel function that quantifies the similarity of two observations, S is the collection of indices of the support points, and \alpha_i and b are parameters to be estimated. A simple binary classification example is introduced to help clarify the hard and soft approaches. Suppose a set of n training subjects on p variables, x_1, ..., x_n \in R^p, marked with the labels y_1, ..., y_n \in \{-1, 1\}, is to be classified into two groups by a linear high-dimensional hyperplane defined as

y_i \left( b + \sum_{i'=1}^{n} \alpha_{i'} \sum_{j=1}^{p} x_{ij} x_{i'j} \right) = 0. \quad (2.21)

In order to find the maximal margin hyperplane, the equation above is optimized so as to maximize M, subject to

\sum_{i=1}^{n} \alpha_i^2 = 1, \quad (2.22)

and

y_i \left( b + \sum_{i' \in S} \alpha_{i'} K(x_i, x_{i'}) \right) \ge M \quad \text{for all } i = 1, ..., n, \quad (2.23)

where M represents the margin of the hyperplane. This is the hard margin case, requiring each subject in the training set to be on the right side of the hyperplane with a margin of at least M. The soft margin case simply extends this optimization by again maximizing M, subject to

\sum_{i=1}^{n} \alpha_i^2 = 1, \qquad y_i \left( b + \sum_{i' \in S} \alpha_{i'} K(x_i, x_{i'}) \right) \ge M(1 - \epsilon_i), \qquad \epsilon_i \ge 0, \quad \sum_{i=1}^{n} \epsilon_i \le C,

where C is a positive tuning parameter that determines the degree of tolerance for subjects that violate the margin. If C = 0, the soft margin reduces to the hard margin case. The slack terms \epsilon_i (i = 1, ..., n) allow some subjects to be on the wrong side of the margin or of the hyperplane; for instance, if \epsilon_i > 1, then the ith observation is on the incorrect side of the hyperplane. Among the many advantages offered by SVM, one of the main benefits is the flexibility to select different kernel functions to adequately address practical problems in different modeling scenarios. By applying a kernel appropriate to the specific scenario, the performance of SVM can be dramatically improved. For example, polynomial or other nonlinear kernel functions may be used when the cluster labels and features are nonlinearly related.
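The effect of the kernel choice can be illustrated in Python with scikit-learn's SVC (note that sklearn's C parameter is a penalty weight rather than the slack budget C above; the ring-shaped toy data are hypothetical):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# A nonlinearly separable toy problem: class 1 forms a ring around class 0
r = np.concatenate([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([0] * 100 + [1] * 100)

# A radial basis function (RBF) kernel can carve out the nonlinear
# boundary that no linear hyperplane in the original space can match
linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(linear.score(X, y) < rbf.score(X, y))  # -> True
```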
Many kernels have been created for specific cases, such as natural language processing (e.g., string kernels), speech recognition (e.g., time-alignment kernels), and image processing (e.g., the histogram intersection kernel). SVM is thus a flexible platform for identifying aberrance by incorporating more types of data into the current detection framework. For example, writing strings and speech data could be jointly modeled with other psychometric variables, such as item responses and response times, by applying appropriate kernel functions. However, to yield accurate results with SVM, the tuning parameter and the type of kernel function must be set properly. In this respect, SVM is relatively harder to implement than other supervised learning methods. All the previously mentioned clustering algorithms are processes that use observed similarities or densities in data to identify patterns or group similar observations (Berkhin, 2006). These data mining methods will be utilized to identify clusters of respondents (e.g., such as those who exhibit fraudulent test-taking behaviors). The advantages and disadvantages of applying each algorithm, as they relate to aberrant test-taking detection, are presented in Table 2.3.
Table 2.3: Advantages and disadvantages of various data mining methods for detecting aberrant test-taking behaviors

Unsupervised methods

K-Means
Pros: easy to implement; computationally efficient with a high-dimensional dataset.
Cons: relies on a predefined distance; requires users to determine the number of clusters; sensitive to the initial cluster centers.

Gaussian finite mixture
Pros: manifests hidden clusters; useful to explore subcategories; shows cluster features (volume, shape, and orientation).
Cons: sensitive to violations of distributional assumptions; completely exploratory.

Self-organization mapping
Pros: easy to display complex relations of clusters; no distributional assumption.
Cons: relies on a predefined distance; requires users to define various parameters (e.g., map size, learning rate).

Supervised methods

K-nearest neighbor
Pros: easy to implement; effectively robust to noisy data; computationally efficient.
Cons: sensitive to redundant and similar features; high computational cost.

Random forest
Pros: facilitates a visual exposition; runs efficiently on large datasets; handles higher-order variable interactions.
Cons: prone to over-fitting.

Support vector machine
Pros: flexible to apply different kernel functions.
Cons: hard to implement; computationally expensive.

2.3 Incorporating Biometrics to Detect Aberrant Testing Behaviors

Many methods have been developed to detect various aberrant test-taking behaviors, and these methods have certainly shown success in flagging aberrant test takers. However, they are potentially limited by their input data, which are either item responses or response times. To improve detection accuracy beyond the traditional modeling framework, ways of incorporating real-time biometric information into traditional detection methods will be introduced.
2.3.1 Insights into Problem-Solving Using Eye Tracking

Eye tracking, as an essential biometric technology, can provide unique insights when assessing students' cognitive processes during computerized problem-solving tasks. Eye tracking can record temporal and spatial human eye movements, which are a natural information source for proactive systems that analyze user behavior. Moreover, eye tracking can also collect information about the location and duration of an eye fixation within a specific area of a computer monitor, which can serve as critical supplementary data in identifying cheating behaviors. At its core, eye tracking is the measurement of eye activity. Capturing such information in the context of large-scale assessment testing scenarios may help address and answer some interesting questions related to aberrant testing behavior detection, such as: (a) Where does an examinee look, and what does this information tell us about aberrant testing behavior? (b) When does blinking occur, and what information does that convey about the examinee's behavior? (c) How does the pupil react to different stimuli? (d) What are the differences between the eye-gaze patterns of normally behaved and aberrantly behaved test takers? In contrast, item responses and response times cannot provide information about eye-gaze patterns. Holmqvist et al. (2011) classified most eye tracking indicators into four groups: eye movement, gaze position, numerosity, and latency. Movement indicators (e.g., saccadic direction, saccadic length) reflect the properties of eye gaze paths, such as direction, amplitude, and velocity (e.g., Lee, Badler, & Badler, 2001; Motter & Belky, 1998; Ponsoda, Scott, & Findlay, 1995; Tatler & Vincent, 2008). Position indicators (e.g., fixations, dwells) address questions such as where a test taker looks.
Numerosity indicators (e.g., fixations, dwells, blink rate, and regression rate) quantify eye movement-related events in absolute numbers or proportional rates (Holmqvist et al., 2011). Latency indicators are mostly related to reaction time, which captures the time from the onset or offset of a stimulus/event to a specific reaction of the eyes (e.g., Born & Kerzel, 2008b; Shepherd, Findlay, & Hockey, 1986). For example, after an item appears on the computer screen, how soon does a test taker catch the first keyword? All these eye gaze indicators play important roles in understanding and classifying different test-taking behaviors. Some important eye-tracking indicators are discussed in detail in the following subsections.

2.3.1.1 Fixation

In educational assessment and testing, eye fixation reflects the degree of the test-taker's attention to specific words embedded in the items. In the context of test-taking behavior, fixation is a measure of the temporary eye stoppage at a word of an item or at part of a graphical instruction while a test taker is solving a question. Several fixation-related measures are frequently studied in the literature, including fixation counts, fixation rate, fixation duration, and fixation locations. Many of these measures are used for assessing subjects' information perception abilities, such as reading and problem solving. For instance, in reading ability assessment studies, Born and Kerzel (2008a) found that a reader's fixation duration was longer for more difficult and less frequent words compared with commonly used words. Similar findings have been reported in usability studies. For instance, Born and Kerzel (1999) found that long fixations could indicate difficulty in extracting information for problem solving. Furthermore, longer fixations also indicate a relatively high level of content engagement, which is likewise a reflection of a high level of interest expressed by a student (Jacob & Levitt, 2003).
2.3.1.2 Pupil Diameter

Pupil diameter can be utilized to reflect the degree of fatigue, the level of interest in particular learning content, and the amount of workload of test takers involved in a specific cognitive task. Many studies have reported a negative correlation between levels of fatigue and pupil size (e.g., Lowenstein, 1962; Morad, Lemberg, & Dagan, 2000; Yoss, Moyer, & Hollenhorst, 1970). For instance, Morad et al. (2000) found that measured pupillary diameters differed significantly between fatigued (24 hours of sleep deprivation) and clear-headed groups reacting to a controlled visual stimulus. This difference indicates that changes in pupil diameter can serve as an objective measure of fatigue. Moreover, some studies have shown that emotional arousal can be an important factor modulating pupils' reactions. For instance, Zubin and Steinhauer (1983) found that pupil diameter was enlarged when pleasant and unpleasant pictures were presented to the experimental participants. Furthermore, other studies have demonstrated that pupil diameter can be a useful event-related measure of cognitive load (Hess & Polt, 1964; van Gerven, Paas, van Merriënboer, & Schmidt, 2002). This effect has been observed for tasks such as content comprehension (Just & Carpenter, 1993), visual searching (Porter, Troscianko, & Gilchrist, 2007), and mental number calculation (Hess, 1965).

2.3.1.3 Blinking

One seemingly involuntary function of the eye, blinking, is to keep the eyeball moist. In addition, blinking is highly related to other cognitive functions, as in reflex blinking. Reflex blinks are reactions to external stimuli that serve various purposes, such as protecting our eyeballs or maximizing our attention on a subject. Blink rate, a commonly used measure of blinking, is defined as the number of blinks per given amount of time.
Studies have shown that blink rate is positively associated with the number of simultaneous tasks (Barbato, della Monica, Costanzo, & De Padova, 2012; Colzato, Slagter, van den Wildenberg, & Hommel, 2009). In contrast, other studies have found that people tend to reduce their blink rates when performing visually demanding tasks. For example, Benedetto et al. (2011) found that drivers' blink rates decrease with higher visual demand, which indicates a reallocation of cognitive resources. Fairclough and Venables (2006) reported similar findings: blink rate correlated negatively with task engagement. Researchers have suggested that this negative relationship may be due to the fact that people try to reduce the risk of missing key information during visually engaging tasks (Drew, 1951; Kennard & Glaser, 1964).

2.3.1.4 Saccades

Saccades, an eye movement measure closely related to fixation, reflect the motion of the eyes from one fixation to another. The amplitude of the movement can vary from small jumps from one word to another to wide-reaching searches made while looking around a stadium. This measure can help in understanding general characteristics of eye gaze paths, such as direction, length, and dispersion. For instance, many reading assessment studies (e.g., Kuperman & Van Dyke, 2011; Rayner & Liversedge, 2011) have shown that deficient readers have different eye gaze paths from efficient readers; their gaze paths are relatively shorter and more scattered (Vinuela-Navarro, Erichsen, Williams, & Woodhouse, 2017).

2.3.1.5 Regression

Regression refers to events that involve motion of the eye in the direction opposite to the text. It often reflects events related to re-reading and answer checking. Vitu (1991) classified regressions into two types: long-range regressions (LRRs) and short-range regressions (SRRs). An LRR means the eye gaze moves backward over several words or even sentences.
SRRs often refer to short and rapid backward movements. Born and Kerzel (2008a) indicated that the occurrence of long-range regressions might be due to the fact that readers have missed, forgotten, or been unclear about what they have read. For example, some studies (e.g., Blanchard & Iran-Nejad, 1987; Booth & Weger, 2013; Inhoff, Greenberg, Solomon, & Wang, 2009; Rayner, Murphy, Henderson, & Pollatsek, 1989) have shown that when items are less familiar, ambiguous, or complex, the regression rate increases in order to reinstate or reconfirm a cognitive effort. For SRRs, some studies have indicated that they are highly related to how much effort or care a reader devotes to a reading task. Coëffé and O'Regan (1987) found that subjects produced more SRRs in order to increase the accuracy of registering the content information. The relationship between these eye-tracking measures is shown in Figure 2.1.

Figure 2.1: Relationship between several representative indicators

2.3.2 Representative Ways to Integrate Process Data into Psychometric Methods to Identify Aberrant Test Takers

Many testing programs have transitioned from paper-and-pencil tests to computer-based or computer-adaptive testing, which allows multimodal data to be collected simultaneously during the exam. The collected multimodal data include three types. Product data are the outcomes of the assessment, such as item responses or test scores. Process data reflect how a test taker forms his or her final answer for an item or task, such as response times and mouse cursor movements, usually recorded in a log file. Biometric data are special cases of process data, such as eye-tracking indicators or heart rate collected via sensors. To understand the relationship between process data (e.g., RTs) and product data (e.g., item responses), many methods have been proposed.
These methods incorporate process data into traditional psychometric models, such as the Rasch model or the 2PL IRT model, which are used for measuring latent abilities. In particular, RT has been used as ancillary information to understand test takers' performance better than modeling the item responses alone. Since biometric data can serve the same role as RT, some representative methods of integrating RT into traditional psychometric models are introduced here; they provide insights into how to incorporate biometric indicators later.

2.3.2.1 Incorporating RT as a Variable Into the Item Response Model

The simplest way to incorporate RT into traditional item response models is to add RT as an individual variable. Many methods have been proposed to achieve this goal (e.g., Luce, 1986; Roskam, 1997; Thissen, 1983; Verhelst, Verstralen, & Jansen, 2013).

Roskam's model. Roskam, one of the pioneers, proposed a model that adds the log-transformed RT as a single term to the IRT model. His model reflects the trade-off between the amount of time a test taker spends and the difficulty level of a specific item. The model is defined as

P_i(\theta_j) = \frac{1}{1 + \exp[-(\theta_j + \ln T_{ij} - b_i)]}, \qquad (2.24)

where \theta_j is the person latent ability parameter, \ln T_{ij} is the log-transformed response time, and b_i represents the item-specific difficulty parameter. The difference \ln T_{ij} - b_i shows the trade-off between how much time a person spends working on a particular item and the difficulty level of that item: a test taker would be expected to spend more time on a harder question than on an easy one, and vice versa.

Thissen's model. Another well-known attempt to incorporate RTs into the IRT model is the method proposed by Thissen (1983). His model treats the log-transformed RT as a dependent variable, as opposed to an independent variable.
The log-transformed RT is regressed on a parameter structure similar to the parameterization of the two-parameter logistic (2PL) IRT model. The difference is that new terms are added to reflect the slowness of person j and item i, respectively. The model is defined as

\ln T_{ij} = \mu + \tau_j + \beta_i - \rho\, a_i(\theta_j - b_i) + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2), \qquad (2.25)

where \mu represents the grand mean level over the population of test takers and the test item domain; \beta_i and \tau_j are "slowness parameters" for item i and person j, respectively; and \rho is a regression coefficient that indicates the degree of association between the log-transformed response times and the log odds of a correct response for person j and item i. The log odds of a correct response is calculated from the 2PL model, parameterized as a_i(\theta_j - b_i), where a_i is the item discrimination parameter, b_i is the item difficulty parameter, and \theta_j represents the person-side latent ability parameter.

Later, Ferrando and Lorenzo-Seva (2007) extended Thissen's model in Equation 2.25 to accommodate the special needs of personality assessment. The updated model is given by

\ln T_{ij} = \mu + \tau_j + \beta_i - \rho\, \sqrt{a_i}(\theta_j - b_i) + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2). \qquad (2.26)

The difference between the models in Equations 2.25 and 2.26 is the parameterization of the item parameter structure: instead of using a_i(\theta_j - b_i), Ferrando and Lorenzo-Seva (2007) use \sqrt{a_i}(\theta_j - b_i). Both models, however, reflect the trade-off between working speed and responding accuracy. For example, for \rho larger than 0, the model in Equation 2.26 implies that test takers with higher abilities use less time than those with relatively lower abilities, and vice versa.

2.3.2.2 Joint Modeling of Item Responses and Response Times

The previous section introduced several representative methods for directly adding RT as ancillary information to IRT models.
RT is either treated as an independent variable or as a dependent variable. However, instead of treating RT as a covariate, RT itself can be modeled to either manifest different test takers' working speed or reveal the corresponding characteristics of the test items, such as the response time an item requires and the item's discrimination power. Therefore, by jointly modeling RTs and item responses, the relationship between working speed and responding accuracy can be examined through either the person-side or the item-side model parameters.

To jointly model item responses and RTs, it is essential to have a model that fits the response time data properly. To this end, many RT models have been proposed that improve model fit by applying various distributions, such as the exponential, log-normal, gamma, and Weibull distributions (e.g., Maris, 1993; Roskam, 1997; Schnipke & Scrams, 1997; Thissen, 1983; W. van der Linden, Scrams, & Schnipke, 1999). Among these, the log-normal response time model proposed by van der Linden (2006) has drawn much attention among researchers because its parameters follow a structure similar to the 2PL IRT model and are therefore easy to interpret. The log-normal RT model is defined as

f(t_{ij}; \tau_j, \alpha_i, \beta_i) = \frac{\alpha_i}{t_{ij}\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}\big[\alpha_i\{\ln t_{ij} - (\beta_i - \tau_j)\}\big]^2 \right\}, \qquad (2.27)

where the latent parameter \tau_j \in \mathbb{R} represents the working speed of test taker j; the item parameter \beta_i \in \mathbb{R} denotes time intensity or, simply, the amount of time required for answering a specific item; and \alpha_i > 0 is an item time discrimination parameter. The mean of \ln t_{ij} is parameterized as \mu_{ij} = \beta_i - \tau_j. In the following section, methods for jointly modeling RTs and item responses are introduced.
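As a quick numerical sketch of the log-normal RT density in Equation 2.27 (the parameter values below are hypothetical, chosen only for illustration):

```python
import math

def lognormal_rt_density(t, tau_j, alpha_i, beta_i):
    """Log-normal RT density of Equation 2.27: tau_j is working speed,
    alpha_i is the time discrimination, beta_i is the time intensity."""
    z = alpha_i * (math.log(t) - (beta_i - tau_j))
    return alpha_i / (t * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)

# A faster test taker (larger tau_j) concentrates density at shorter RTs,
# so at t = 5 seconds the faster worker has the higher density:
d_fast = lognormal_rt_density(5.0, tau_j=0.5, alpha_i=2.0, beta_i=2.0)
d_slow = lognormal_rt_density(5.0, tau_j=-0.5, alpha_i=2.0, beta_i=2.0)
```

The sketch makes the role of \tau_j concrete: increasing the speed parameter shifts the mean of \ln t, \mu_{ij} = \beta_i - \tau_j, downward, shortening the expected response time.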
The differences among these methods lie in their assumptions about how the item responses are related to the response times. Typically, two types of assumptions are made when jointly modeling item responses and RTs. van der Linden (2007) assumes that the item responses and response times are independent after conditioning on their corresponding higher-level random effects, namely the person latent parameters and the item parameters. In other words, by modeling the higher-level random effects on the person side and the item side separately, the item responses and response times are rendered independent of each other. This is the most appealing approach to jointly modeling item responses and response times, and it is defined as

\begin{pmatrix} Y_{ij} \\ \log T_{ij} \end{pmatrix} \sim N\left( \begin{pmatrix} \theta_j - b_i \\ \beta_i - \tau_j \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & \sigma_i^2 \end{pmatrix} \right), \qquad (2.28)

where Y_{ij} and \log T_{ij} are the item response and the log-transformed response time of person j on item i, respectively; \theta_j and \tau_j are the latent person parameters; b_i and \beta_i are the item difficulty and item time intensity parameters, respectively; and \sigma_i^2 is the time residual variance of item i. The off-diagonal elements of the variance-covariance matrix are set to 0, which implies conditional independence of item responses and response times once the patterns in the data are captured by the mean structures. Figure 2.2 visualizes the modeling framework under this first assumption.

The second assumption is that conditional dependencies (CDs) exist among the residuals of the item responses and the log-transformed response times, given the person- and item-side structural relationships. These dependencies may arise from two sources of variation when students take their tests: (1) between-person variability across items, and (2) within-person variability across items. The between-person variation reflects distinct test-takers' responding behaviors.
Instead of assuming that all test takers respond homogeneously (i.e., that all are normally behaved test takers), we assume that some test takers answer questions in aberrant ways, such as cheating on their tests or responding carelessly due to low motivation.

Figure 2.2: Conditional independence of item responses and response times given latent ability and speediness: \theta and \tau are the latent ability and speediness; T^*_i are the log-transformed response times; \epsilon_{x_i} are the item response residuals

Nevertheless, it is still assumed that each test taker works through his or her test at a constant speed across all items with an invariant cognitive capacity. For example, D. Molenaar, Bolsinova, Rozsa, and De Boeck (2016) proposed a mixture modeling approach to investigate the intraindividual variation in responses and response times:

\begin{pmatrix} Y_{ijk} \\ \log T_{ijk} \end{pmatrix} \sim N\left( \begin{pmatrix} \theta_{jk} - b_{ik} \\ \beta_{ik} - \tau_{jk} \end{pmatrix}, \begin{pmatrix} 1 & 0 \\ 0 & \sigma_{ik}^2 \end{pmatrix} \right), \qquad (2.29)

where k indicates the kth latent class; Y_{ijk} and \log T_{ijk} are the item response and the log-transformed response time of person j on item i in latent class k, respectively; \theta_{jk} and \tau_{jk} are the latent person parameters in latent class k; b_{ik} and \beta_{ik} are the item difficulty and item time intensity parameters in the kth latent class, respectively; and \sigma_{ik}^2 is the time residual variance of item i in latent class k. Figure 2.3 is a schematic of this modeling framework that depicts the relations between the measured and latent variables.

Figure 2.3: A mixture modeling approach to investigate the intraindividual variation in responses and response times: C_k indicates the kth latent class; \theta and \tau are the latent ability and speediness; T^*_i, i = 1, ..., I, are the log-transformed response times; \epsilon_{x_i}, i = 1, ..., I, are the item response residuals

In contrast, the within-person variability across items indicates that a test taker may change his or her working speed, with varying cognitive capacity, when answering different test items. Simply put, within-person variability refers to the fact that a test taker's performance may vary from one item to another. Within-person variability can arise for several reasons, including (1) test fatigue (Ackerman & Kanfer, 2009; Ackerman, Kanfer, Shapiro, Newton, & Beier, 2010); (2) motivation changes (Wise & Kong, 2005); (3) guessing behaviors (Slakter, 1968); and (4) the application of various problem-solving strategies (van der Maas & Jansen, 2003).

One representative method that addresses within-person variation is the approach proposed by Meng, Tao, and Chang (2015). They assumed that the item residual and the response time residual are correlated due to within-person variability. The model can be represented as

\begin{pmatrix} Y_{ij} \\ \log T_{ij} \end{pmatrix} \sim N\left( \begin{pmatrix} \theta_j - b_i \\ \beta_i - \tau_j \end{pmatrix}, \begin{pmatrix} 1 & \sigma_{1,2} \\ \sigma_{1,2} & \sigma_i^2 \end{pmatrix} \right), \qquad (2.30)

where \sigma_{1,2} indicates the covariance between the item residual and the response time residual. Figure 2.4 is a path diagram of this modeling framework that shows the relations between the measured and latent variables.

Figure 2.4: A conditional joint modeling approach for locally dependent item responses and response times

Thus far, several joint modeling approaches for item responses and response times have been introduced, which provide insights into how to incorporate biometric variables into psychometric modeling frameworks. For instance, new models based on the gaze fixation counts collected via an eye tracker could be proposed to reflect the degree of test engagement when a test-taker solves a set of task questions.
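The contrast between the conditionally independent structure (Equation 2.28) and the locally dependent structure (Equation 2.30) can be illustrated by simulation; the following is a minimal sketch with hypothetical parameter values, writing the covariance term as \sigma_{1,2} = \rho\sigma_i:

```python
import math
import random

random.seed(1)

def simulate_pairs(theta, tau, b, beta, sigma, rho, n=20000):
    """Draw (Y*_ij, log T_ij) pairs from the bivariate normal structure of
    Equations 2.28/2.30. rho = 0 reproduces the conditional-independence
    model (Eq. 2.28); rho != 0 induces the residual dependence of Eq. 2.30."""
    pairs = []
    for _ in range(n):
        z1, z2 = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
        y_star = (theta - b) + z1  # continuous response residual, variance 1
        log_t = (beta - tau) + sigma * (rho * z1 + math.sqrt(1.0 - rho**2) * z2)
        pairs.append((y_star, log_t))
    return pairs

def sample_corr(pairs):
    """Pearson correlation of the simulated pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return sxy / (sx * sy)

ci = sample_corr(simulate_pairs(0.5, 0.2, 0.0, 2.0, 0.4, rho=0.0))
cd = sample_corr(simulate_pairs(0.5, 0.2, 0.0, 2.0, 0.4, rho=0.5))
# ci is near 0 (conditional independence); cd is near 0.5.
```

Under the first assumption the residual correlation vanishes once the person and item effects are modeled; under the second, the nonzero off-diagonal \sigma_{1,2} leaves a residual association that a detection method could exploit.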
Also, the parameterization used for modeling RTs demonstrates how other biometric variables could be modeled so as to capture both individual differences and item characteristics with respect to the variables of interest. Moreover, the joint modeling approaches for item responses and response times introduced above could be extended with additional biometric variables, providing a comprehensive assessment of test takers' performance.

2.4 Future Directions and Challenges

Test security has been researched over the past several decades (Cizek, 1999). Cheating and other kinds of aberrant test-taking behaviors raise concerns about the validity of decisions made on the basis of estimated examinee scores. In order to maintain fairness among all test-takers, it is important not only to flag improper behaviors but also to take action against them. To flag such behaviors with high accuracy, statistical models should rest on more than one source of information, such as item responses alone. In the future, all sources of information about test-takers, including those from bio-information technologies, will be aggregated using highly efficient computational methods such as cloud computing. Mislevy et al. (2016), at the Maryland Assessment Research Center (MARC) conference, indicated that new forms of assessment will involve psychometric models, bio-information, machine learning, and data mining methods on a high-efficiency computation platform. Many methods based on "big data" are currently under development. Man, Harring, and Sinharay (2019) and Thomas (2016) have applied support vector machines, a data-mining method, to pre-knowledge cheating detection. He and von Davier (2015) proposed a statistical feature selection learning method that identifies features in process data for classifying different learning patterns. More testing formats will appear, such as online testing and game-based testing.
More novel item types, such as multi-part directionally dependent items, will replace currently used measures. Growth in learning will not simply be measured at several time points; it will be monitored constantly over time. How to adapt psychometric methods to maintain the validity of the inferences from an assessment is one of the biggest challenges and directions for the future. Mueller, Zhang, and Ferrara (2016) summarized four challenges that must be overcome in the next generation of test-security research. The first is dealing with a low signal-to-noise ratio (SNR): not all aberrant test-taking behaviors, such as those stemming from tiredness or creative responding, are harmful to the inferential claims (i.e., validity) of the assessment. Finding ways to classify behaviors with high sensitivity is an important research direction. The second challenge is understanding effect sizes: how far can the data depart from expectation before the departure can truly be considered cheating, and how can abstract metrics be translated into more understandable ones? The third challenge is one that still haunts psychometrics: how to explain and convey results to laypersons. Stakeholders have different interests and levels of understanding of how aberrant test-taking behaviors affect their decisions. How to present methods and results in a logical and comprehensible manner must be considered and taken seriously. To this end, more data visualization methods need to be developed to aid in this endeavor. The fourth challenge is how to incorporate justice into the decision-making process. This would necessarily be improved by advancing methods that increase the accuracy of classification between aberrant and normal test-takers. Equally important in this endeavor is controlling the rate of false negatives. With more accurate measures and more informative evidence, results from psychometric models could be incorporated into this justice system.
2.5 Conclusion

The present literature review has provided a focused overview of several aberrant test-taking behavior detection methods, including unexpected score gain analysis, erasure analysis, similarity analysis, person fit analysis, data mining methods, and eye-tracking measures. Many shortcomings and limitations remain in the proposed methods, primarily because real data analytic situations are often more complicated and chaotic than models can predict. In 1976, George Box observed that all models are wrong, but many may nonetheless be useful, especially when parsimonious (p. 202). All the models and indices reviewed here have their own advantages in helping us understand the underlying issues. Although none of the methods is perfect, they provide the foundation on which improvements in each aspect of test security can be made. Future research in the field of large-scale testing will be both challenging and full of promise.

Chapter 3: Methodology

In this chapter, the experimental design and the new methods for incorporating bio-information, along with their use in detecting aberrant behaviors, are introduced. First, the experimental design for data collection is illustrated. Next, three negative binomial distribution-based visual fixation counts models are presented; these models are used for assessing differences in visual attention among test-takers. Furthermore, a joint modeling approach for integrating product data, process data, and biometric information is described. By jointly modeling the three types of information, we can assess test-takers' performance in a comprehensive way. Lastly, various data mining methods are used for classifying different types of test takers.
3.1 Experimental Design

3.1.1 Data Collection

In the proposed study, 298 students over 18 years old were invited to the eye-tracking lab to take an exam that mimicked a high-stakes assessment. All participants were enrolled through the UMD Psychology SONA system to avoid any selection bias based on race, gender, major, etc. The SONA system is an online platform for enrolling participants in psychological studies. Only participants with normal or corrected-to-normal vision were recruited. All enrolled participants took an exam containing 20 multiple-choice items from a high-stakes test offered by ETS: ten verbal reasoning questions and ten quantitative methods questions.

3.1.2 Experimental Conditions

There were three conditions in the proposed research design: (1) participants in the control condition did not receive any test preparation materials, (2) participants in the second condition received questions that were similar to those on their exam, and (3) participants in the third condition received similar exam questions along with the answer key. Participants were randomly assigned to the experimental conditions in order to minimize internal and external threats to validity. The data from all three conditions were then combined to allow the researcher to conduct blind statistical classification of examinees. Background information, including motivation, test anxiety, Big Five personality traits, morality, and religiosity, was also collected; most of these variables were later used as input data for classifying different test-taking behaviors. This study was fully approved by the University of Maryland institutional review board (IRB).

3.1.3 Data Recording

All test items were presented as slides converted to a .pdf file, with one item per slide.
Line spacing was at least doubled to accommodate the eye-tracker's accuracy level (0.5-1 degree of visual angle). Test takers' eye movements were recorded at 60 Hz with the Gazepoint eye-tracking system, which was placed on a firm, large table beneath a monitor (1024 by 768 resolution; 17-inch LCD). The recording area was about 20-25 square meters, without windows, to minimize direct and ambient sunlight, and the recording room was inside a suite with limited surrounding noise.

3.2 New Test Engagement Model Based on Visual Fixation Counts

To accurately classify different types of test takers, it is important to select an eye-tracking variable indicating the degree of visual effort a test taker puts into an item. Among all the collected eye-tracking variables, eye fixation counts can be used to quantify visual engagement while a test taker performs the test (e.g., Jacob & Levitt, 2003; Poole, Ball, & Phillips, 2004). For example, in human-computer interaction and usability research, Poole et al. (2004) showed that increased gaze fixation counts on a visual area of interest indicate that the area is more essential and more noticeable to the subject than other visual areas. Similar results were reported by Justice and Lankford (2002) and by Roy-Charland, Saint-Aubin, Klein, and Lawrence (2006). Given this systematic relation between test engagement and visual fixation counts, a model measuring the cognitive connection between test takers' latent visual engagement and the observed visual fixation counts appears warranted. Such a model could help in understanding individual differences in visual effort, which may prove to be an important feature for distinguishing aberrant test takers from normally behaved ones. To model the relation between gaze fixation counts and test engagement, a negative binomial fixation (NBF) model was proposed and fitted to real data gathered as part of an experiment.
A Bayesian estimation approach via Markov chain Monte Carlo (MCMC) was used to estimate the model parameters. In this study, the negative binomial distribution was chosen for modeling the visual fixation counts. Unlike the Poisson distribution for count data, which forces the variance to equal the mean, the negative binomial distribution allows the data to be overdispersed, with variance exceeding the mean. In addition, the negative binomial distribution is sufficiently flexible that, for example, the mean structure can be parameterized in useful ways that incorporate latent person as well as item parameters. Several structures for the latent person parameters are introduced that are parsimonious and, unlike other studies that assume constant engagement levels across items (Fox & Marianti, 2016), accommodate systematic change across items reminiscent of the implied mean structure in latent growth models.

3.2.1 The Negative Binomial Fixation Model

The NBF model was designed to reflect item quality and a test taker's engagement level across a number of items, i (i = 1, ..., I), on an assessment. The proposed NBF model follows a negative binomial distribution, which can be parameterized in various ways. A flexible, yet conventional, parameterization defines the negative binomial distribution as the number of failures (X) before the sth success, with probability mass function (pmf)

P(X = x \mid s, p) = \frac{\Gamma(x + s)}{x!\,\Gamma(s)}\, p^s (1 - p)^x, \qquad (3.1)

where p is the probability of success in each Bernoulli trial (p \in [0, 1]) and s > 0 denotes the shape parameter. The expectation of the random variable X is E(X) = s(1 - p)/p, and its variance is Var(X) = s(1 - p)/p^2. Instead of parameterizing the negative binomial distribution in terms of s and p, a convenient parameterization utilizes the relation between p and the expectation of the negative binomial distribution, \mu.
Through algebraic manipulation, parameter p can be expressed as a combination of \mu and s as

p = \frac{s}{s + \mu}. \qquad (3.2)

In Equation 3.2, the mean \mu of the negative binomial distribution can be further decomposed into a structure that separates the latent person effect (\theta_j) and the item parameter effect (m_i) as

\mu_{ij} = \exp(m_i + \theta_j), \qquad (3.3)

where \mu_{ij} = \exp(m_i + \theta_j) > 0, \theta \in \mathbb{R}, and m \in \mathbb{R}. In summary, the NBF model is expressed as

P(x_{ij} \mid s_i, m_i, \theta_j) = \frac{\Gamma(x_{ij} + s_i)}{x_{ij}!\,\Gamma(s_i)} \left( \frac{s_i}{s_i + \exp(m_i + \theta_j)} \right)^{s_i} \left( \frac{\exp(m_i + \theta_j)}{s_i + \exp(m_i + \theta_j)} \right)^{x_{ij}}. \qquad (3.4)

Parameter m_i is associated with the test and can be interpreted as the visual intensity of item i. The presumption is that this parameter represents the amount of cognitive engagement a student tends to exert on a test. The person-specific parameter \theta_j, for each of the J test takers (j = 1, ..., J), denotes the overall test engagement level of test taker j and is assumed, at least initially, to be constant across all items. Furthermore, a discrimination parameter \delta_i for item i is defined as the inverse of \sigma_i, where \sigma_i = \sqrt{\mu_i + \mu_i^2/s_i}. Thus, \delta_i reflects the overall dispersion of the fixation counts on item i. Larger values of \delta_i lead to steeper slopes of the pmf of the negative binomial distribution, while smaller values of \delta_i correspond to shallower slopes. Thus, for any given value of the engagement intensity parameter m_i, the difference (i.e., \Delta\mu) between any two values of engagement level, say \theta_1 and \theta_2, from two test-takers would be larger for an item with a large \delta_i value, indicating that the item is more discriminating than an item with a smaller \delta_i value. To apply this model properly, several assumptions need to be satisfied. First, it is assumed that fixation counts for each item solely reflect the different levels of visual engagement as students work through their tests.
For instance, averaging across all test takers, an item with more fixation counts would indicate a higher level of visual effort required for solving that item than an item with fewer fixation counts. Also, fixation counts are assumed to be independent of each other conditional on the latent test engagement parameter (\theta). Furthermore, the test engagement level (\theta_j) is assumed to be constant across items; this assumption can, however, be relaxed by parameterizing the mean structure to incorporate linear or quadratic change across items.

Figure 3.1 represents the fixation count model (constant test engagement across all tasks). Tests are solved in sequential order from item 1 to item I. As is customary in path diagrams from structural equation modeling, circles indicate latent variables, in this case \theta, denoting latent engagement. The squares are measured indicators of the latent variable; in Figure 3.1, these represent fixation counts collected at each item. Small arrows showing measurement errors are attached to the observed indicators. To give some idea of the distribution of fixation counts, histograms for two items (i.e., item 1 and item 3) across the test takers used in the upcoming example, with superimposed normal densities, are displayed in Figure 3.2.

Figure 3.1: A graphical representation of the negative binomial fixation model: The circle indicates the latent variable in the model, where \theta stands for latent test engagement. The squares represent the fixation counts collected at each item, which are the indicators measuring latent engagement. Small arrows showing measurement errors are attached to the observed indicators.

3.2.2 Negative Binomial Fixation Model with Linear Trend

The NBF model presented in the previous section assumes a constant engagement level for each test taker across all items.
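Under this constant-engagement specification, the NBF pmf in Equation 3.4 can be evaluated directly. The following is a minimal sketch (the parameter values are hypothetical), computed on the log scale for numerical stability:

```python
import math

def nbf_pmf(x, s_i, m_i, theta_j):
    """NBF probability mass (Equation 3.4) for fixation count x of test
    taker j on item i: s_i is the shape parameter, m_i the visual
    intensity, theta_j the latent engagement. The mean is exp(m_i + theta_j)."""
    mu = math.exp(m_i + theta_j)
    log_coef = math.lgamma(x + s_i) - math.lgamma(x + 1.0) - math.lgamma(s_i)
    log_pmf = (log_coef
               + s_i * math.log(s_i / (s_i + mu))
               + x * math.log(mu / (s_i + mu)))
    return math.exp(log_pmf)

# Sanity checks: the pmf sums to 1, and its mean recovers exp(m_i + theta_j).
probs = [nbf_pmf(x, s_i=2.0, m_i=1.0, theta_j=0.5) for x in range(500)]
total = sum(probs)
mean = sum(x * p for x, p in enumerate(probs))
```

The recovered mean illustrates the decomposition in Equation 3.3: more engaged test takers (larger \theta_j) or more visually intense items (larger m_i) yield higher expected fixation counts, while the shape s_i controls the overdispersion.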
However, this assumption may be unrealistic because it is likely that test takers may change their responding behaviors at different stages of their tests Wise and Kong (2005). For instance, test takers may feel fatigue and be careless towards end of their tests, or, they may start guessing more at end of a test due to the time pressure to finish all the questions. Thus, a more flexible model is required that would accommodate with the changes of test engagement. 68 Figure 3.2: Two items with fitted fixation counts An NBF model with flexible linear trend (NBF-LT) is proposed. The mean structure displayed in Equation 3.5 is reparameterized by adding a test-specific trend indicator, which takes on the same form as a latent growth model Bollen and Curran (2006), ?i j = exp(mi?X? j). (3.5) In this parameterization, intercept and slope parameters are elements of vector, ? j, where ? j = (?0 j, ?1 j)T is assume?d to follow a bivariate normal distribution. That is ????0 j? ? ?? ?? ?? ?? ??????0? j = ? N ?????? ?2? ?0 ?0?1?????? . (3.6) ?1 j 0 ? 2?1?0 ??1 For test taker j, parameter ?0 j represents the initial engagement level at item 1 and parameter ?1 j is the slope parameter, which permits constant change in engagement across the items for different test takers. The means of ?0 j and ?1 j are fixed to 0. Letting the growth parameter means be zero can facilitate the interpretations of individual-specific intercepts and slopes. For example, E[?0 j] = 0 is the population expectation of initial 69 engagement level and constrains 0 to be a reference starting level. Thus, the sign of ?0 j indicates whether a person has greater or less initial engagement compared to the reference level. By fixing the expectation of ?1 j to 0, the sign of ?1 j indicates whether engagement is increasing or decreasing across items. Elements of the I?2 design matrix X are formulated to correspond to linear growth and are defined as ?? ?T??1 1 1 ? ? ? 1X = ??? , (3.7) I1 I2 I3 ? ? ? 
where all test takers' fixation counts are recorded for the same items, I_i. Usually, I₁, …, I_I take values corresponding to the sequence in which the questions are answered, such as 1, 2, …, I. Figure 3.3 shows a graphical representation of the NBF model with a flexible linear trend.

Figure 3.3: A graphical representation of the negative binomial fixation model with linear trend: The circle indicates the latent variable in the model, where η stands for latent test engagement. The squares represent the fixation counts collected at each item, which are the indicators measuring latent engagement. Small arrows attached to the observed indicators represent measurement errors.

3.2.3 Negative Binomial Fixation Model with Quadratic Trend

To accommodate curvilinear trends in engagement, the NBF-QT model is proposed. This extension can capture nonlinearities in engagement as test takers work through their tests. The sign of the coefficient of the quadratic term indicates the concavity (i.e., the curve's orientation): a positive quadratic coefficient results in trends that are convex (open upward), whereas a negative coefficient results in trends that are concave (open downward). This elaboration is accomplished in a straightforward way by extending the mean structure of the NBF-LT model in Equation 3.5. In the NBF-QT model, the parameter vector η_j now has three elements with the inclusion of a quadratic parameter, η_j = (η_{0j}, η_{1j}, η_{2j})ᵀ, and follows a multivariate normal distribution,

η_j = (η_{0j}, η_{1j}, η_{2j})ᵀ ∼ N( (0, 0, 0)ᵀ, [ σ²_η0, σ_η0η1, σ_η0η2 ; σ_η1η0, σ²_η1, σ_η1η2 ; σ_η2η0, σ_η2η1, σ²_η2 ] ),   (3.8)

with the I × 3 design matrix X now defined to accommodate the quadratic growth parameter as

X = [ 1, 1, 1, …, 1 ; I₁, I₂, I₃, …, I_I ; I₁², I₂², I₃², …, I_I² ]ᵀ.   (3.9)
3.3 A Three-way Joint Modeling Approach for Item Responses, Response Times, and Fixation Counts

Various ways of integrating process data were introduced in Chapter 2. Inspired by the works cited there, especially that of van der Linden (2007), a trivariate joint modeling approach for item responses, RTs, and fixation counts is proposed here. This trivariate joint modeling approach delineates the trade-offs among response accuracy, working speed, and visual test engagement. The proposed joint model is an extension of the hierarchical modeling framework proposed by W. J. van der Linden (2006a). At level one, the one-parameter logistic (1-PL) model, the log-normal RT model, and the NBF model are specified separately; at level two, the variance-covariance structures of the person and item parameters are jointly estimated. A Bayesian estimation approach is used to fit the proposed hierarchical model.

3.3.1 Measurement Models at Level 1

For item responses, a one-parameter logistic (1-PL) model (Lord, 1952) is used. The 1-PL model describes the relation between an examinee's item response and one general latent ability trait, and is formulated as

P(u_ij = 1 | θ_j; b_i) = 1 / (1 + e^{−D(θ_j − b_i)}),   (3.10)

where P(u_ij = 1 | θ_j; b_i) is the probability of a correct response to item i, i = 1, …, I, by person j, j = 1, …, J; b_i is the location (difficulty) parameter for item i; and θ_j is a general latent trait for person j. D is a scaling constant, which is fixed at 1.7.

In addition to the 1-PL model, the log-normal RT model (W. J. van der Linden, 2006b) formulated in Equation 2.27 will be utilized to reflect the relationship between the response times and latent working speed. Moreover, the NBF model formulated in Equation 3.5 will be used to capture the association between the visual fixation counts and the latent visual test engagement.
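The three level-one components can be written out as a minimal sketch (not the study's JAGS implementation; all numeric inputs are made up, and φ is taken as the reciprocal standard deviation of the log-times, consistent with the L statistic defined later):

```python
import math

D = 1.7  # scaling constant from Equation 3.10

def p_correct(theta, b):
    """1-PL probability of a correct response: 1 / (1 + exp(-D(theta - b)))."""
    return 1.0 / (1.0 + math.exp(-D * (theta - b)))

def log_rt_density(t, lam, tau, phi):
    """Log-normal RT density with log-time mean lambda_i - tau_j and
    standard deviation 1/phi_i (an assumption; cf. sigma_e = 1/phi)."""
    sigma = 1.0 / phi
    z = (math.log(t) - (lam - tau)) / sigma
    return math.exp(-0.5 * z * z) / (t * sigma * math.sqrt(2.0 * math.pi))

def nbf_mean(m, eta):
    """NBF model mean fixation count: mu_ij = exp(m_i - eta_j)."""
    return math.exp(m - eta)

# A person whose ability matches the item difficulty succeeds half the time:
print(p_correct(theta=0.0, b=0.0))                    # 0.5
print(log_rt_density(t=30.0, lam=math.log(30.0), tau=0.0, phi=2.0))
print(nbf_mean(m=4.0, eta=0.5))
```

Each function corresponds to one level-one likelihood component; the hierarchical structure below ties their person and item parameters together.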
3.3.2 Modeling the Item Domain and Person Domain at Level 2

The second-level models incorporate two correlational structures to account for the dependencies among the person parameters and among the item parameters, respectively.

3.3.2.1 Modeling Person Domain Parameters

In this joint modeling approach, the person domain covers three latent person-side variables: latent ability θ, working speed τ, and visual engagement η. The relation among these three person-side latent variables in the population of test takers is assumed to follow a multivariate normal distribution such that

ξ_p = (θ, τ, η)ᵀ ∼ MVN(μ_p, Σ_p),   (3.11)

with mean vector μ_p = (μ_θ, μ_τ, μ_η)ᵀ and covariance matrix

Σ_p = [ σ²_θ, σ_θτ, σ_θη ; σ_τθ, σ²_τ, σ_τη ; σ_ηθ, σ_ητ, σ²_η ].   (3.12)

The parameter σ_θτ represents the linear dependency between the latent ability and the speed of the test taker; σ_θη represents the relation between ability and visual test engagement; and σ_τη represents the association between speed and visual test engagement. The signs of the parameter estimates indicate the trade-offs among these latent variables. For instance, a negative value of σ_θτ indicates that test takers who solve a task more quickly also tend to have lower latent ability (Bolsinova, De Boeck, & Tijmstra, 2017; De Boeck, Chen, & Davison, 2017; W. J. van der Linden, 2006a).

3.3.2.2 Modeling Item Domain Parameters

To account for the item parameter dependencies in this joint modeling approach, a multivariate normal distribution is defined for the item parameters ξ_I = (b_i, λ_i, m_i)ᵀ, such that

ξ_I ∼ MVN(μ_I, Σ_I),   (3.13)

where the mean vector and symmetric covariance matrix, μ_I and Σ_I, are defined respectively as μ_I = (μ_b, μ_λ, μ_m)ᵀ and

Σ_I = [ σ²_b, σ_bλ, σ_bm ; σ_λb, σ²_λ, σ_λm ; σ_mb, σ_mλ, σ²_m ].   (3.14)

These moments are a restricted version of those for the general item parameter vector ξ_I^G = (b_i, λ_i, m_i, φ_i, δ_i)ᵀ, which additionally includes the time-discrimination parameter φ_i and the visual engagement discrimination parameter δ_i, with mean vector μ_I^G = (μ_b, μ_λ, μ_m, μ_φ, μ_δ)ᵀ and the analogous 5 × 5 symmetric covariance matrix Σ_I^G,
respectively. Restrictions are placed on these item parameters so that the only parameters to be estimated are the item location, time intensity, and item visual engagement intensity. For instance, studies by Bolt and Lall (2003), Fox, Entink, and Avetisyan (2014), and Wang and Nydick (2015) show that estimating the correlations among the item slopes, item time discrimination, and item visual engagement discrimination could potentially lead to model over-fitting. The estimation precision of the person-side parameters would be reduced because of the lower degrees of freedom induced by needlessly estimating these correlations. Figure 3.4 displays the graphical representation of the trivariate joint modeling of item responses, response times, and visual fixation counts.

The proposed trivariate joint model helps integrate an eye-tracking indicator (visual fixations) into the traditional psychometric modeling framework. With this joint modeling framework, we can obtain a comprehensive picture of test takers' cognitive processes, one essential to understanding the underlying problem-solving process and impossible to assemble from item responses alone. In addition, the current joint modeling approach can be extended to incorporate other essential biometric indicators. Thus, it serves as an elementary foundation for bridging biometric and psychometric information.

Figure 3.4: Trivariate joint modeling approach of item responses, response times, and visual fixation counts

3.3.3 Model Parameter Estimation

In this study, Bayesian estimation of the model parameters is implemented in Just Another Gibbs Sampler (JAGS; Plummer, 2015), called from the R2jags package (Su & Yajima, 2015). Convergence is assessed via the coda package. Two chains of 96,000 total iterations, with thinning of 2 to alleviate auto-correlation among the draws, were executed.
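Convergence of such chains is commonly screened with the potential scale reduction factor (PSRF). A minimal textbook Gelman-Rubin computation, shown here only as a simplified stand-in for the coda implementation actually used, looks like this:

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (Gelman-Rubin R-hat) for a single
    parameter, given an (n_chains, n_draws) array of posterior draws.
    Values at or below 1.1 are conventionally taken to indicate convergence."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_plus = (n - 1) / n * W + B / n        # pooled variance estimate
    return float(np.sqrt(var_plus / W))

rng = np.random.default_rng(1)
mixed = rng.normal(0.0, 1.0, size=(2, 2000))       # two well-mixed chains
stuck = np.stack([rng.normal(0.0, 1.0, 2000),
                  rng.normal(3.0, 1.0, 2000)])     # chains exploring different regions
print(psrf(mixed))   # close to 1
print(psrf(stuck))   # far above the 1.1 cutoff
```

When the chains agree, the pooled and within-chain variances coincide and the PSRF approaches 1; disagreement inflates the between-chain term and pushes the PSRF above the cutoff.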
Model parameter estimates and standard deviations were summarized from the posterior densities using the final 4,000 iterations after a burn-in of 92,000. The potential scale reduction factor (PSRF) was used to evaluate convergence for all model parameters (Gelman, Carlin, Stern, & Rubin, 2003). For the current study, a PSRF value of 1.1 or less for each model parameter was used as the criterion indicating convergence.

Constraints for Model Identification

To properly identify the scales of the latent variables, model constraints are needed either on the item side (fixing the sum of the item thresholds to zero) or on the person side (fixing the expectation of the latent ability parameter to zero). In this study, the scales were identified on the person side, following the convention used for IRT model estimation (Volodin & Adams, 1995; Wu, Adams, Wilson, & Haldane, 1998). For the 1-PL model, the population mean of the latent ability, θ, was set to 0 (Lord, 1952), and the item discrimination parameter for each item was fixed to unity. For the log-normal RT model, the population mean of latent speed, τ, was constrained to 0 as well (W. J. van der Linden, 2006c). For the NBF model, the mean of the latent person-side visual engagement parameter, η, was also set to zero (Man & Harring, 2019). That is,

μ_θ = μ_τ = μ_η = 0.   (3.15)

Prior Distributions

The prior distribution of the item parameters, ξ_I in Equation 3.13, is assumed to be trivariate normal. A Gamma distribution is assumed for the time-discrimination parameter [i.e., φ_i ∼ Gamma(1, 1)], which is the reciprocal of the standard deviation of the log-times on an item (σ_{ε_i}) in the RT model log(T_ij) ∼ N(λ_i − τ_j, σ²_{ε_i}). In addition, the fixation dispersion parameter for each item [i.e., s_i ∼ IG(1, 1), i = 1, …, I] is assumed to follow an inverse Gamma distribution. Hyperpriors are defined as

μ_b ∼ N(0, 2),  μ_λ ∼ N(4.0, 2),  μ_m ∼ N(0, 1),  Σ_I ∼ IW(I₃, ν),
where I₃ is the 3 × 3 identity matrix and ν is the degrees-of-freedom parameter, which in this case is equal to 3.

Similarly, the prior specification for the person parameters, ξ_p in Equation 3.11, of the three-way joint model follows a trivariate normal distribution, with μ_p fixed at 0 and

Σ_p = [ σ²_θ, σ_θτ, σ_θη ; σ_τθ, σ²_τ, σ_τη ; σ_ηθ, σ_ητ, σ²_η ] ∼ IW(I₃, ν).

The joint posterior distribution for the proposed model can be represented as

p(ξ_p, ξ_I | u, log(T), c) ∝ ∏_{i=1}^{I} ∏_{j=1}^{J} p(u_ij, log(T_ij), c_ij | ξ_j, ξ_i) p(ξ_j | μ_p, Σ_p) p(ξ_i | μ_I, Σ_I) × p(μ_b) p(μ_λ) p(μ_m) p(Σ_I | ν) p(μ_p | 0, Σ_p) p(Σ_p | ν).

3.3.4 Evaluating Model-Data Fit: Posterior Predictive Model Checking

In this study, posterior predictive model checking (PPMC) was used to evaluate whether the proposed model adequately accounts for the variability existing in the data; that is, PPMC was used to check model-data fit (see, e.g., Gelman, Meng, & Stern, 1996; Levy, 2009; Rubin, 1996; Sinharay, Johnson, & Stern, 2006).

Introduction of the Method

Let ω = (ξ_pᵀ, ξ_Iᵀ)ᵀ be the vector of parameters we are interested in estimating, and let y be the set of observed data (e.g., item responses, response times, and visual fixation counts). The likelihood based on the conditional distribution of the data given the model parameters can be expressed as p(y | ω), and the prior distributions of all the model parameters can be denoted p(ω). By applying Bayes' rule, the posterior distribution of the parameters can be expressed as

p(ω | y) = p(y | ω) p(ω) / ∫ p(y | ω) p(ω) dω.   (3.16)

To check model-data fit with PPMC, predicted data are generated from the joint posterior distribution. The replicated datasets are denoted y^pred_r for r = 1, 2, …, R, where R indicates the number of draws from the joint posterior distribution. The distribution of the predicted data, the posterior predictive distribution (Equation 3.17), can be utilized for checking model-data fit.
p(y^pred | y) = ∫ p(y^pred | ω) p(ω | y) dω.   (3.17)

Model fit is evaluated by comparing the differences between the predicted data y^pred_r, r = 1, 2, …, R, and the observed data, y. A small difference is indicative of satisfactory model-data fit. Instead of directly comparing the predicted and observed data, a discrepancy measure T(·), a function of the data and model parameters, is usually computed, which summarizes the data and the corresponding model parameters (Gelman et al., 1996). Model-data fit can be evaluated by comparing T(y^pred, ω) and T(y, ω), which are calculated from the predicted and realized data, respectively. In practice, a posterior predictive p-value (PPP-value) is defined as the probability of obtaining predicted data more extreme than the observed data. The estimated PPP-value is the proportion of the R draws for which T(y^pred, ω) is equal to or larger than T(y, ω). A PPP-value close to 0 or 1 is indicative of poor model-data fit, since the predicted data y^pred_r are then systematically more (or less) extreme than the observed data, y. The posterior predictive p-value is defined as

p = P( T(y^pred, ω) ≥ T(y, ω) ) = ∫∫ 𝟙{ T(y^pred, ω) ≥ T(y, ω) } p(y^pred | ω) p(ω | y) dy^pred dω,   (3.18)
where 𝟙{·} is the indicator function. To compute model-data fit for the proposed model with the PPMC method, Sinharay et al. (2006) suggested the following three-step procedure outlined in Patz and Junker (1999):

1. Draw the item and person parameter estimates for the proposed model from the posterior distribution (Equation 3.16).

2. Draw y^pred from the proposed model, per Equation 3.17, based on the item and person parameter estimates drawn in step 1.

3. Compute the values of the observed and predictive discrepancy measures (e.g., item-fit statistics, or descriptive statistics based only on the data) from the above draws of parameters and data.

Model-data fit can then be evaluated from the computed PPP-values, which are given by Equation 3.18. Figure 3.5, a modification of a schematic presented by Sinharay et al. (2006), graphically demonstrates the detailed procedure of using the PPMC method to evaluate model-data fit.

Discrepancy Measures for the Proposed Models

Three statistics are introduced in this section. They will be used as discrepancy measures, T(·), to evaluate the item-by-person-level model-data fit for item responses, response times, and visual fixation counts, separately. Specifically, the values of T(y^pred, ω) and T(y, ω) will be calculated from the predicted and observed datasets using these three statistics. PPP-values will then be calculated from the discrepancies between T(y^pred, ω) and T(y, ω), as Figure 3.5 demonstrates. The three item-fit statistics are: (1) the W index (Wright & Stone, 1979); (2) the L index (Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014); and (3) the newly proposed M index, which is discussed in detail subsequently.

Figure 3.5: Graphical demonstration of the posterior predictive model checking (PPMC) method. y, observed data; y^pred, predicted data; ω, model parameters; p(ω), prior distributions of the model parameters; p(ω | y), posterior distributions of the model parameters; T(·), discrepancy measures.

Item response based W statistic. The W index is computed from a residual analysis obtained by applying the Rasch model (Rasch, 1960) to a set of examinees' item responses (Wright & Stone, 1979). As a consequence of this parsimoniously parameterized model, analyses require relatively small sample sizes (i.e., numbers of examinees) to produce reasonable model-data fit (Linacre & Wright, 1994). The computation of the W index follows

W_ij = [Y_ij − P_i(θ)]² / ( P_i(θ) [1 − P_i(θ)] ),   (3.19)

where P_i(θ) is the probability of correctly answering item i given the ability estimate θ
and Y_ij is the dichotomous (0, 1) response of person j to item i.

RT-based L statistic. Marianti et al. (2014) suggested an RT-based item-fit statistic, named the L statistic. The parameters used for calculating the L statistic are estimated under the RT model proposed by W. J. van der Linden (2006c). The L statistic is formulated as

L_ij = [ln(t_ij) − λ_i + τ_j]² / σ²_{e_i},   (3.20)

where t_ij is the response time of test taker j on item i, λ_i is the time-intensity parameter (the average population time required for answering the item), τ_j is the speed parameter for each test taker, and σ_{e_i} is defined as 1/φ_i.

Visual fixation based M statistic. To evaluate model-data fit based on the visual fixation counts, a visual fixation count based item-fit statistic, the M statistic, is proposed. The M statistic is a residual-based model-fit measure constructed from a summation of variance-weighted squared residuals, defined as the differences between the observed outcome, c_ij, and its predicted value, E(c_ij) (Cochran, 1952; Fox & Marianti, 2017b). The M statistic is formulated as

M_ij = [c_ij − exp(m_i − η_j)]² / σ²_ij,   (3.21)

where c_ij is the visual fixation count of test taker j on item i, m_i is the visual-intensity parameter (the average population visual effort required for answering the item), η_j is the individual visual engagement parameter, and σ²_ij is the variance of the visual fixation counts, defined as σ²_ij = exp(m_i − η_j) + exp(2(m_i − η_j))/s_i.

These three statistics were utilized as different discrepancy measures (see Figure 3.5) to calculate the PPP-values evaluating the item-by-person-level model-data fit for item responses, response times, and visual fixation counts, respectively. A PPP-value close to 0 for a discrepancy measure indicates problematic model-data fit and implies that the proposed model fails to sufficiently regenerate the data (Sinharay et al., 2006).
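The three discrepancy measures and the PPP-value computation can be sketched compactly as follows (illustrative only; all numeric inputs are made up rather than estimated from the study's data):

```python
import numpy as np

def w_stat(y, p):
    """W index (Equation 3.19): squared standardized Rasch residual."""
    y, p = np.asarray(y, float), np.asarray(p, float)
    return (y - p) ** 2 / (p * (1.0 - p))

def l_stat(t, lam, tau, phi):
    """L index (Equation 3.20), with sigma_e = 1 / phi."""
    return (np.log(t) - lam + tau) ** 2 * phi ** 2

def m_stat(c, m, eta, s):
    """M index (Equation 3.21): residual scaled by the NB variance."""
    mu = np.exp(m - eta)
    return (c - mu) ** 2 / (mu + mu ** 2 / s)

def ppp_value(T_pred, T_obs):
    """Proportion of draws with T(y_pred) >= T(y), per Equation 3.18."""
    return float(np.mean(np.asarray(T_pred) >= T_obs))

print(float(w_stat(0, 0.9)))                        # surprising miss: 9.0
print(float(l_stat(30.0, np.log(30.0), 0.0, 2.0)))  # exact fit: 0.0
print(float(m_stat(80, 4.0, 0.0, 5.0)))             # moderate residual, near 1
print(ppp_value([1.2, 0.8, 2.0, 0.5], 1.0))         # 2 of 4 draws: 0.5
```

Each statistic grows as the realized observation departs from its model-implied expectation, so extreme PPP-values flag the item-person combinations the model fails to reproduce.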
In the results section, the item-wise model-data fit for the item responses, response times, and visual fixation counts will be calculated by averaging the PPP-values over all persons for each item, and the results will be reported.

3.4 Integration of Bio- and Psychometric Information into Machine Learning Methods for Detecting Aberrant Behaviors

In this study, the data mining algorithms introduced in Chapter 2, a class of methods for clustering observations, provide a useful platform for combining the various sources of information that can detect different types of aberrant behaviors. The sensitivity with which aberrant behavior is detected can potentially be increased by incorporating not only process and biometric data as inputs to these algorithms, but also indices based on traditional approaches. Additionally, in contrast to applications involving traditional IRT-based and RT methods, data mining algorithms can examine both linear and nonlinear relations among variables, increasing their flexibility to benefit from modeling interactions between background, psychometric, and biometric data. To classify the three types of test takers described in Section 3.1, different detection methods will be applied. Two representative data mining methods, an unsupervised method (K-means clustering) and a supervised method (random forest), will be investigated. In addition, an item response based person-fit index, Ht, and a response time based person-fit index, lt, will be calculated. A real dataset will be analyzed to compare the various detection methods.

3.4.1 Data Normalization

Data normalization, also called feature scaling, is a process that transforms the ranges of different independent variables, or features, onto a common scale. This process will be performed for the supervised learning methods in this study (see Vapnik, 1963, for a discussion of the advantages for supervised learning methods).
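For reference, the unsupervised method mentioned above, K-means clustering, can be sketched minimally with Lloyd's algorithm (a hand-rolled stand-in; in practice a library implementation would be used, and the feature space shown is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(5)

def kmeans(X, k, iters=50):
    """Minimal Lloyd's algorithm: alternate nearest-center assignment
    and center recomputation; empty clusters keep their old center."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                            else centers[c] for c in range(k)])
    return labels, centers

# Two hypothetical groups of test takers in a (speed, engagement) feature space:
X = np.vstack([rng.normal([0.0, 0.0], 0.3, size=(100, 2)),
               rng.normal([2.0, 2.0], 0.3, size=(100, 2))])
labels, centers = kmeans(X, k=2)
print(np.linalg.norm(centers[0] - centers[1]))   # centers land in separate groups
```

With well-separated behavioral groups, the recovered cluster labels coincide with group membership, which is the detection logic the comparison in Chapter 4 evaluates.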
For traditional and unsupervised learning methods, however, data normalization will not be performed, either because those methods are invariant to monotonic transformations of individual features or because normalization can change the original characteristics of the data. Also, many traditional methods are parametric model-based clustering methods, which do not rely on geometric distance measures for classification; thus, it is not necessary to implement feature normalization for them (e.g., Dubes & Jain, 1988; Hastie et al., 2009; Strobl, Malley, & Tutz, 2009). Four types of scaling methods have drawn much attention in practical usage: scaling by variance, mean normalization, scaling by minimum and maximum values, and scaling to unit length (Fukunaga, 2013). In this study, variables were scaled by their minimum and maximum values, implying that all independent variables are scaled to the range [0, 1]. An advantage of this type of scaling is that it accommodates binary item responses.

3.4.2 Feature Selection

The purpose of feature selection is to choose the essential features in order to improve overall classification accuracy. After data normalization, to best capture the hidden insights in the dataset and make inferences from the model, a set of features {x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾}, also called independent variables or attributes, will be selected from the total set of potential input features {x⁽¹⁾, x⁽²⁾, …, x⁽ᴹ⁾}, where m < M. This is essentially a filtering process (see discussions in John, 1994; Koller, 1996; Miller, 1990). Implementing this selection process increases the interpretability of the model, making it less complex and more parsimonious. In this study, two filtering methods will be used for feature selection: (1) the Pearson correlation between each pair of input variables, and (2) the variable importance index (VII).
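The min-max scaling described above amounts to the following sketch (column names and values are invented for illustration):

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X to [0, 1] by its minimum and maximum values.

    Binary 0/1 item responses pass through unchanged, which is the
    advantage noted above; constant columns map to 0 to avoid dividing
    by zero.
    """
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

# One binary item-response column and one response-time column (seconds):
X = np.array([[1.0, 120.0],
              [0.0, 45.0],
              [1.0, 80.0]])
print(min_max_scale(X))
```

After scaling, the response-time column occupies the same [0, 1] range as the binary responses, so no single feature dominates a distance-based or gradient-based learner.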
The VII works by randomly permuting the values of a feature (input) variable, which breaks the original relation between that variable and the others. The permuted feature is then used together with the unpermuted features to make predictions. The resulting drop in prediction accuracy, that is, the gap between accuracy before and after permuting a specific variable, averaged across all trees, is used as the measure of variable importance. Usually, the permutation importance is calculated from the out-of-bag (OOB) observations, the samples left out when training each classification tree; the OOB samples can therefore be utilized as a test dataset for evaluating prediction accuracy. The importance of feature X_j for a tree t is defined as

VI_t(X_j) = ẽrr_{OOB_t} − err_{OOB_t},

where ẽrr_{OOB_t} denotes the classification error on the permuted OOB samples for tree t, and err_{OOB_t} denotes the classification error on the OOB samples for tree t without any permutation. The VII for a feature is its importance score averaged across all the trees built in the forest. The set of variables that will be considered in the analyses is listed in Table 3.1.

Table 3.1: Input psychological and biological variables for data mining methods

Psychological variables:
  Item responses
  Item response times for each item
  Total response time for the entire test
  Self-reported motivation indicators
  Latent speediness
  Efforts of test preparation indicators
  Ten-item personality inventory

Biological variables:
  Fixation duration (sec)
  Number of fixations
  Average time to 1st review (sec)
  Averaged revisits (average number of revisits made to the AOI)
  Latent visual engagement levels
  Revisit indicator to the AOI (0/1)

3.4.3 Outcome Measures and Expected Results

Based on the methods described above, a final set of variables will be selected from the results of the two feature selection methods.
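The permutation logic behind the VII can be sketched as follows. This is a simplified stand-in: importance is computed on a single hold-out set with a toy predictor rather than per-tree OOB samples of a fitted forest, and all names and data are invented.

```python
import numpy as np

rng = np.random.default_rng(3)

def permutation_importance(predict, X, y, j, n_repeats=30):
    """Mean increase in classification error after shuffling column j,
    mirroring VI(X_j) = err_permuted - err_original described above."""
    base_err = np.mean(predict(X) != y)
    errs = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        errs.append(np.mean(predict(Xp) != y))
    return float(np.mean(errs) - base_err)

# Toy data: the class depends on feature 0 only; feature 1 is pure noise.
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)
predict = lambda Z: (Z[:, 0] > 0).astype(int)   # stand-in for a fitted forest

print(permutation_importance(predict, X, y, j=0))   # large: accuracy collapses
print(permutation_importance(predict, X, y, j=1))   # 0.0: shuffling noise is free
```

Shuffling the informative feature destroys the prediction rule, so its error gap is large; shuffling the noise feature changes nothing, which is exactly the contrast the VII filter exploits.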
For each of the proposed methods, a method-based classification of aberrant and non-aberrant test takers will be obtained. Sensitivity and specificity will be used as outcome measures for evaluating the performance of the different methods. Sensitivity is defined here as the percentage of truly aberrant test takers that are classified as aberrant by a particular method; it is calculated as 100 × [TP/(TP + FN)], where TP is the number of true positives (i.e., test takers correctly classified as aberrant) and FN is the number of false negatives (i.e., aberrant test takers incorrectly classified as non-aberrant). Specificity, on the other hand, is defined here as the percentage of truly non-aberrant (normal) test takers that are classified as non-aberrant by the particular method; it is computed as 100 × [TN/(TN + FP)], where TN is the number of true negatives and FP is the number of false positives. As a means of comparing the performance of the methods in detecting aberrant test-taking behavior, the sensitivity and specificity rates will be reported and compared across the proposed methods.

3.5 Research Significance

First, this study will explore ways of incorporating biological information into traditional psychometric methods by developing new models and utilizing data-mining algorithms to better understand test takers' behaviors. This methodological work has the potential not only to aid administrators of large-scale assessments in ferreting out aberrantly behaving examinees, but also to lead to future research in the area of test security. Second, this study has the potential to create and develop methods that may very well flag aberrances with increased accuracy.
An advantage of the newly proposed statistical methods is that they are not based solely on one source of information, such as item responses, but rather on multiple sources of information about test takers, including data stemming from bio-information technologies and integrated log-file information. Ideally, all of this information will serve as input to be aggregated with highly efficient computational methods, such as cloud computing, in the data-mining framework. Third, through the methodological investigations and analyses using empirical data, the signal-to-noise ratio (SNR) could be increased, yielding more accurate classification with higher sensitivity for detecting different types of aberrant testing behaviors. Fourth, this study will further demonstrate the classification accuracy of the different methods (e.g., item response based person-fit analysis, response time based fraud detection methods, K-means clustering, random forest) in detecting different types of aberrant behaviors, such as pre-knowledge cheating and copy cheating.

Chapter 4: Results

In this chapter, the results of the study presented in Chapter 3 are elaborated. First, data visualization and exploratory data analysis (EDA) are conducted. Then, the results of the three innovative eye-gaze fixation models are presented. In addition, the proposed three-way factor model, jointly modeling item responses, RTs, and visual fixation counts, is fitted to the data in the different experimental conditions, and the behavioral pattern differences across the conditions are assessed. Lastly, results from implementing both unsupervised and supervised learning methods to classify types of test takers are presented.

4.1 Summary Statistics of the Collected Data across Conditions

A total of N = 335 university students who had normal or corrected vision were recruited for the study.
Students were asked to take a test consisting of I = 10 questions related to verbal reasoning. The test material used for the current study followed the structure of a high-stakes credentialing exam. Data from subjects who did not complete the designed tasks were excluded from the following analysis, leaving N = 298 participants in the study. Table 4.1 lists the number of subjects in each condition.

Table 4.1: Number of subjects in each condition

                     Condition 1   Condition 2   Condition 3
Number of subjects       93            98            107

Note: Condition 1: participants in the control condition, who did not receive any test preparation materials. Condition 2: participants who received items that were similar to those on their exam. Condition 3: participants who received similar exam questions together with the answer key.

The collected dataset includes 103 variables, which measure visual engagement, working speed, response accuracy, content revisits, test anxiety, and personality. The variable names are listed in Table A.1 (see Appendix A). Table B.1 (see Appendix B) shows the summary statistics of all the variables across the different experimental conditions.

4.2 Data Visualization Across Different Experimental Conditions

From the descriptive statistics of the collected data, it is not hard to gain insight into the group differences by comparing the means of each variable across the three experimental conditions. To understand the data better and to model it properly for accurate inferences, the collected data are explored through bivariate scatterplots of the major variables, which are useful and straightforward for interpreting trends and associations among the key variables. All the scatterplots were created from the total scores for each individual; see Figure 4.1. For instance, in the top left panel of Figure 4.1, the total scores were calculated by summing the 10 item scores.
Visualizing the key variables helps identify the most appropriate means of answering our research questions.

Figure 4.1: Scatterplots of essential variables under condition 1. The variables shown in the matrix, from the top left to the bottom right, are: total.score, total.gaze, total.time, total.revisits, total.anxiety.score, WTAS.score, and personality score. The distribution of each variable is shown on the diagonal of the plot matrix; the bivariate scatterplots are shown off-diagonal.

Figure 4.1 displays the scatterplots of the essential variables for assessing the behavioral patterns of test takers in condition 1. In the first three panels on the diagonal, listed from the top left, the total scores, total gaze counts, and total response times are essentially normally distributed. These factors will be jointly modeled to uncover associations among the latent constructs underlying these indicators. The Likert-scaled indicators measuring test anxiety and personality, although their distributions look erratic, will be used as input features for each of the data mining methods; because the data mining methods are non-parametric, they depend less on the underlying distributions of the input features.

Figure 4.2: Scatterplots of essential variables under condition 2.

Figure 4.2 shows the scatterplots of the same set of variables as in Figure 4.1 for assessing test takers' behavioral patterns under condition 2. The distributions in the first three diagonal panels (total scores, total gaze counts, and total response times) are bimodal and skewed, unlike those in condition 1. The bimodal distributions may indicate a mix of two groups of test takers with different test-taking strategies, responding to the items in different ways.
In addition, the total gaze counts and total response times are skewed to the right, meaning that, on average, test takers tended to finish the items in a shorter time. The distributions of the other Likert-scaled indicators measuring test anxiety and personality are more skewed and peaked than those shown in Figure 4.1.

Figure 4.3: Scatterplots of essential variables under condition 3.

Figure 4.3 shows test takers' behavioral patterns under condition 3. In general, all the distributions on the diagonal are relatively more skewed, with less variability. The first three panels from the top left are heavily skewed with high peaks, which indicates that the responding behavioral patterns of test takers under condition 3 differ dramatically from those of test takers in the other conditions. The results show that test takers in this group correctly answered the items more rapidly and with less visual attention. Also, the test takers in condition 3 behaved more alike.

4.3 Negative Binomial Visual Fixation Models

In this section, visual fixation, an essential eye-tracking indicator, is modeled to reflect the degree of test engagement as a test taker solves a set of questions. The three negative binomial models presented in Chapter 3 were evaluated for modeling the visual fixation counts produced by test takers answering questions: 1) the negative binomial fixation model (NBFM); 2) the negative binomial fixation model with linear trend (NBFM-LT); and 3) the negative binomial fixation model with quadratic trend (NBFM-QT).

4.3.1 Item Parameter Estimates

Table 4.2 presents the parameter estimates of the 10 items under the three proposed models. The item parameter estimates reflecting visual intensity, m̂, varied from 3.113 to 5.419. Item 1 was the least engagement-intensive item (m̂ = 3.194); in contrast, item 10 demanded the most visual effort from test takers (m̂ = 5.407).
One thing to note is that the item visual intensities were higher for the last three items. This confirmed our presumption, in light of the fact that the last three questions were reading comprehension questions, which required more visual effort from test-takers. Also, the results indicated that item 1 was the most discriminating item on the test. The results were consistent across the three proposed models.

Table 4.2: Item parameter estimates

              NBFM                      NBFM-LT                   NBFM-QT
Item    m           α            m           α            m           α
1       3.194(.029) 0.187(.010)  3.113(.053) 0.187(.010)  3.209(.023) 0.186(.010)
2       3.829(.024) 0.137(.007)  3.745(.051) 0.136(.008)  3.841(.017) 0.138(.007)
3       3.792(.027) 0.113(.009)  3.708(.052) 0.114(.010)  3.803(.022) 0.111(.009)
4       4.264(.044) 0.036(.003)  4.180(.062) 0.037(.003)  4.276(.040) 0.036(.003)
5       4.744(.022) 0.068(.007)  4.660(.050) 0.066(.007)  4.755(.015) 0.069(.006)
6       4.417(.032) 0.045(.004)  4.334(.055) 0.046(.004)  4.428(.028) 0.044(.004)
7       3.990(.035) 0.064(.005)  3.907(.056) 0.063(.005)  4.001(.029) 0.065(.005)
8       5.255(.020) 0.054(.006)  5.172(.049) 0.052(.006)  5.267(.115) 0.057(.005)
9       5.248(.025) 0.030(.003)  5.165(.051) 0.031(.003)  5.259(.018) 0.030(.002)
10      5.407(.020) 0.050(.005)  5.324(.049) 0.050(.006)  5.419(.012) 0.046(.004)

Note: NBFM: negative binomial fixation model; NBFM-LT: negative binomial fixation model with linear trend; NBFM-QT: negative binomial fixation model with quadratic trend. m: visual intensity parameter, indicating how much visual effort is required to answer an item. α: visual discrimination parameter.

4.3.2 Person Parameter Estimates

Table 4.3 presents the estimated variance-covariance matrices of the person parameters for the three proposed models. In general, small overall variability was seen in the random test-taker engagement effects. The variance of the test engagement fitted with the NBFM was 0.029 (SD = 0.004), which indicates a noteworthy contrast in individuals' degrees of test engagement.
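To make the negative binomial machinery concrete, the sketch below builds a count distribution whose log-mean combines an item's visual intensity with a person's engagement. The additive log-mean form and the dispersion value are my assumptions for illustration, not necessarily the chapter's exact NBFM parameterization; only m_i = 3.194 (item 1's visual intensity) is taken from Table 4.2.

```python
import math

def nb_logpmf(k, mu, r):
    """Negative binomial log-pmf with mean mu and dispersion r (variance mu + mu**2 / r)."""
    return (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
            + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))

# Assumed parameterization: log-mean = item visual intensity + person engagement.
# m_i is item 1's NBFM estimate; theta_j and the dispersion r are invented here.
m_i, theta_j, r = 3.194, 0.0, 5.0
mu = math.exp(m_i + theta_j)                     # expected fixation count, about 24.4
probs = [math.exp(nb_logpmf(k, mu, r)) for k in range(500)]
print(round(sum(probs), 4))                      # → 1.0 (mass sums to one)
print(round(sum(k * p for k, p in enumerate(probs)), 1))  # mean recovers mu
```

The extra-Poisson dispersion (variance mu + mu²/r) is what makes the negative binomial a natural choice for over-dispersed fixation counts.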
Figure 4.4 displays the constant test engagement for each test-taker across the 10 items. For the NBFM-LT, the variance of the initial test engagement was 0.086 (SD = 0.03), and the variance of the slopes of test engagement was 0.017 (SD = 0.003). Moreover, the estimated covariance between the initial test engagement and the slope parameters based on the NBFM-LT was negative, -0.016 (SD = 0.006, Cor. = -0.418). This result shows that test-takers who were highly engaged initially exhibited a lower growth rate in engagement than those whose initial engagement levels were lower, as demonstrated in Figure 4.5. For the NBFM-QT, the variance of the initial test engagement was 0.053 (SD = 0.008), the variance of the slope was 0.003 (SD = 0.002), and the variance of the quadratic term was 0.0003 (SD = 0.0002). However, the fitted quadratic term was negligible. The individual non-linear engagement trajectories are presented in Figure 4.6.

Table 4.3: Variance-covariance estimates

Model     Parameter       Mean (SD)
NBFM      Var(θ0)         .029 (.004)
NBFM-LT   Var(θ0)         .086 (.03)
          Cov(θ0, θ1)     -.016 (.006)
          Var(θ1)         .017 (.003)
NBFM-QT   Var(θ0)         .053 (.008)
          Cov(θ0, θ1)     0
          Var(θ1)         .003 (.002)
          Cov(θ0, θ2)     0
          Cov(θ1, θ2)     0
          Var(θ2)         .0003 (.0002)

In summary, three negative binomial distribution-based fixation models were proposed. The first model, the NBFM, was defined by assuming constant engagement levels across all the items. A slope term and a quadratic term were added to the first model as two extensions: the NBFM-LT and the NBFM-QT used a parsimonious parameterization of the mean structure to capture changes in engagement exhibiting either linear or nonlinear trends. The results revealed measurement quantities and individual differences in test engagement during problem solving. Two item engagement parameters, the engagement intensity and discrimination parameters, were designed to reflect the visual effort associated with an item.
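The constant, linear, and quadratic engagement structures can be sketched as follows; theta0, theta1, and theta2 are illustrative person-level values on roughly the scale of the reported variances, not estimates from Table 4.3.

```python
# Illustrative person-level coefficients (not the Table 4.3 estimates):
theta0, theta1, theta2 = 0.3, -0.05, 0.002

def engagement(t, model):
    """Log-scale engagement at item position t under each trend structure."""
    if model == "NBFM":     # constant engagement across items
        return theta0
    if model == "NBFM-LT":  # linear change over item position
        return theta0 + theta1 * t
    if model == "NBFM-QT":  # quadratic change over item position
        return theta0 + theta1 * t + theta2 * t ** 2
    raise ValueError(model)

for model in ("NBFM", "NBFM-LT", "NBFM-QT"):
    print(model, [round(engagement(t, model), 3) for t in range(10)])
```

With a negative theta1, the linear model reproduces the pattern seen in Figure 4.5: high initial engagement paired with a declining trajectory over the 10 items.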
The estimated person parameters revealed individual differences in test engagement across items. In the following section, the NBFM will be used to jointly model visual fixation counts, item responses, and RTs, which may help to comprehensively assess test-takers' performance.

Figure 4.4: Individual test engagement estimates based on the negative binomial fixation model.

Figure 4.5: Individual test engagement estimates based on the negative binomial fixation model with linear trend.

Figure 4.6: Individual test engagement estimates based on the negative binomial fixation model with quadratic trend.

4.4 Three-way Factor Model Parameter Estimates

In this section, as an example, the hypothesized normally behaved test-takers from condition 1 were analyzed to gain insights into test-takers' behavioral characteristics and item features based on the newly proposed three-way factor model, which analyzes item responses, RTs, and visual fixation counts jointly. Item parameter estimates based on the three measurement models are reported, including item features such as item difficulty, time intensity, and visual intensity. Additionally, the associations between all the person-side latent constructs are discussed, demonstrating the test-takers' behavioral characteristics, and the trade-offs among all the item-side parameters are discussed in a subsequent section.

4.4.1 Item Parameter Estimates

Item parameter estimates for the condition 1 data, comprising N = 93 subjects who did not receive any test preparation materials, are presented below. The item features across the three measurement models are summarized in Table 4.4.

Table 4.4: Item parameter estimates of the three-way factor model

         1-PL           RT                           NBFM
Item     b              λ             σ              m             α
1       -1.05 (.241)   1.92 (.046)   0.43 (.032)    3.20 (.026)   0.19 (.010)
2       -0.52 (.235)   2.65 (.027)   0.21 (.017)    3.86 (.022)   0.14 (.007)
3       -0.56 (.228)   2.59 (.029)   0.23 (.018)    3.81 (.023)   0.11 (.009)
4        0.31 (.227)   3.02 (.043)   0.36 (.028)    4.28 (.043)   0.04 (.003)
5        0.97 (.237)   3.58 (.027)   0.21 (.016)    4.77 (.020)   0.07 (.007)
6       -0.14 (.215)   3.09 (.036)   0.32 (.024)    4.42 (.030)   0.05 (.004)
7        0.50 (.222)   2.67 (.039)   0.36 (.027)    4.03 (.033)   0.06 (.005)
8        0.99 (.240)   3.98 (.025)   0.18 (.014)    5.28 (.017)   0.05 (.006)
9        0.61 (.217)   3.97 (.025)   0.20 (.016)    5.26 (.022)   0.03 (.003)
10       0.09 (.215)   4.14 (.024)   0.18 (.014)    5.42 (.018)   0.05 (.005)

1-PL IRT Model. The 1-PL IRT model was fit to the data to estimate item difficulty. Table 4.4 shows that the item difficulty estimates, b̂, varied from -1.05 to 0.99; based on comparing each estimate to the standard deviation of its posterior distribution (or inspecting the 95% credible interval), all were statistically significantly different from zero. Among the items, item 1 was the easiest and item 8 was the most difficult, which was expected since item 8 was a reading comprehension question whereas item 1 was a sentence equivalence question consisting of only a single sentence and one blank.

Lognormal RT Model. In terms of the RT model, the time intensity estimates, λ̂, ranged from 1.92 to 4.14 and, as shown in Table 4.4, were all statistically significantly different from zero. On average, test-takers spent the least time responding to item 1 (λ̂1 = 1.92). In contrast, item 10 required the most time, on average, for test-takers to answer.

NBF Model. For the NBFM, Table 4.4 outlines the parameter estimates of the 10 items. The visual intensity estimates, m̂, varied from 3.20 to 5.42, and the visual discrimination estimates, α̂, varied from 0.03 to 0.19, indicating that item 1 was the most discriminating item on the test and item 10 among the least discriminating.
Item 1 was the least engagement-intensive item (m̂ = 3.20). In contrast, item 10 required the most visual effort from test-takers (m̂ = 5.42), which matched our expectations since item 10 was a reading comprehension question requiring considerable visual effort from test-takers.

4.4.2 Variance-Covariance Estimates

Table 4.5 presents the parameter estimates of the person- and item-side variance-covariance matrices at the structural level of the three-way factor model (see Figure 3.4). This is of interest because the structural-level item-side variance-covariance matrix indicates the pair-wise associations among item responses, RTs, and visual fixation counts.

Table 4.5: Variance-covariance estimates of the three-way model

Item parameters                        Person parameters
Par.      Mean    CI                   Par.      Mean    CI
σ²(b)     0.701   (0.022, 0.839)       σ²(θ)     0.488   (0.233, 0.850)
σ²(m)     0.737   (0.292, 0.875)       σ²(τ)     0.017   (0.007, 0.013)
σ²(λ)     0.731   (0.280, 0.874)       σ²(ζ)     0.021   (0.015, 0.029)
σ(b,λ)    0.421   (0.292, 0.875)       σ(θ,τ)    0.001   (-0.020, 0.023)
σ(b,m)    0.426   (0.022, 0.555)       σ(θ,ζ)    0.000   (-0.026, 0.027)
σ(λ,m)    0.606   (0.189, 0.733)       σ(τ,ζ)    0.003   (-0.001, 0.007)

Note: b: item difficulty; λ: time intensity; m: visual intensity; θ: latent ability; τ: working speed; ζ: visual engagement.

4.4.3 Item-Side Variance-Covariance Structure

The estimated covariance between item difficulties and item visual intensities was 0.43 (Cor. = 0.59) with a 95% credible interval of 0.02 to 0.56, indicating that item difficulties were positively correlated with item visual intensities on the current test (see Figure 4.7). The estimated covariance between item difficulties and item time intensities was 0.42 (Cor. = 0.59) with a 95% credible interval of 0.29 to 0.87, which shows a significant association between item difficulties and time intensities on the given test. Moreover, the estimated covariance between item visual intensities and item time intensities was 0.61 (Cor.
= 0.83) with a 95% credible interval of 0.19 to 0.73, which also shows a significant result: the item time intensities were positively correlated with item visual intensities (see Figure 4.7).

Figure 4.7: Scatterplots for item parameter estimates. A loess non-parametric smoothed curve is plotted for each scatterplot.

4.4.4 Person-Side Variance-Covariance Structure

The person-side covariances σ(θ,τ), σ(θ,ζ), and σ(τ,ζ), presented in Table 4.5, were estimated to be 0.001 (95% credible interval: -0.026 to 0.027; Cor. = 0.005), -0.001 (95% credible interval: -0.023 to 0.020; Cor. = -0.011), and -0.003 (95% credible interval: -0.007 to 0.001; Cor. = -0.159), respectively (see Figure 4.8). Remarkably, none of the person-side covariance estimates was statistically significantly different from 0. The non-significant correlations could be a result of the subjects lacking the motivation required to finish the designed assessment (Wise & Kong, 2005).

Figure 4.8: Scatterplots for person parameter estimates. A loess non-parametric smoothed curve is plotted for each scatterplot.

4.4.5 Assessing the Item-Wise Data-Model Fit

Posterior predictive model checking (PPMC; Gelman, Carlin, Stern, Dunson, Vehtari, & Rubin, 2014) was used to evaluate model-data fit. Specifically, three discrepancy measures based on item-fit statistics were used to calculate the item-wise data-model fit for the item responses, RTs, and visual fixation counts separately. Recall that the three item-fit statistics introduced in Chapter 3 are: 1) the W index (Wright & Stone, 1979); 2) the lt index (Marianti, Fox, Avetisyan, Veldkamp, & Tijmstra, 2014); and 3) the newly proposed Mc index. Figure 4.9 shows the PPMC-values for W, lt, and Mc. In general, a comparison of the PPMC-values for the three models across the 10 items shows satisfactory data-model fit: all the PPMC-values were above the 0.05 level, the cut-off used in evaluating data-model fit.
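In simplified form, a posterior predictive p-value of the kind plotted in Figure 4.9 is the share of replicated discrepancies at least as extreme as the realized one. The sketch below uses toy Gaussian discrepancy draws rather than the actual W, lt, or Mc indices.

```python
import random

def ppp_value(realized, replicated):
    """Simplified posterior predictive p-value: the share of replicated
    discrepancies at least as large as the realized one; < .05 flags misfit."""
    return sum(d >= realized for d in replicated) / len(replicated)

random.seed(1)
replicated = [random.gauss(10, 2) for _ in range(2000)]  # toy discrepancy draws

print(ppp_value(9.0, replicated) > 0.05)   # realized value in the bulk: adequate fit
print(ppp_value(20.0, replicated) > 0.05)  # far in the tail: flagged as misfit
```

In full PPMC the realized and replicated discrepancies are both computed per posterior draw; the one-sample version above only conveys the tail-probability logic behind the 0.05 cut-off.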
A PPMC-value greater than 0.05 indicates that there are no systematic differences between the realized and predictive values, and thus an adequate data-model fit. The IRT model showed the least satisfactory fit, however, because its PPMC-values over the 10 items were systematically lower than those calculated based on the RT model and the NBFM; even so, the PPMC-values calculated based on W were still above the 0.05 threshold, indicating agreeable fit. In addition, Figure 4.9 further details the item-wise fits for the given data set, summarizing the PPMC-values calculated over 2,000 iterations. The three dashed horizontal lines denote PPMC-values of 0.05, 0.5, and 0.95, respectively. PPMC-values lying below the dashed line at the 0.05 level would indicate unsatisfactory data-model fit. All the PPMC-values in Figure 4.9 were above 0.05, which confirms that the proposed three-way joint model fits the data set well.

In all, the results are suggestive and demonstrated several interesting findings. First, given the condition 1 dataset, the fitted three-way factor model revealed that the three latent dimensions were not statistically significantly correlated with each other, which demonstrates weak trade-offs among the accuracy, working speed, and visual engagement of test-takers while their eyes were being tracked. Second, the estimated structure of the measurement features of the proposed model is instructive for practitioners in the testing industry.

Figure 4.9: Posterior predictive p-values for the 1-PL IRT model, the log-normal response time model, and the negative binomial visual fixation counts model over 10 items. The three dashed horizontal lines denote 0.05, 0.5, and 0.95, respectively. The box-plots represent the item-by-person-level PPP-values. The whiskers indicate the minimum and maximum PPP-values for each item.
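As a consistency check on Section 4.4.3, the reported item-side correlations can be recovered from the covariances and variances in Table 4.5 (here "lam" stands in for the time intensity parameter):

```python
import math

# Variances and covariances as reported in Table 4.5 (item side):
var = {"b": 0.701, "m": 0.737, "lam": 0.731}
cov = {("b", "m"): 0.426, ("b", "lam"): 0.421, ("lam", "m"): 0.606}

def corr(p, q):
    """Correlation implied by a covariance and the two variances."""
    return cov[(p, q)] / math.sqrt(var[p] * var[q])

print(round(corr("b", "m"), 2))    # → 0.59 (difficulty vs. visual intensity)
print(round(corr("b", "lam"), 2))  # → 0.59 (difficulty vs. time intensity)
print(round(corr("lam", "m"), 2))  # → 0.83 (time vs. visual intensity)
```

The three recovered values match the correlations of 0.59, 0.59, and 0.83 quoted in the text.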
The results show that item difficulty, time intensity, and visual intensity were positively related to each other, which indicates that difficult items require more time and more visual effort from test-takers to answer. By fitting the three-way factor model, we could potentially evaluate test-takers' performance comprehensively in technology-enhanced environments such as game-based testing or scenario-based virtual reality learning tasks.

4.5 Assessing Test-Taking Behaviors Across Different Experimental Conditions

To understand and evaluate the differences in test-taking behavioral patterns across the distinct experimental conditions, a multiple-group joint three-way factor model of item responses, RTs, and visual fixation counts was fitted separately to the data in each condition. Parameter estimates of the level-1 measurement models across the three conditions are reported. Moreover, the differences in the associations of the person-side and item-side parameters are reported via the corresponding covariance estimates across the contrasting experimental conditions.

4.5.1 Impact of Having Pre-knowledge of Test Items on Item Characteristics

To evaluate the impact of having pre-knowledge of test questions on the properties of test items (see Figure 4.10), Table 4.6 displays a comparison of the item parameter estimates of the proposed model across the three experimental conditions. In general, the item difficulties (b̂), time intensities (λ̂), and visual intensities (m̂), on average, tended to be lower in condition 3 than in the other two conditions (see Table 4.6). This is potentially attributable to the fact that test-takers tend to spend less time and less visual effort on a test with which they are more familiar from having practiced similar items in advance.

Item difficulty estimates across conditions.
In general, items, on average, appeared to be much easier in condition 3 than in the other two conditions. The item difficulties b̂ ranged from -1.05 to 0.99 in condition 1, from -1.55 to 0.79 in condition 2, and from -4.09 to -0.24 in condition 3. Intriguingly, the difference (b̂diff(1,2)) in item difficulties between condition 1 and condition 2 is not as large as the difference (b̂diff(1,3)) between condition 1 and condition 3 (see Figure 4.10), which means that practicing items beforehand without knowing the answer keys has limited impact on item difficulties. In contrast, the item difficulties decreased greatly when test-takers practiced the equivalent items with keys.

Time intensity estimates across conditions. Similarly, test-takers who practiced the items or knew the answer keys beforehand tended to take less time to finish their tests. Averaging the time intensities across the 10 items, λ̄ (the averaged time intensity) is 3.21 in condition 1, 2.367 in condition 2, and 2.102 in condition 3 (see Table 4.6). By taking the exponential of each averaged time intensity estimate, λ̄ was converted into seconds. On average, test-takers in condition 1 took about 25 seconds to finish an item, those in condition 2 about 11 seconds, and those in condition 3 about 8 seconds. The results show that, on average, the test-takers in condition 3, who practiced items beforehand with answer keys, worked three times faster on answering an item than those in condition 1, who did not receive any test preparation materials.

Visual intensity estimates across conditions. A trend in visual intensities similar to the response patterns summarized in the previous section was observed, which indicates that test-takers familiar with the items tended to put less visual effort into searching for information to answer the questions (see Figure 4.10). Averaging the visual intensities across the 10 items, m̄
(the averaged visual intensity) is 4.427 in condition 1, 3.707 in condition 2, and 3.505 in condition 3 (see Table 4.6). By taking the exponential of each averaged visual intensity estimate, m̄ was converted into fixation counts. In general, test-takers in condition 1 generated about 84 fixations to finish an item, those in condition 2 about 40 fixations, and those in condition 3 about 33 fixations. The results show that, on average, test-takers in condition 3 put much less visual effort into solving the questions than those in the other two conditions.

Table 4.6: Item parameter estimates across different experimental conditions

Condition  Item    b (sd)         λ (sd)        σ (sd)        m (sd)        α (sd)
C1         1      -1.05 (0.24)    1.92 (0.05)   0.43 (0.03)   3.20 (0.03)   0.03 (0.012)
           2      -0.52 (0.24)    2.65 (0.03)   0.21 (0.02)   3.85 (0.02)   0.03 (0.006)
           3      -0.60 (0.23)    2.59 (0.03)   0.24 (0.02)   3.80 (0.02)   0.04 (0.011)
           4       0.31 (0.22)    3.02 (0.04)   0.36 (0.03)   4.27 (0.04)   0.05 (0.003)
           5       0.97 (0.24)    3.57 (0.03)   0.21 (0.02)   4.76 (0.02)   0.03 (0.007)
           6      -0.14 (0.22)    3.08 (0.04)   0.33 (0.02)   4.42 (0.03)   0.03 (0.004)
           7       0.50 (0.22)    2.67 (0.04)   0.36 (0.03)   4.02 (0.03)   0.03 (0.005)
           8       0.99 (0.24)    3.98 (0.03)   0.18 (0.01)   5.27 (0.01)   0.04 (0.007)
           9       0.60 (0.22)    3.97 (0.02)   0.21 (0.02)   5.26 (0.02)   0.04 (0.003)
           10      0.09 (0.23)    4.14 (0.02)   0.18 (0.01)   5.42 (0.01)   0.04 (0.005)
C2         1      -0.67 (0.24)    1.81 (0.06)   0.44 (0.03)   3.07 (0.06)   0.12 (0.011)
           2      -1.55 (0.29)    1.81 (0.06)   0.39 (0.03)   3.03 (0.05)   0.13 (0.012)
           3      -0.48 (0.22)    1.99 (0.05)   0.34 (0.03)   3.22 (0.05)   0.12 (0.010)
           4       0.13 (0.23)    2.43 (0.06)   0.45 (0.03)   3.73 (0.06)   0.05 (0.005)
           5       0.64 (0.24)    2.74 (0.06)   0.42 (0.03)   4.01 (0.05)   0.05 (0.004)
           6      -0.21 (0.23)    2.34 (0.06)   0.42 (0.03)   3.66 (0.05)   0.06 (0.005)
           7       0.78 (0.24)    2.21 (0.06)   0.42 (0.03)   3.48 (0.06)   0.07 (0.006)
           8       0.66 (0.23)    2.90 (0.07)   0.56 (0.04)   4.32 (0.06)   0.03 (0.002)
           9       0.79 (0.24)    2.78 (0.08)   0.63 (0.05)   4.38 (0.07)   0.02 (0.002)
           10     -0.17 (0.23)    2.66 (0.08)   0.62 (0.05)   4.17 (0.07)   0.02 (0.002)
C3         1      -4.09 (0.55)    1.69 (0.06)   0.40 (0.03)   2.98 (0.06)   0.13 (0.012)
           2      -3.76 (0.51)    1.51 (0.06)   0.41 (0.03)   2.80 (0.06)   0.15 (0.014)
           3      -1.91 (0.31)    1.92 (0.08)   0.59 (0.04)   3.36 (0.07)   0.05 (0.005)
           4      -1.72 (0.28)
  2.29 (0.08)   0.73 (0.05)   3.82 (0.08)   0.03 (0.003)
           5      -1.92 (0.29)    2.46 (0.06)   0.41 (0.03)   3.77 (0.06)   0.05 (0.004)
           6      -1.81 (0.30)    2.16 (0.06)   0.43 (0.03)   3.52 (0.06)   0.07 (0.006)
           7      -1.23 (0.28)    2.11 (0.06)   0.41 (0.03)   3.40 (0.06)   0.08 (0.007)
           8      -2.39 (0.33)    2.32 (0.07)   0.52 (0.04)   3.80 (0.06)   0.04 (0.004)
           9      -0.24 (0.25)    2.23 (0.07)   0.56 (0.04)   3.76 (0.07)   0.04 (0.004)
           10     -0.81 (0.28)    2.33 (0.07)   0.55 (0.04)   3.84 (0.07)   0.04 (0.004)

Figure 4.10: Item parameter estimates across the distinct experimental conditions based on the 1-PL IRT, lognormal RT, and negative binomial visual fixation counts models. Red: condition 1; black: condition 2; blue: condition 3.

4.5.2 Impact of Having Pre-knowledge of Test Items on Test-Takers' Behavior

Table 4.7 shows the impact of having pre-knowledge of test items on test-takers' behaviors. The behavioral pattern differences are demonstrated via a comparison of the three person-side covariances, indicating the associations among the latent constructs of interest (ability, working speed, and visual engagement) across the three experimental conditions. As a trend, as students gained more pre-knowledge of the test items, the correlation between latent ability and working speed increased from 0.005 in condition 1 (95% credible interval: -0.239 to 0.251) to 0.672 (95% credible interval: 0.496 to 0.810) in condition 3. The increased correlation between latent ability and working speed might be caused by the test-takers in condition 3 receiving practice items with answer keys; therefore, they answered more items correctly than those who did not receive any test preparation materials.

Table 4.7: Person-side correlation matrix estimates

              C1                         C2                          C3
Parameter     Mean     CI                Mean     CI                 Mean     CI
Cor(θ, ζ)    -0.011   (-0.244, 0.227)   -0.193   (-0.437, -0.108)   -0.678   (-0.812, -0.505)
Cor(θ, τ)     0.005   (-0.239, 0.251)    0.240   (-0.020, 0.327)     0.672   (0.496, 0.810)
Cor(τ, ζ)    -0.152   (-0.359, -0.080)  -0.899   (-0.935, -0.886)   -0.910   (-0.942, -0.867)

Note: C1: condition 1; C2: condition 2; C3: condition 3; CI: credible interval; Cor.: correlation; θ: latent ability; τ: working speed; ζ: visual engagement.
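The back-transformations of the averaged intensities reported in Section 4.5.1 can be verified directly; note that exp(3.707) rounds to 41, consistent with the "about 40" fixations quoted in the text.

```python
import math

# Averaged log-scale intensities as reported in the text:
time_intensity = {"C1": 3.21, "C2": 2.367, "C3": 2.102}      # lognormal RT scale
visual_intensity = {"C1": 4.427, "C2": 3.707, "C3": 3.505}   # log fixation-count scale

seconds = {c: math.exp(v) for c, v in time_intensity.items()}
fixations = {c: math.exp(v) for c, v in visual_intensity.items()}

print({c: round(s) for c, s in seconds.items()})    # → {'C1': 25, 'C2': 11, 'C3': 8}
print({c: round(f) for c, f in fixations.items()})  # → {'C1': 84, 'C2': 41, 'C3': 33}
```

Because both measurement models operate on a log scale, a constant difference in intensities corresponds to a multiplicative speed-up in seconds or fixation counts, which is why condition 3 works roughly three times faster than condition 1.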
In terms of changes in the trade-offs between latent ability and visual engagement across conditions, Figure 4.11 shows that test-takers who were familiar with the test items tended to put less visual effort into answering the items. The correlation between these two latent constructs dropped from -0.011 in condition 1 (95% credible interval: -0.244 to 0.227) to -0.678 in condition 3 (95% credible interval: -0.812 to -0.505). Similarly, a negative trade-off between working speed and visual engagement was observed: the correlation (Cor(τ, ζ)) decreased from -0.152 in condition 1 (95% credible interval: -0.359 to -0.080) to -0.910 in condition 3 (95% credible interval: -0.942 to -0.867). This result suggests that when test-takers knew the answer keys of the practice items, they favored answering the questions quickly without paying elaborate attention to the content (see Figure 4.11).

To summarize, the item parameters at the measurement level were significantly affected by the amount of pre-knowledge test-takers had, especially when the test-takers had practiced equivalent items with answer keys. Correspondingly, the associations among the person-side latent constructs (e.g., latent ability, working speed, and visual engagement) were greatly affected by pre-knowledge as well. The ability estimates of test-takers with pre-knowledge of the test items were positively correlated with their working speed, their abilities were negatively associated with their visual engagement levels, and their working speed was negatively correlated with their visual engagement levels. In other words, such test-takers might be inclined to finish their tests quickly, paying less attention to the content of the items while answering most items correctly.

Figure 4.11: Scatterplots for person-side parameter estimates. A loess non-parametric smoothed curve is plotted for each scatterplot.
In contrast, when the test-takers had access to test-preparation materials without keys, their ability estimates were not significantly correlated with their working speed or visual engagement; however, a strong negative correlation between working speed and visual engagement could still be expected. For instance, a high-ability test-taker was statistically as likely to work quickly as slowly on the test, and those finishing their tests quickly paid less visual attention to the content than those who worked slowly. This is of interest because testing companies could potentially tackle the identification of suspicious aberrant test-takers by matching their behavioral characteristics with the findings mentioned above.

4.6 Use of Person-Fit Statistics to Classify Different Responding Behaviors

PFSs are widely utilized in the industry as approaches to identify how aberrant test-takers behaved during their tests. In order to show the differences in performance between the PFSs and the data mining methods in separating aberrant cases from normally behaved ones, the results of two representative PFSs studied previously were evaluated: an item response-based PFS, the l*z statistic, and an RT-based PFS, the lt statistic. The l*z statistic was calculated using the lzstar function in the R package PerFit (Tendeiro, Meijer, & Niessen, 2016). Commonly, PFSs are computed when a small portion of aberrances exists in a dataset. To show the shortcomings of using PFSs to classify cases showing more than two types of aberrant responding behaviors, the current dataset, mixing cases from the three conditions, was directly fitted with the lzstar function. In this study, the results reported based on the PFSs would be questionable because there are significantly fewer hypothesized normally behaved cases in condition 1 than in the other two (aberrant) conditions, which violates the basic assumption of using PFSs.
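For reference, the standardized log-likelihood logic behind lz-type statistics (Drasgow, Levine, & Williams, 1985) can be sketched as below. The actual analysis used PerFit's lzstar, which additionally corrects for estimated ability, so this plain lz with toy probabilities is a simplification.

```python
import math

def lz(responses, probs):
    """Standardized log-likelihood person-fit statistic (Drasgow, Levine, &
    Williams, 1985); large negative values flag aberrant response patterns."""
    l0 = sum(u * math.log(p) + (1 - u) * math.log(1 - p)
             for u, p in zip(responses, probs))
    expected = sum(p * math.log(p) + (1 - p) * math.log(1 - p) for p in probs)
    variance = sum(p * (1 - p) * math.log(p / (1 - p)) ** 2 for p in probs)
    return (l0 - expected) / math.sqrt(variance)

model_probs = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # toy P(correct) values
consistent = [1, 1, 1, 1, 0, 0, 0, 0]   # pattern matching the probabilities
aberrant = [0, 0, 0, 0, 1, 1, 1, 1]     # misses easy items, passes hard ones

print(lz(consistent, model_probs) > 0 > lz(aberrant, model_probs))  # → True
```

A pattern that misses easy items while passing hard ones (e.g., answer copying or item pre-knowledge on a subset of items) drives lz sharply negative, which is exactly what a cut-off on lz exploits.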
As a result, it is hard to come up with trustworthy cut-off values, since they are highly dependent on the underlying ability distribution of the test-takers, which was heavily contaminated by the large number of aberrant cases. Table 4.8 shows the sensitivity and specificity rates for the two representative PFSs. It can be seen in Table 4.8 that having a large number of aberrant cases resulted in substantially low values of both sensitivity and specificity. Figure 4.12 indicates that the PFS-based methods failed to separate the normally behaved subjects from the aberrant ones (l*z PFS: left panel; lt statistic: right panel).

Table 4.8: Sensitivity and specificity for the IRT- and RT-based PFS methods

% Consistent decision    l*z PFS    lt PFS
Sensitivity              0.04       0.00
Specificity              0.88       0.89
Overall accuracy         0.30       0.28

Figure 4.12: The PFSs' performance in classifying different types of responding behaviors. The l*z PFS is on the left, the lt PFS on the right. The blue line indicates the cut-off.

4.7 Use of Data Mining Methods to Classify Different Responding Behaviors

In this section, the use of representative data mining methods as an alternative to the PFSs was examined, with a focus on classifying the different types of test-takers belonging to the distinct experimental conditions. Although the previously investigated item response- and RT-based methods are popular in the industry as approaches to identifying aberrant test-taking behavior (e.g., pre-knowledge and copy cheating), data mining-based methods have yet to be fully investigated. To show the benefits of using data mining methods over the traditional PFSs, two groups of data mining methods introduced in Chapter 3 were used to classify the different responding behaviors: 1) unsupervised learning methods, and 2) supervised learning methods.
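Throughout this chapter, classification quality is summarized by sensitivity, specificity, and overall accuracy, all of which follow from a confusion matrix. The sketch below uses toy labels rather than the study's data.

```python
def classification_rates(truth, pred, positive):
    """Sensitivity, specificity, and overall accuracy from a 2x2 confusion matrix."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    tn = sum(t != positive and p != positive for t, p in zip(truth, pred))
    fp = sum(t != positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(truth)

# Toy labels: "A" = aberrant (the positive class), "N" = normally behaved.
truth = ["A", "A", "A", "A", "N", "N", "N", "N", "N", "N"]
pred = ["A", "A", "N", "N", "N", "N", "N", "N", "A", "N"]
sens, spec, acc = classification_rates(truth, pred, positive="A")
print(sens, round(spec, 3), acc)  # → 0.5 0.833 0.7
```

The pattern in Table 4.8, high specificity with near-zero sensitivity, corresponds to a detector that almost never flags anyone: it is right about the normal cases only because it rarely raises an alarm.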
To properly use the unsupervised and supervised learning methods, data normalization was performed to put all the input variables onto the same scale using the maximum-minimum method mentioned in Chapter 3. The final set of features was selected based on two methods discussed in Chapter 3: 1) the Pearson correlation between each pair of input variables, and 2) the variable importance index (VII). To achieve optimal classification accuracy, feature selection was conducted. Among the 60 total features, 13 features were highly correlated (r ≥ 0.9), as shown in Figure 4.13. Additionally, the ranking of feature importance calculated with the VII method is demonstrated in Figure 4.14. Among all the features, the top-ranked features weighing heavily on classifying the different responding behaviors were related to: 1) total scores, 2) revisits, 3) latent visual engagement, 4) fixation counts, and 5) RTs. In contrast, the personality measures and the Westside Test Anxiety Scale measures were less important for separating aberrant from normally behaved test-takers. Based on the VII values, the last 8 features were removed from the feature set to achieve optimal results: 1) WTAS1, 2) WTAS9, 3) WTAS8, 4) WTAS5, 5) Conscientiousness, 6) WTAS10, 7) Agreeableness, and 8) WTAS6. In summary, 52 total features were selected for the rest of the analysis.
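The preprocessing pipeline described above, maximum-minimum scaling followed by dropping one member of each highly correlated pair (|r| ≥ 0.9), can be sketched with hypothetical features; the feature names echo the chapter's variables but the values are invented.

```python
def min_max(xs):
    """Maximum-minimum normalization onto [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def pearson_r(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical features; total.gaze is nearly redundant with total.time.
features = {
    "total.time": [12.0, 30.0, 18.0, 25.0],
    "total.gaze": [120.0, 310.0, 175.0, 260.0],
    "total.score": [8.0, 3.0, 6.0, 4.0],
}
scaled = {name: min_max(vals) for name, vals in features.items()}

# Drop one member of any pair with |r| >= 0.9, mirroring the chapter's rule:
r = pearson_r(scaled["total.time"], scaled["total.gaze"])
keep = [name for name in scaled if not (name == "total.gaze" and abs(r) >= 0.9)]
print(keep)  # → ['total.time', 'total.score']
```

Because min-max scaling is a linear transformation, it changes the feature ranges without altering the pairwise Pearson correlations, so either step can be done first.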
Figure 4.13: Pair-wise correlations between features.

Figure 4.14: Feature importance.

4.7.1 Representative Unsupervised Learning Methods

K-Means. After normalizing the data and selecting the essential features, a representative unsupervised learning method was performed to cluster the subjects belonging to the different test-taking experimental conditions. The focus was twofold: (1) to evaluate the improvement in classification accuracy from using unsupervised learning methods, and (2) to reduce the error rates of misidentifying aberrances.
The specificity and sensitivity were calculated along with the accuracy rate across the different conditions, and the results were summarized based on 10-fold cross-validation. After applying the K-means method, three clusters of test-takers were identified based on their responding behavioral characteristics. This is because the dashed line in Figure 4.15, which plots the total within-cluster sum of squares, shows much smaller decreases after the elbow point at K = 3 (the optimal number of clusters). The within-cluster sums of squares were 293.90 (cluster 1), 191.55 (cluster 2), and 342.24 (cluster 3). The ratio of the between-cluster sum of squares to the total sum of squares was 0.73.

Table 4.9: Classification accuracy for the K-means method with three groups

True label          1       2       3
Sensitivity         0.989   0.575   0.877
Specificity         0.946   0.945   0.823
Overall accuracy    0.812

Figure 4.15: The optimal number of clusters based on the K-means method.

An evaluation of the classification of all the test-takers based on the K-means method is given in Table 4.9, which indicates that the subjects belonging to conditions 1 and 3 were well classified, with sensitivity rates of 98.9% and 87.7%. However, 33 subjects with the true label of condition 2 were incorrectly classified into condition 3, which indicates that the K-means method is limited in differentiating the condition-2 and condition-3 subjects with the selected input features. Nevertheless, Table 4.10 shows that both sensitivity (99.4%) and specificity (89.3%) were high after combining conditions 2 and 3, both of which had a certain amount of pre-knowledge of the test items, into one class.
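A bare-bones version of the K-means procedure and the between-SS/total-SS summary can be sketched as follows: Lloyd's algorithm with fixed starting centers and toy two-dimensional features, not the study's 52-feature data.

```python
def kmeans(points, centers, iters=20):
    """Plain Lloyd's algorithm on 2-D points with fixed starting centers."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: (p[0] - centers[i][0]) ** 2
                                        + (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        centers = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
                   if cl else centers[i] for i, cl in enumerate(clusters)]
    return centers, clusters

# Three toy, well-separated behavioral clusters on normalized features:
points = [(0.10, 0.10), (0.20, 0.15), (0.15, 0.20),
          (0.80, 0.80), (0.90, 0.85), (0.85, 0.90),
          (0.50, 0.10), (0.55, 0.15), (0.60, 0.10)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (1.0, 1.0), (0.5, 0.0)])

wss = sum((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
          for c, cl in zip(centers, clusters) for p in cl)
gx = sum(p[0] for p in points) / len(points)
gy = sum(p[1] for p in points) / len(points)
tss = sum((p[0] - gx) ** 2 + (p[1] - gy) ** 2 for p in points)
print(round(1 - wss / tss, 2))  # between-SS / total-SS, analogous to the 0.73 reported
```

The closer the between-SS/total-SS ratio is to 1, the more of the total variability the clustering explains; the chapter's value of 0.73 sits in that same summary.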
Table 4.10: Sensitivity and specificity for the K-means method with two groups

                  % Consistent Decision (K-Means)
Sensitivity       0.994
Specificity       0.893
Overall accuracy  0.96

As an example, Figure 4.16 visualizes the segregation among the three clusters produced by applying the K-means method to the data, based on two features: (1) the number of fixations and (2) the number of revisits across the ten items. In Figure 4.16, item 9 shows a clear boundary between the subjects in condition 1 (circles) and the others (condition 2, crosses; condition 3, triangles).

Figure 4.16: Segregation among the three groups based on the K-means method. Condition 1 marked with circles, condition 2 with crosses, and condition 3 with triangles.

These results provide some insight into how the K-means method performs more precisely than the traditional person-fit statistics (PFSs). First, the K-means method has much higher power to separate the three clusters, with a high balanced accuracy (0.994) compared with the PFSs (lz*: 0.32 and lt: 0.28). Moreover, after combining all subjects in conditions 2 and 3, the overall predictive accuracy rate of the K-means method was 0.96.
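The sensitivity and specificity figures reported in Tables 4.9-4.14 are one-vs-rest quantities computed from a confusion matrix. A small sketch, with invented labels rather than the study's data:

```python
def class_metrics(true_labels, pred_labels, cls):
    """One-vs-rest sensitivity and specificity for a single class."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
    tn = sum(1 for t, p in pairs if t != cls and p != cls)  # true negatives
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
    return tp / (tp + fn), tn / (tn + fp)

# Invented labels for ten test-takers across conditions 1-3; condition 2
# is partly confused with condition 3, echoing the pattern in Table 4.9.
truth = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
pred  = [1, 1, 1, 2, 3, 3, 3, 3, 3, 3]
sens2, spec2 = class_metrics(truth, pred, 2)  # sensitivity 1/3, specificity 1.0
accuracy = sum(t == p for t, p in zip(truth, pred)) / len(truth)  # 0.8
```

Collapsing conditions 2 and 3 into one pre-knowledge class raises the sensitivity for that combined class, which is the effect seen between Tables 4.9 and 4.10.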
Second, the K-means method can identify more than two clusters of subjects with different behavioral characteristics, whereas the PFSs can only separate aberrant cases from normally behaved test-takers. This gives the K-means method an invaluable advantage for identifying aberrance when multiple aberrant behaviors are mixed together. Importantly, real-world data are even more complex than these experimental data, and they require unsupervised learning methods like K-means to help practitioners make fine-grained decisions about who behaved aberrantly.

4.7.2 Representative Supervised Learning Methods

In this section, two selected supervised learning methods, K-nearest neighbors (KNN) and random forest (RF), were applied to classify subjects into the various experimental conditions, in order to see whether they could achieve higher classification accuracy rates than an unsupervised learning method such as K-means, as well as further reduce error rates. The specificity, sensitivity, and accuracy rates are of interest here.

KNN. The KNN algorithm predicts the class membership of a subject by identifying the other cases closest to it that show similar behavioral patterns. Starting the algorithm requires specifying the neighborhood size (K) as a tuning parameter. Figure 4.17 shows the classification error rate across different values of K for this dataset; K = 3 yields the lowest classification error rate (0.157).

Figure 4.17: The optimal neighborhood size based on the KNN algorithm.

The performance of the KNN method is summarized in Table 4.11. Subjects in conditions 1 and 3 were accurately classified, with high sensitivity rates of 99.9% and 90.6%, respectively.
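The K-tuning step just described amounts to evaluating the classification error for several neighborhood sizes and keeping the best. A minimal sketch with invented one-dimensional data (not the study's features, and not its error rate of 0.157):

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_predict(train, query, k):
    """Majority vote among the k training cases nearest to `query`.
    `train` holds (feature_tuple, label) pairs."""
    nearest = sorted(train, key=lambda item: dist2(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, heldout, k):
    """Fraction of held-out cases misclassified at neighborhood size k."""
    wrong = sum(1 for x, y in heldout if knn_predict(train, x, k) != y)
    return wrong / len(heldout)

# Invented one-feature data (e.g., a scaled review-time measure).
train = [((0.10,), "honest"), ((0.15,), "honest"), ((0.20,), "honest"),
         ((0.80,), "preknow"), ((0.85,), "preknow"), ((0.90,), "preknow")]
heldout = [((0.12,), "honest"), ((0.88,), "preknow")]
errs = {k: error_rate(train, heldout, k) for k in (1, 3, 5)}
# Pick the K with the smallest held-out error, as in Figure 4.17.
```

On this toy data every odd K classifies both held-out cases correctly; on real, noisier data the error curve has a minimum at some intermediate K, which is what Figure 4.17 locates.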
However, the sensitivity rate for condition 2 was relatively lower, about 64%, which indicates that it is challenging for the KNN method to separate the subjects in condition 2 from those in condition 3. This can be attributed to the fact that subjects in conditions 2 and 3 both practiced the items before taking their real tests. As a result, they could behave similarly because of their pre-knowledge of the items, which made it more difficult for the KNN method to differentiate between these two conditions. However, if conditions 2 and 3 are combined into one group, Table 4.12 shows substantial increases in both sensitivity (99.4%) and specificity (93.3%).

Table 4.11: Classification accuracy for the KNN method with three groups

True label        1      2      3
Sensitivity       0.99   0.643  0.906
Specificity       0.933  0.944  0.88
Overall accuracy  0.841

Table 4.12: Sensitivity and specificity for the KNN method with two groups

                  % Consistent Decision (KNN)
Sensitivity       0.99
Specificity       0.933
Overall accuracy  0.951

The grouping result of the KNN method, based on two features, (1) the number of fixations and (2) the review time across the ten items, is demonstrated in Figure 4.18. It can be seen from Figure 4.18, for instance, that item 2 shows a clear separation between the subjects in condition 1 (circles) and the others (condition 2, crosses; condition 3, triangles).

In general, the results for the KNN method were similar to those for K-means, with a high accuracy rate in classifying the different responding behaviors. One thing to note, however, is that the specificities were relatively higher with the KNN method than with the K-means method. This suggests that supervised learning methods have promising power to differentiate various types of responding behaviors while at the same time protecting normally behaved test-takers from being misidentified as aberrant cases in practice.

Random Forest.
To make the results easy to explain to practitioners, another supervised learning method was applied to classify the different types of responding behaviors. With the RF method, decision trees can be displayed graphically, and this graphical display readily shows the critical features that yield the final clusters.

To generate valid results with the RF method, two parameters need to be tuned or defined: the number of trees in the forest, and the number of variables (mtry) randomly considered at each splitting node. The default setting for mtry is the square root of the total number of features; therefore, in this study, mtry is equal to 7 (rounded down). The number of trees was tuned by computing the classification error, as demonstrated in Figure 4.19, which shows that 304 trees yields the lowest classification error.

The performance of the RF method is summarized in Table 4.13. Subjects in conditions 1 and 3 were accurately classified by the RF algorithm, with high sensitivity rates of 99% and 87%, respectively. The sensitivity rate for condition 2 is 87%, which is about 23% higher than the sensitivity rate obtained with the KNN method. Also, after combining conditions 2 and 3 into one group, Table 4.14 shows that both sensitivity and specificity are high, which indicates that the RF method successfully identified the subjects who had pre-knowledge of the test items. The RF method yields the highest overall accuracy rate of all the methods considered, approximately 98.4%.

A classification tree built on the training dataset is plotted in Figure 4.20. As shown in Figure 4.20, the tree splits from the top node, which is the number of fixations for item 10.
Under that node there are two options: yes, marked as Y, or no, marked as N.

Figure 4.18: Segregation among the three groups based on the KNN method. Condition 1 marked with circles, condition 2 with crosses, and condition 3 with triangles.

Figure 4.19: Number of trees in the random forest.

Table 4.13: Classification accuracy for the RF method with three groups

True label        1      2      3
Sensitivity       0.99   0.862  0.866
Specificity       0.977  0.936  0.943
Overall accuracy  0.905

If a case satisfied the condition listed at a node, it was assigned to the left side of that node; if not, it was assigned to the right side. The splitting process continues until a stopping condition is met, as described in Chapter 3. At the bottom level, a final predicted cluster membership was assigned to each case, with values of 0, 0.5, and 1 indicating conditions 1, 2, and 3, respectively.
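The recipe described above (bootstrap resampling, a random subset of about the square root of p features considered at each split, and a majority vote over the trees) can be sketched end to end. The version below uses one-level decision stumps instead of full trees to stay short, and the data, labels, and tree count are invented; it illustrates the idea and is not the study's implementation:

```python
import math
import random

def stump_fit(X, y, feat_idx):
    """Best single-feature midpoint split (by misclassification count),
    searching only the randomly chosen features in `feat_idx`."""
    best = None
    for j in feat_idx:
        vals = sorted({row[j] for row in X})
        for lo, hi in zip(vals, vals[1:]):
            thr = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[j] < thr]
            right = [lab for row, lab in zip(X, y) if row[j] >= thr]
            l_lab = max(set(left), key=left.count)
            r_lab = max(set(right), key=right.count)
            err = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or err < best[0]:
                best = (err, j, thr, l_lab, r_lab)
    if best is None:  # no split possible: predict the majority label
        lab = max(set(y), key=y.count)
        return (None, 0.0, lab, lab)
    return best[1:]

def stump_predict(stump, row):
    j, thr, l_lab, r_lab = stump
    if j is None:
        return l_lab
    return l_lab if row[j] < thr else r_lab

def forest_fit(X, y, n_trees=31, seed=0):
    rng = random.Random(seed)
    p = len(X[0])
    mtry = max(1, int(math.sqrt(p)))  # default: floor of sqrt(#features)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # bootstrap sample
        feats = rng.sample(range(p), mtry)                    # random feature subset
        forest.append(stump_fit([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def forest_predict(forest, row):
    votes = [stump_predict(s, row) for s in forest]
    return max(set(votes), key=votes.count)  # majority vote across trees

# Invented data: features 0-1 separate the groups, features 2-3 are pure noise.
X = [(0.10, 0.20, 0.5, 0.5), (0.15, 0.25, 0.5, 0.5),
     (0.20, 0.10, 0.5, 0.5), (0.12, 0.22, 0.5, 0.5),
     (0.90, 0.80, 0.5, 0.5), (0.85, 0.90, 0.5, 0.5),
     (0.80, 0.85, 0.5, 0.5), (0.95, 0.75, 0.5, 0.5)]
y = ["honest"] * 4 + ["preknow"] * 4
forest = forest_fit(X, y)
```

In the study itself, full classification trees were grown rather than stumps, with mtry = 7 and 304 trees chosen from the error curve in Figure 4.19.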
By following all of the conditions (nodes) from the top to the bottom, practitioners can gain insights into the behavioral characteristics of a group that behaved aberrantly on their tests.

Table 4.14: Sensitivity and specificity for the RF method with two groups

                  % Consistent Decision (RF)
Sensitivity       0.99
Specificity       0.977
Overall accuracy  0.984

For example, as shown in Figure 4.20, when a test-taker put low visual effort into answering item 10 (Number.fixation.10 < 0.34) and performed carelessly over the entire test, yet answered item 8 correctly, this person would most likely be classified as an aberrant test-taker who had substantial pre-knowledge of the items. This is significant for anyone seeking a method that accurately flags aberrantly behaved test-takers with interpretable graphs.

Figure 4.20: Classification tree as a demonstration of classifying different types of responding behaviors. Values of 0, 0.5, and 1 indicate conditions 1, 2, and 3, respectively.

In summary, a comparison of the capacity to cluster different types of responding behaviors across the different statistical methods was presented. The results suggest that the sensitivity, specificity, and overall accuracy rates of the supervised learning methods were relatively higher than those of the traditional IRT- and RT-based methods as well as the unsupervised learning methods. In addition, it is challenging for IRT- and RT-based methods to accommodate complex datasets with large numbers of aberrant responses. In contrast, data mining methods are able to overcome that limitation and accurately classify different types of test-taking behaviors. In particular, the specificity of the RF method was much higher than that of the other methods, which implies that the RF method could potentially protect normally behaved test-takers from being incorrectly classified as wrongdoers.
Chapter 5: Discussion

In this study, several methods were created, developed, and investigated that incorporate bio-information technology, namely eye tracking, into the classification of different types of responding behaviors in computer-based testing scenarios. This study explores the potential of combining psychometric and biometric information to assess test-takers' behaviors.

First, the collected experimental data were visualized and summarized. Next, three innovative gaze-fixation-based models were proposed. The first model, named the NBF model, was defined by assuming constant engagement levels across all the items. A slope term and a quadratic term were added to the first model as two extensions of the base model: the NBF-LT model and the NBF-QR model used a parsimonious parameterization of the mean structure to capture changes in engagement exhibiting either linear or nonlinear trends. To properly identify the scale of the latent variables, the expectations of the person-side latent variables were fixed at 0, in line with previous research by Fox and Marianti (2016). The proposed models help in understanding individual differences in test engagement levels and reveal item characteristics, including how much visual effort is required to answer an item and its discriminating power.

Second, a three-way hierarchical joint model was proposed to jointly model item responses, RTs, and visual fixation counts. A 1-PL IRT model was used for the item responses, a lognormal response time model (without truncation) for the RTs, and a negative binomial visual fixation counts model for the gaze fixations. These three measurement models were jointly modeled at the lower level of the hierarchy. At the higher level of the hierarchical model, the mean vectors and variance-covariance structures were estimated for the person and item parameters, respectively.
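The measurement and structural levels summarized in the last two paragraphs can be written compactly. The notation below is an illustrative reconstruction; the symbols and exact parameterization are assumptions, not necessarily those used in the dissertation:

```latex
% Measurement level, for person i and item j (notation assumed):
%   1-PL IRT model for item responses Y_{ij}:
P(Y_{ij} = 1 \mid \theta_i) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}
%   Lognormal model for response times T_{ij}:
\log T_{ij} \sim N\!\left(\lambda_j - \tau_i,\; \sigma_j^{2}\right)
%   Negative binomial model for visual fixation counts F_{ij}:
F_{ij} \sim \mathrm{NB}(\mu_{ij}, \phi), \qquad \log \mu_{ij} = \beta_j - \zeta_i
% Structural level: multivariate normal distributions for the person and item
% parameter vectors, with person means fixed at zero for identification:
(\theta_i, \tau_i, \zeta_i)^{\top} \sim N(\mathbf{0}, \boldsymbol{\Sigma}_P), \qquad
(b_j, \lambda_j, \beta_j)^{\top} \sim N(\boldsymbol{\mu}_I, \boldsymbol{\Sigma}_I)
```

Under this sketch, the off-diagonal elements of the person covariance matrix carry the associations among ability, speed, and visual engagement, and the NBF-LT and NBF-QR variants would replace the constant engagement term with a linear or quadratic function of item position.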
This modeling approach permits the evaluation of trade-offs among responding accuracy, working speed, and visual engagement, as reflected by the person-domain model.

Third, the three-way joint model was fit to the data across the different experimental conditions. Behavioral pattern differences across conditions were thus uncovered, along with gaps in the item parameter estimates. The results show that pre-knowledge had large effects on the item characteristics. In addition, the associations among the person-side behavioral constructs (e.g., latent ability, working speed, and visual engagement) were greatly affected by pre-knowledge. With pre-knowledge of the test items, test-takers' ability estimates were positively correlated with their working speed. In contrast, ability estimates were negatively associated with test-takers' visual engagement levels, and working speeds were negatively correlated with visual engagement levels. One thing to note is that when test-takers had no access to test-preparation materials, their ability estimates were correlated neither with their working speed nor with their visual engagement levels. This is of interest because testing companies could potentially identify suspicious aberrant test-takers by matching their behavioral characteristics against these findings.

Lastly, representative data mining methods were utilized to classify different types of test-takers using multimodal data, including estimates from the joint modeling described above. These newer methods are well suited to identifying various test-taking behaviors, as they can incorporate vast amounts of data coming from numerous rich sources (e.g., process data, biometric data, and psychometric data). This point is particularly salient as many testing administrations are moving away from pencil-and-paper assessments toward computer-based environments.
The findings from this study showed that the data mining methods investigated here, especially the supervised methods, gave relatively high detection rates (sensitivity) compared with traditional methods. These methods were able to flag aberrant test-takers who had pre-knowledge of the test items without incorrectly classifying normally behaved test-takers as aberrant.

The current study successfully integrated biometric and psychometric information with various machine learning methods to classify different types of responding behaviors. To better understand individual differences in test engagement during problem-solving, three models were proposed that reveal unique insights into test-takers' problem-solving patterns. The proposed joint-modeling approach then marries measurement models from the field of psychometrics with data mining methods from the field of machine learning. Additionally, this method could be used in any scenario in which a researcher believes that biometric and psychometric information may be essential for classifying various behaviors, not only in educational testing. Yet there were limitations in the application of this method in the current study, which are discussed in more detail below.

5.0.1 Limitations of the Current Work

All the proposed models were estimated within a Bayesian framework using an MCMC algorithm. This estimation procedure appeared to be effective at recovering item and person parameters. However, a more comprehensive simulation study would need to be conducted to investigate conditions found in practice. Also, as with the majority of studies, the findings of this study have to be seen in light of some limitations. Conclusions about the manifested test-takers'
behavioral patterns across the different experimental conditions were based solely on the current research design, which is subject to several limiting factors, such as the sample size and the number of test items. Therefore, practitioners and applied researchers should be cautious about which findings can properly be generalized to industry practice.

An additional limitation of applying the results of this study is that the proposed models require collecting gaze data via eye-tracking devices, which are expensive and difficult to use. Nevertheless, the proposed models could be widely applied to analyze multimodal data, including eye-tracking data, to evaluate students' task performance in a technology-enhanced, simulation-based testing system.

To take advantage of supervised methods for classifying different types of responding behaviors, true class-membership labels need to be created in the first place. For practical use, testing companies need to know the true labels of the different types of responding behaviors, based on a thorough investigation, in order to build a blacklist with which to train the models. To overcome this shortcoming, traditional and unsupervised learning methods could be used to build preliminary labels. In the end, all the results could be taken into account to make final decisions about test behaviors.

5.0.2 Recommendations for Future Directions

Several model extensions could be considered beyond those presented here. An interesting next elaboration would be to extend the three-way joint model by incorporating finite mixtures. This is vital for educational and psychological testing, where test-takers show different problem-solving behaviors, such as carelessness and copy-cheating, or adopt distinct strategies in different learning groups (e.g., male/female or low-/high-achieving students).
The latent groups uncovered by such an analysis could promote better clustering of the different types of test-takers with different behavioral patterns. A second elaboration is to develop response models for polytomous or graded responses; this is important for psychological testing, where items are Likert-scaled. Another methodological extension would be to carry out a sensitivity analysis to measure the impact of various prior distributions on parameter estimation for the proposed models.

In addition, data mining methods are able to work with complex datasets with large sample sizes and high-dimensional features, but they can also be surprisingly useful when the sample size is small compared to the number of features. In this latter case, traditional parametric methods may be useless or may yield unstable parameter estimates because of the limited sample size. With the development of TELS, data mining methods can play an important role in analyzing such data with high efficiency, as shown in this study.

To classify various test-taking behaviors with high accuracy, numerous sources of information about test-takers, including information originating from biometric technologies, need to be aggregated using highly proficient computational methods such as cloud computing. Also, other types of biometric information could be integrated into the current modeling framework, such as electroencephalography (EEG), facial expression recognition, and body and body-part movement.

While this thesis was being written, the COVID-19 virus was spreading rapidly around the world. As a consequence, many schools decided to switch to online instruction and learning, which results in an increasing need for an online protocol platform to secure online-delivered exams. Thus, it is essential to continuously develop new measures and tools that integrate all kinds of psychometric and biometric information to secure exams rendered in different forms.
Mislevy (2016) showed that new types of educational assessments would include psychometric models, biometrics, machine learning, and data mining methods utilizing a high-effectiveness computation system. Through methodological examinations and investigations utilizing empirical data, the signal-to-noise ratio (SNR) could be improved. This implies that grouping accuracy could be improved, gaining sensitivity to nuanced behavioral patterns.

Obviously, reality is significantly more unpredictable than any model could sufficiently capture. A high-efficiency modeling structure like that used by many machine-learning methods could be utilized to uncover unexpected, concealed patterns in large amounts of information, letting the information speak for itself. This might be the best route for classifying aberrant test-taker behaviors, which can be very challenging to distinguish from typical test-taking behaviors. However, testing companies also need to be alert to the danger of overextending statistical results to judge a test-taker's suspicious testing behavior without a follow-up panoramic investigation. Put another way, testing companies need to seek specific guidance from experts on how to prevent potentially widespread false positives.

Appendix A: List of Variable Names

Table A.1: Variable names

No.fixation: The number of visual fixation counts.
Res.Time: Response time; the time used by a test-taker to answer an item.
P-value: A measure of item difficulty, which is the proportion of test-takers who answered an item correctly. A high P-value indicates high easiness.
No.Revisits: The number of revisits; the frequency with which saccades move back to a previously viewed area of interest (AOI).
Total.Score: Total score; the number of items answered correctly.
Total.Time: Total time; the total time spent by a test-taker to finish the test.
Total.gaze: Total gaze; the total number of visual fixation counts generated by a test-taker while performing the test.
V.Engmt: Latent visual engagement; an individualized parameter showing the visual engagement level of each test-taker performing the test, estimated from the negative binomial visual engagement model.
Ability: Latent ability; an individualized ability parameter indicating the cognitive ability to decode questions, estimated from the one-parameter logistic model.
Speed: Latent speediness; a parameter representing how fast a test-taker works on his/her test, estimated from the lognormal response time model.
WTAS: Westside Test Anxiety Scale.
Extraversion: One of the Big Five personality traits; scores are calculated by averaging items 1 and 6 of the Ten-Item Personality Inventory (TIPI).
Agreeableness: One of the Big Five personality traits; scores are calculated by averaging items 2 and 7 of the TIPI.
Conscientious: One of the Big Five personality traits; scores are calculated by averaging items 3 and 8 of the TIPI.
Emotional: One of the Big Five personality traits; scores are calculated by averaging items 4 and 9 of the TIPI.
Openness: One of the Big Five personality traits; scores are calculated by averaging items 5 and 10 of the TIPI.
Appendix B: Summary Statistics of All the Variables

Table B.1: Summary Statistics of All the Variables

Variable (Mean SD Med per condition): Condition 1 (N=93) | Condition 2 (N=98) | Condition 3 (N=107)

No.fixation.1   24.6 5.4 25 | 22.3 10 20 | 20.2 9 18
No.fixation.2   47.4 6.7 47 | 21.9 10.4 20.5 | 17 9.3 15
No.fixation.3   45.1 8.9 43 | 27.3 16.1 24 | 32.3 28.8 18
No.fixation.4   72.4 31.3 62 | 48.9 42.5 37.5 | 51.9 47.5 28
No.fixation.5   117.7 15 122 | 61.1 38.4 51 | 49.9 45 39
No.fixation.6   83.4 22.1 80 | 43.3 29 35 | 36.8 31.6 28
No.fixation.7   55.8 14.4 56 | 35.3 29.7 28 | 32.3 22.7 26
No.fixation.8   195.9 18.8 197 | 84.5 56.1 75 | 55.8 68.9 35
No.fixation.9   193 34.6 182 | 83.4 66.2 64.5 | 48.6 45.8 33
No.fixation.10  226.4 25.9 228 | 75.7 71.2 51.5 | 53.5 50.5 35
Res.Time.1      7.3 2 7.9 | 6.8 3.7 5.8 | 5.8 2.7 5.2
Res.Time.2      14.3 2.2 14.3 | 6.9 3.7 5.7 | 5 3.2 4.1
Res.Time.3      13.7 3.1 12.9 | 8.5 5.4 7.1 | 9.3 8.6 5.2
Res.Time.4      22.1 9.8 18.6 | 15.1 14 10.6 | 15.1 14.4 7.9
Res.Time.5      36 4.6 37 | 18.8 12.4 14.6 | 14.7 13.1 10.9
Res.Time.6      23 7.1 21.3 | 12.7 8.7 10.2 | 10.5 9.4 7.6
Res.Time.7      15.1 4.2 14.6 | 10.9 9.9 8.5 | 9.8 7.6 7.4
Res.Time.8      54 4.2 55 | 23.6 16.7 20.1 | 14.9 18.9 8.9
Res.Time.9      53.7 9.4 51 | 22.9 19.4 17 | 12.6 12.7 7.8
Res.Time.10     63.4 6.8 64.3 | 20.8 20.6 12 | 14.2 14.8 8.8
P-value.1       0.7 0.5 1 | 0.6 0.5 1 | 1 0.1 1
P-value.2       0.6 0.5 1 | 0.8 0.4 1 | 1 0.2 1
P-value.3       0.6 0.5 1 | 0.6 0.5 1 | 0.8 0.4 1
P-value.4       0.4 0.5 0 | 0.5 0.5 0 | 0.8 0.4 1
P-value.5       0.3 0.5 0 | 0.4 0.5 0 | 0.8 0.4 1
P-value.6       0.5 0.5 1 | 0.6 0.5 1 | 0.8 0.4 1
P-value.7       0.4 0.5 0 | 0.3 0.5 0 | 0.7 0.5 1
P-value.8       0.3 0.5 0 | 0.4 0.5 0 | 0.9 0.3 1
P-value.9       0.4 0.5 0 | 0.3 0.5 0 | 0.5 0.5 1
P-value.10      0.5 0.5 1 | 0.6 0.5 1 | 0.6 0.5 1
No.Revisits.1   5.2 3.1 5 | 3.7 3.4 3 | 3.5 2.4 3
No.Revisits.2   13.9 6 14 | 3.6 3.4 3 | 2.9 2.2 3
No.Revisits.3   10.9 4.3 10 | 3.5 2.9 3 | 5.2 6.1 3
No.Revisits.4   16.6 7.9 16 | 8.1 8.1 5.5 | 9.7 10.8 5
No.Revisits.5   20.3 8 19 | 11.2 8.4 9 | 9.1 12.2 6
No.Revisits.6   28.5 10.3 28 | 7.6 9.5 6 | 6.4 5.4 5
No.Revisits.7   19 8.8 19 | 6.2 7.4 4 | 5.2 5.2 4
No.Revisits.8   60.2 17 58 | 14.1 13.2 10 | 9.8 17.7 5
No.Revisits.9   52 19.5 51 | 16.8 13.4 14.5 | 10.9 11.9 8
No.Revisits.10  60.8 24 55 | 11.8 11.1 8.5 | 9.2 10.7 6
Total.Score     10.2 3.3 10 | 9.9 3.4 10 | 15.9 2.6 16
Total.Time      608 81.7 599.5 | 276.6 166.1 250 | 197.6 143.8 147.3
Total.gaze      2117.5 229.1 2142 | 949.4 514.4 841.5 | 697.2 430.2 560
V.Engmt         0 0.1 0 | 0 0.4 0 | 0 0.5 0.1
Ability         0 0.5 0 | 0 0.7 0 | 0 1.1 0
Speed           0 0.1 0 | 0 0.5 0.1 | 0 0.5 0.1
Anxiety         3.4 1 4 | 3 1.1 3 | 2.8 1.1 3
WTAS1           2.7 1.2 3 | 2.7 1.1 3 | 2.7 1.1 3
WTAS2           3.4 1.2 4 | 3.4 1.2 3 | 3.5 1.2 4
WTAS3           2.9 1.1 3 | 2.8 1.2 3 | 2.9 1.1 3
WTAS4           2.6 1.2 2 | 2.5 1.1 2 | 2.7 1 3
WTAS5           2.8 1.1 3 | 2.8 1 3 | 2.8 1 3
WTAS6           2.3 1.2 2 | 2.3 1.1 2 | 2.3 1.1 2
WTAS7           2.4 1.3 2 | 2.5 1.2 2 | 2.5 1.1 2
WTAS8           2.9 1.4 3 | 3 1.1 3 | 3.2 1.2 3
WTAS9           3.9 1.2 4 | 3.8 1.2 4 | 3.9 1.1 4
WTAS10          2.6 1.2 3 | 2.8 1.2 3 | 2.4 1.2 2
Extraversion    4.2 1.5 4 | 4.2 1.5 4 | 4.5 1.5 4.5
Agreeableness   4.7 1.3 5 | 5 1.2 5 | 4.9 1.1 5
Conscientious   5.1 1.3 5.5 | 5.4 1.2 5.5 | 5.4 1.2 5.5
Emotional       3.6 1 3.5 | 3.7 1 3.5 | 3.9 0.9 4
Openness        5 1.3 5 | 5.5 1 5.5 | 5.2 1.1 5.5

Note: SD represents standard deviation; Med represents median.

References

Ackerman, P. L., & Kanfer, R. (2009). Test length and cognitive fatigue: An empirical examination of effects on performance and test-taker reactions. Journal of Experimental Psychology: Applied, 15, 163-168.
Ackerman, P. L., Kanfer, R., Shapiro, S. W., Newton, S., & Beier, M. E. (2010). Cognitive fatigue during testing: An examination of trait, time-on-task, and strategy influences. Human Performance, 23, 381-402.
Anderson, D., & Burnham, K. (2002). Avoiding pitfalls when using information-theoretic methods. The Journal of Wildlife Management, 66, 912-918.
Angoff, W. H. (1974). The development of statistical indices for detecting cheaters. Journal of the American Statistical Association, 69(345), 44-49.
Barbato, G., della Monica, C., Costanzo, A., & De Padova, V. (2012). Dopamine activation in neuroticism as measured by spontaneous eye blink rate. Physiology & Behavior, 5(2), 332-336.
Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Machine Learning, 36(2), 105-139.
Bay, L. (1995). Detection of cheating on multiple-choice examinations. In Annual Meeting of the American Educational Research Association, San Francisco, CA.
Belleza, F. S., & Belleza, S. F. (1989). Detection of cheating on multiple-choice tests by using error-similarity analysis. Teaching of Psychology, 16(3), 151-155.
Belov, D. I., & Armstrong, R. D. (2010). Automatic detection of answer copying via Kullback-Leibler divergence and K-index. Applied Psychological Measurement, 34(6), 379-392.
Benedetto, S., Pedrotti, M., Minin, L., Baccino, T., Re, A., & Montanari, R. (2011). Driver workload and eye blink duration. Transportation Research Part F: Traffic Psychology and Behaviour, 14(3), 199-208.
Berkhin, P. (2006). A survey of clustering data mining techniques. In J. Kogan & T. M. Nicholas C. (Eds.), Grouping multidimensional data (pp. 25-71). Berlin, Germany: Springer.
Bishop, S., & Egan, K. (2016). Detecting erasures and unusual gain scores. Handbook of Quantitative Methods for Detecting Cheating on Tests, 193.
Bishop, S., Liassou, D., Bulut, O., & Seo, D. G. (2011). Modeling erasure behavior. In Annual Meeting of the National Council on Measurement in Education, New Orleans, LA.
Blanchard, H. E., & Iran-Nejad, A. (1987). Comprehension processes and eye movement patterns in the reading of surprise-ending stories. Discourse Processes, 10, 127-138.
Bollen, K. A., & Curran, P. J. (2006). Latent curve models: A structural equation perspective. Hoboken, NJ: Wiley.
Bolsinova, M., De Boeck, P., & Tijmstra, J. (2017). Modelling conditional dependence between response time and accuracy. Psychometrika, 82, 1126-1148.
Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 27, 395-414.
Booth, R. W., & Weger, U. W. (2013). The function of regressions in reading: Backward eye movements allow rereading. Memory & Cognition, 41, 82-97.
Born, S., & Kerzel, D. (1999). Computer interface evaluation using eye movements: Methods and constructs. International Journal of Industrial Ergonomics, 24(6), 631-645.
Born, S., & Kerzel, D. (2008a). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3), 372-382.
Born, S., & Kerzel, D. (2008b). Influence of target and distractor contrast on the remote distractor effect. Vision Research, 48(28), 2805-2816.
Boulesteix, A. L., Strobl, C., Augustin, T., & Daumer, M. (2008). Evaluating microarray-based classifiers: An overview, 6, 77-97.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24, 123-140.
Breiman, L. (1998). Arcing classifier (with discussion and a rejoinder by the author). The Annals of Statistics, 26(1), 801-849.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5-32.
Bullinaria, J. A. (2004). Introduction to neural networks. In M. N. Jos & M. F. Santos (Eds.), School of Computer Science (pp. 512-523). Birmingham, UK: Springer.
Cannell, J. J. (1988). Nationally normed elementary achievement testing in America's public schools: How all 50 states are above the national average. Educational Measurement: Issues and Practice, 7(2), 5-9.
Celebi, M. E., Kingravi, H. A., & Vela, P. A. (2013). A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Systems with Applications, 40(1), 200-210.
Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Routledge.
Cizek, G. J., & Wollack, J. A. (2017). Handbook of quantitative methods for detecting cheating on tests. New York, NY: Routledge.
Cochran, W. G. (1952). The χ2 test of goodness of fit. The Annals of Mathematical Statistics, 315-345.
Cody, R. P. (1985). Statistical analysis of examinations to detect cheating. Academic Medicine, 60(2), 136-137.
Colzato, L. S., Slagter, H. A., van den Wildenberg, W. P., & Hommel, B. (2009). Closing one's eyes to reality: Evidence for a dopaminergic basis of psychoticism from spontaneous eye blink rates. Personality and Individual Differences, 46(3), 377-380.
Cox, D. R., & Hinkley, D. V. (1974). Theoretical statistics. London: Chapman & Hall.
Coëffé, C., & O'Regan, J. K. (1987). Reducing the influence of non-target stimuli on saccade accuracy: Predictability and latency effects. Vision Research, 27, 227-240.
Crawford, C. C. (1930). Dishonesty in objective tests. The School Review, 38(10), 776-781.
Dai, Y. (2013). A mixture Rasch model with a covariate: A simulation study via Bayesian Markov chain Monte Carlo estimation. Applied Psychological Measurement, 37(5), 375-396.
De Boeck, P., Chen, H., & Davison, M. (2017). Spontaneous and imposed speed of cognitive test responses. British Journal of Mathematical and Statistical Psychology, 70, 225-237.
Dickenson, H. F. (1945). Identical errors and deception. The Journal of Educational Research, 38(7), 534-542.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In B. Dash & S. Subudhi (Eds.), International workshop on multiple classifier systems (pp. 1-15). Berlin, Germany: Springer.
Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3), 103-130.
Drasgow, F., Levine, M. V., & Williams, E. A. (1985). Appropriateness measurement with polychotomous item response models and standardized indices. British Journal of Mathematical and Statistical Psychology, 38(1), 67-86.
Drew, G. C. (1951). Variations in reflex blink-rate during visual-motor tasks. Quarterly Journal of Experimental Psychology, 3(2), 73-88.
Dubes, R. C., & Jain, A. K. (1988). Algorithms for clustering data. Upper Saddle River, NJ: Prentice-Hall.
Díaz-Uriarte, R., & De Andres, S. A. (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7, 1471-2105.
Everitt, B. S. (1981). A Monte Carlo investigation of the likelihood ratio test for the number of components in a mixture of normal distributions. Multivariate Behavioral Research, 16, 171-180.
Fairclough, S. H., & Venables, L. (2006). Prediction of subjective states from psychophysiology: A multivariate approach. Biological Psychology, 71(1), 100-110.
Ferrando, P. J., & Lorenzo-Seva, U. (2007). An item response theory model for incorporating response time data in binary personality items. Applied Psychological Measurement, 31, 525-543.
Fix, E., & Hodges, J. L. (1951). Discriminatory analysis-nonparametric discrimination: Consistency properties. International Statistical Review, 3, 238-247.
Fossey, W. A. (2017). An evaluation of clustering algorithms for modeling game-based assessment work processes (PhD thesis).
Fox, J.-P., Entink, R. K., & Avetisyan, M. (2014). Compensatory and noncompensatory multidimensional randomized item response models. British Journal of Mathematical and Statistical Psychology, 67, 133-152.
Fox, J.-P., & Marianti, S. (2016). Joint modeling of ability and differential speed using responses and response times. Multivariate Behavioral Research, 51, 540-553.
Fox, J.-P., & Marianti, S. (2017a). Person-fit statistics for joint models for accuracy and speed. Journal of Educational Measurement, 54(2), 243-262.
Fox, J.-P., & Marianti, S. (2017b). Person-fit statistics for joint models for accuracy and speed. Journal of Educational Measurement, 54(2), 243-262.
Frary, R. B., Tideman, T. N., & Watts, T. M. (1977). Indices of cheating on multiple-choice tests. Journal of Educational Statistics, 2(4), 235-256.
Fukunaga, K. (2013). Introduction to statistical pattern recognition. Cambridge, MA: Academic Press.
Furey, T. S., Cristianini, N., Duffy, N., Bednarski, & Haussler, D. (2000).
Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906-914. Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis. New York: Chapman & Hall. Gelman, A., Meng, X. L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Journal of Educational and Behavioral Statistics, 6, 733-760. Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review, 9, 139-150. 143 Hanson, B. A., Harris, D. J., & Brennan, R. L. (1987). A comparison of several statistical methods for examining allegations of copying. American College Testing Program Iowa City. Hartigan, J. A., & Wong, M. A. (2012). A k-means clustering algorithm. Applied Statis- tics, 28, 100-108. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. New York. NY: Springer. He, Q., & von Davier, M. (2015). Identifying feature sequences from process data in problem-solving items with n-grams. In Quantitative psychology research (p. 173- 190). Springer, Cham. Hendrawan, I., Glas, C. A., & Meijer, R. R. (2005). The effect of person misfit on classification decisions. Applied Psychological Measurement, 29(1), 26-44. Hess, E. H. (1965). Attitude and pupil size. Scientific American, 212(4), 46-55. Hess, E. H., & Polt, J. M. (1964). upil size in relation to mental activity during simple problem-solving. Science, 143(3611), 1190-1192. Ho, A. D. (2008). The problem with proficiency: Limitations of statistics and policy under no child left behind. Educational Researcher, 37(6), 351-360. Holland, P. W. (1996a). Assessing unusual agreement between the incorrect answers of two examinees using the k-index: Statistical theory and empirical support. ETS Research Report Series, 1, 141. Holland, P. W. (1996b). 
Assessing unusual agreement between the incorrect answers of two examinees using the K-index: Statistical theory and empirical support. ETS Research Report Series, 1, 1-41. 144 Holmqvist, K., Nystrm, M., Andersson, R., Dewhurst, R., Jarodzka, H., & Van de Weijer, J. (2011). Eye tracking: A comprehensive guide to methods and measures. OUP Oxford. Huang, Y. S., & Suen, C. Y. (1995). A method of combining multiple experts for the recognition of unconstrained handwritten numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 90-94. Huang, Z., Chen, H., Hsu, C. J., Chen, W. H., & Wu, S. (2004). Credit rating analysis with support vector machines and neural networks: a market comparative study. Decision Support Systems, 37, 543-558. Huber, P. J. (1991). Robust regression: asymptotics, conjectures and monte carlo. The Annals of Statistics, 1, 799-821. Imandoust, S. B., & Bolandraftar, M. (2013). Application of k-nearest neighbor (knn) approach for predicting economic events: Theoretical background. International Journal of Engineering Research and Applications, 3, 605-610. Inhoff, A. W., Greenberg, S. N., Solomon, M., & Wang, C. A. (2009). Word inte- gration and regression programming during reading: A test of the ez reader 10 model. Journal of Experimental Psychology: Human Perception and Performance, 35, 1571-1584. Jacob, B. A., & Levitt, S. D. (2003). Rotten apples: An investigation of the prevalence and predictors of teacher cheating. The Quarterly Journal of Economics, 118(3), 843-877. Jacob, B. A., & Levitt, S. D. (2004). To catch a cheat: the pressures of accountability may encourage school personnel to doctor the results from high-stakes tests. here?s 145 how to stop them. Education Next, 1(4), 68. James, G., Witten, D., Hastie, T., & Tibshirani. (2013). An introduction to statistical learning. New York, NY: Springer. Just, M. A., & Carpenter, P. A. (1993). The intensity dimension of thought: pupillo- metric indices of sentence processing. 
Canadian Journal of Experimental Psychol- ogy/Revue Canadienne de Psychologie Exprimentale, 47(2), 310-333. Justice, M., & Lankford, C. (2002). Pilot findings. Communication Disorders Quarterly, 1, 11-21. Karabatsos, G. (2003). Comparing the aberrant response detection performance of thirty- six person-fit statistics. Applied Measurement in Education, 16(4), 277-298. Kennard, D. W., & Glaser, G. H. (1964). An analysis of eyelid movements. Journal of Nervous and Mental Disease, 139(1), 31-48. Kerr, D., & Chung, G. K. (2012). Identifying key features of student performance in educational video games and simulations through cluster analysis. Journal of Edu- cational Data Mining, 4, 144-182. Khan, S. S., & Ahmad, A. (2004). Cluster center initialization algorithm for k-means clustering. Pattern Recognition Letters, 25(11), 1293-1302. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 59-69. Kuhn, M. (2017). caret: Classification and regression training. R package version 6.0-71. [Computer software manual]. Retrieved from https://CRAN.R-project.org/ package=som Kuncheva, L. I., Bezdek, J. C., & Duin, R. P. (2001). Decision templates for multiple 146 classifier fusion: an experimental comparison. Pattern Recognition, 34(2), 299- 314. Kuperman, V., & Van Dyke, J. A. (2011). Effects of individual differences in verbal skills on eye-movement patterns during sentence reading. Journal of Memory and Language, 65(1), 42-73. Lattin, J. M., Carroll, J. D., & Green, P. E. (2003). Analyzing multivariate data. Pacific Grove, CA: Thomson Brooks/Cole. Lee, S. P., Badler, J. B., & Badler, N. I. (2001). Eyes alive. ACM Transactions on Graphics (TOG), 21(3), 637-644. Levy, M. R. J. . S. S., R. (2009). Posterior predictive model checking for multidimension- ality in item response theory. applied psychological measurement. Applied Psycho- logical Measurement, 33, 519-537. Li, C. S. (2011). 
Cluster center initialization method for k-means algorithm over data sets with two clusters. Procedia Engineering, 24, 214. Linacre, J., & Wright, B. (1994). Chi-square fit statistics. Rasch Measurement Transac- tions, 8(2), 350. Loevinger, J. (1948). The technic of homogeneous tests compared with some aspects of ?scale analysis? and factor analysis. Psychological Bulletin, 45, 507. Lord, F. M. (1952). A theory of test scores. Richmond, VA: Psychometric Corporation. Lowenstein, L. I., O. (1962). Models for speed and time-limit tests. In H. Dawson (Ed.), The eye (p. 187-208). Academic Press: New York. Luce, R. D. (1986). Response times: Their role in inferring elementary mental organiza- tion (no. 8). New York: Oxford University Press. 147 Magis, D., Rache, G., & Bland, S. (2012). A didactic presentation of snijderss lz* index of person fit with emphasis on response model selection and ability estimation. Journal of Educational and Behavioral Statistics, 37, 57-81. Man, K., Harring, J., & Sinharay, S. (2019). Use of data mining methods to detect test fraud. Journal of Educational Measurement, 56, 251-279. Man, K., & Harring, J. R. (2019). Negative binomial models for visual fixation counts on test items. Educational and Psychological Measurement, 79(4), 617635. Re- trieved from https://doi.org/10.1177/0013164418824148 doi: 10.1177/ 0013164418824148 Marianti, S., Fox, J. P., Avetisyan, M., Veldkamp, B., & Tijmstra, J. (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behav- ioral Statistics, 39, 426-451. Maris, E. (1993). Additive and multiplicative models for gamma distributed random variables, and their application as psychometric models for response times. Psy- chometrika, 58, 445-469. McLachlan, G. J., Peel, D., & Bean, R. W. (2003). Modelling high-dimensional data by mixtures of factor analyzers. Computational Statistics Data Analysis, 41, 379-388. Meijer, R. R. (1994). 
The number of guttman errors as a simple and powerful person-fit statistic. Applied Psychological Measurement, 18, 311-314. Meijer, R. R. (1997). Person fit and criterion-related validity: An extention of the schmitt, cortina, and whitney study. Applied Psychological Measurement, 21, 99-113. Meijer, R. R., & Sijtsma, K. (1995). Detection of aberrant item response patterns: A review of recent developments. Applied Measurement in Education, 8, 261-272. 148 Meng, X. B., Tao, J., & Chang, H. H. (2015). A conditional joint modeling approach for locally dependent item responses and response times. Applied Psychological Measurement, 52, 1-27. Meyer, D., Leisch, F., & Hornik, K. (2003). The support vector machine under test. Neurocomputing, 55, 169-186. Mislevy, R. J., Corrigan, S., Oranje, A., DiCerbo, K., Bauer, M. I., von Davier, A., & John, M. (2016). Psychometrics and game-based assessment. Technology and Testing: Improving Educational and Psychological Measurement, 23-48. Molenaar, D., Bolsinova, M., Rozsa, S., & De Boeck, P. (2016). Response mixture mod- eling of intraindividual differences in responses and response times to the hungarian wisc-iv block design test. Journal of Intelligence, 4, 10-25. Molenaar, W., & Hoijtink, H. (1990). The many null distributions of person fit indices. Psychometrika, 55(1), 75-106. Morad, Y., Lemberg, H., & Dagan, Y. (2000). Pupillography as an objective indicator of fatigue. Current Eye Research, 21, 535-542. Motter, B. C., & Belky, E. J. (1998). The zone of focal attention during active visual search. Vision Research, 38(7), 1007-1022. Mouselimis, L. (2018). Kernelknn: Kernel k nearest neighbors. R package version 1.0.8. [Computer software manual]. Retrieved from https://CRAN.R-project.org/ package=KernelKnn Mroch, A. A., Lu, Y., Huang, C. Y., & Harris, D. J. (2014). Patterns of examinee erasure behavior for a large-scale assessment. test fraud: Statistical detection and method- ology. 
Test fraud: Statistical Detection and Methodology, 1, 137-148. 149 Mueller, L., Zhang, Y., & Ferrara, S. (2016). What have we learned? In Handbook of quantitative methods for detecting cheating on tests (p. 373-390). Routledge, New York: NY. Nering, M. L., & Meijer, R. R. (1998). A comparison of the person response function and the lz person-fit statistic. Applied Psychological Measurement, 22(1), 53-69. Patz, R. J., & Junker, B. W. (1999). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24(4), 342-366. Retrieved from https://doi.org/ 10.3102/10769986024004342 doi: 10.3102/10769986024004342 Plummer, M. (2015). Jags: Just another gibbs sampler. Retrieved from http://mcmc -jags.sourceforge.net/ (version 4.0.0) Ponsoda, V., Scott, D., & Findlay, J. M. (1995). A probability vector and transition matrix analysis of eye movements during visual search. Acta Psychologica, 88(2), 167-185. Poole, A., Ball, L. J., & Phillips, P. (2004). In search of salience: A response-time and eye-movement analysis of bookmark recognition. In S. Fincher, P. Markopoulos, D. Moore, & R. Ruddle (Eds.), People and computers xviii design for life. Lon- don:Springer. Porter, G., Troscianko, T., & Gilchrist, I. D. (2007). Effort during visual search and counting: Insights from pupillometry. The Quarterly Journal of Experimental Psy- chology, 60(2), 211-229. Primoli, V., Liassou, D., Bishop, N. S., & Nhouyvanisvong, A. (2011). Erasure de- scriptive statistics and covariates. In Annual Meeting of the National Council on 150 Measurement in Education. New Orleans, LA. Qian, Y., Yao, F., & Jia, S. (2009). Band selection for hyperspectral imagery using affinity propagation. IET Computer Vision, 3, 213-222. Qualls, A. L. (2001). Can knowledge of erasure behavior be used as an indicator of possible cheating? Educational Measurement: Issues and Practice, 20(1), 9-16. Rasch, G. (1960). 
Studies in mathematical psychology: I. probabilistic models for some intelligence and attainment tests. Oxford, England: Nielsen Lydiche. Rayner, K., & Liversedge, S. P. (2011). Linguistic and cognitive influences on eye movements during reading. In S. P. Liversedge, I. D. Gilchrist, & S. Everling (Eds.), The oxford handbook of eye movements (p. 751-766). New York, NY, US: Oxford University Press. Rayner, K., Murphy, L. A., Henderson, J. M., & Pollatsek, A. (1989). Selective attentional dyslexia. Cognitive Neuropsychology, 6, 357-378. Reise, S. P. (1990). A comparison of item-and person-fit methods of assessing model-data fit in IRT. Applied Psychological Measurement, 14(2), 127-137. Ripley, B. (2018). tree: Classification and regression trees. R package version 1.0-37. [Computer software manual]. Retrieved from https://CRAN.R-project.org/ package=tree Romero, C., Gonzlez, P., Ventura, S., Del Jess, M. J., & Herrera, F. (2009). Evolution- ary algorithms for subgroup discovery in e-learning: A practical application using moodle data. Expert Systems with Applications, 36, 632-1644. Roskam, E. E. (1997). Models for speed and time-limit tests. In W. J. van der Linden & R. K. Hambleton (Eds.), Handbook of modern item response theory (p. 187-208). 151 New York: Springer. Roy-Charland, A., Saint-Aubin, J., Klein, R. M., & Lawrence, M. (2006). Eye move- ments as direct tests of the go model for the missing-letter effect. Perception Psy- chophysics, 3, 324337. Rubin, D. B. (1996). Comment: On posterior predictive p-values. Statistica Sinica, 6, 787-792. Saupe, J. L. (1960). An empirical model for the corroboration of suspected cheating on multiple-choice tests. Educational and Psychological Measurement, 20(3), 475- 489. Schnipke, D. L., & Scrams, D. J. (1997). Modeling item response times with a two-state mixture model: a new method of measuring speededness. Journal of Educational Measurement, 34, 213-232. Schoenig, R., Geraets, J., & Mulkey, J. (2016). 
The test security framework: Why different tests need different test security requirements. Retrieved from http://mcmc-jags .sourceforge.net/ Segal, M. R. (2004). Machine learning benchmarks and random forest regression. Divi- sion of Biostatistics, University of California, San Francisco, CA. Shepard, L. A. (1990). Inflated test score gains: Is the problem old norms or teaching the test? Educational Measurement: Issues and Practice, 9(3), 15-22. Shepherd, M., Findlay, J. M., & Hockey, R. J. (1986). The relationship between eye movements and spatial attention. The Quarterly Journal of Experimental Psychol- ogy Section A, 38(3), 475-491. Sijtsma, K. (1986). A coefficient of deviance of response patterns. Kwantitatieve Metho- 152 den, 7(22), 131-145. Sijtsma, K., & Meijer, R. R. (1992). A method for investigating the intersection of item response functions in mokken?s nonparametric IRT model. Applied Psychological Measurement, 16(2), 149-157. Sinharay, S. (2017). Detection of item preknowledge using likelihood ratio test and score test. Journal of Educational and Behavioral Statistics, 42, 46-68. Sinharay, S. (2018). Are the nonparametric person-fit statistics more powerful than their parametric counterparts? revisiting the simulations in karabatsos (2003). Applied Measurement in Education, 31, 98-98. Sinharay, S., & Johnson, M. S. (2017). Three new methods for analysis of answer changes. Educational and Psychological Measurement, 77(1), 54-81. Sinharay, S., Johnson, M. S., & Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Applied Psychological Measurement, 30(4), 298-321. Retrieved from https://doi.org/10.1177/0146621605285517 doi: 10.1177/ 0146621605285517 Skorupski, W. P., & Egan, K. (2011). Detecting cheating through the use of hierarchical growth models. In Annual Meeting of the National Council on Measurement in Education. New Orleans, LA. Slakter, M. J. (1968). The effect of guessing strategy on objective test scores. 
Journal of Educational Measurement, 5, 217-222. Snijders, T. A. B. (2001). Asymptotic null distribution of person fit statistics with esti- mated person parameter. Psychometrika, 66, 331-342. Sotaridona, L. S., & Meijer, R. R. (2002a). Statistical properties of the K-index for 153 detecting answer copying. Journal of Educational Measurement, 39(2), 115-132. Sotaridona, L. S., & Meijer, R. R. (2002b). Statistical properties of the k-index for detecting answer copying. Journal of Educational Measurement, 39(2), 115-132. Sotaridona, L. S., & Meijer, R. R. (2003). Two new statistics to detect answer copying. Journal of Educational Measurement, 40(1), 53-69. Sotaridona, L. S., van der Linden, W. J., & Meijer, R. R. (2006a). Detecting answer copy- ing using the kappa statistic. Applied Psychological Measurement, 30, 412431. Sotaridona, L. S., van der Linden, W. J., & Meijer, R. R. (2006b). Statistical methods for the detection of answer copying on achievement tests. Applied Psychological Measurement, 41(5), 361-37. Steinley, D. (1985). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59, 1-34. Stonehill, R. M. (1988). Norm-referenced test gains may be real: A response to john jacob cannell. Educational Measurement: Issues and Practice, 7(2), 23-24. Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: ratio- nale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14, 323-333. Su, Y. S., & Yajima, M. (2015). R2jags: Using R to run JAGS. (version 0.5) Tatler, B. W., & Vincent, B. T. (2008). Eyes alive. Journal of Eye Movement Research, 2(2), 37-44. Tatsuoka, K. K., & Tatsuoka, M. M. (1983). Spotting erroneous rules of operation by the individual consistency index. Journal of Educational Measurement, 20, 221-230. Tendeiro, J. N., Meijer, R. R., & Niessen, A. S. M. (2016). 
Perfit: An r package for 154 person-fit analysis in IRT. Journal of Statistical Software, 74, 1-27. Therneau, . A. B., M.T. (2018). Rpart: Recursive partitioning and regression trees. (version 4.1-13) Thissen, D. (1983). Timed testing: An approach using item response theory. In D. J. Weiss (Ed.), New horizons in testing: Latent trait test theory and computerized adaptive testing (p. 179-203). New York: Academic Press. Thomas, S. L. (2016). The use of data mining techniques to detect cheating. In The six- teenth annual maryland conference: Data analytics and psychometrics: Informing assessment practices. College Park, MD. Titterington, D. M., & Makov, U. E. (1985). Statistical analysis of finite mixture distri- butions. Hoboken, NJ: John Wiley Sons. Trabin, T. E., & Weiss, D. J. (1983). The person response curve: Fit of individuals to item response theory models. In New horizons in testing (p. 83-108). Elsevier. van der Linden, W., Scrams, D., & Schnipke, D. L. (1999). Using response-time con- straints to control speedness in computerized adaptive testing. Applied Psycholog- ical Measurement, 23, 195-210. van der Linden, W. J. (2006a). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287-308. van der Linden, W. J. (2006b). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181-204. van der Linden, W. J. (2006c). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181-204. van der Linden, W. J., & Guo, F. (2008). Bayesian procedures for identifying aberrant 155 response-time patterns in adaptive testing. Psychometrika, 73, 365384. van der Linden, W. J., & Sotaridona, L. (2006). Detecting answer copying when the regular response process follows a known response model. Journal of Educational and Behavioral Statistics, 31(3), 283-304. van der Maas, H. L., & Jansen, B. R. (2003). 
What response times tell of childrens behavior on the balance scale task. Journal of Experimental Child Psychology, 85, 141-177. van Gerven, P. W. M., Paas, F. G. W. C., van Merrinboer, J. J. G., & Schmidt, H. G. (2002). Cognitive load theory and aging: effects of worked examples on training efficiency. Learning and Instruction, 12(1), 87-105. van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2001). Cusum-based person-fit statis- tics for adpative testing. Journal of Educational and Behavioral Statistics, 26, 199-217. Verhelst, N. D., Verstralen, H. H., & Jansen, M. G. H. (2013). A logistic model for time- limit tests. In Handbook of modern item response theory (p. 169-185). Springer, New York, NY. VinuelaNavarro, V., Erichsen, J. T., Williams, C., & Woodhouse, J. M. (2017). Saccades and fixations in children with delayed reading skills. Ophthalmic and Physiological Optics, 37, 531-541. Vitu, F. (1991). The existence of a center of gravity effect during reading. Vision Re- search, 31, 1289-1313. Volodin, N., & Adams, R. (1995). Identifying and estimating a D-dimensional item response model. In International Objective Measurement Workshop. University of 156 California, Berkeley, California. Wang, T., & Nydick, S. W. (2015). Comparing two algorithms for calibrating the re- stricted non-compensatory multidimensional IRT model. Applied Psychological Measurement, 39, 119-134. Wehrens, R., & Buydens, L. M. (2007). Self-and super-organizing maps in R: the kohonen package. Journal of Statistical Software, 21, 1-19. Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10, 207-244. Wise, S. L., & Kong, X. (2005). Response time effort: A new measure of examinee motivation in computer-based tests. Applied Measurement in Education, 18, 163- 183. Wollack, J. A. (1997). A nominal response model approach for detecting answer copying. 
Applied Psychological Measurement, 21(4), 307-320. Wollack, J. A., & Cizek, G. J. (2016). Section IIa detecting similarity, answer copying, and aberrance. In Handbook of quantitative methods for detecting cheating on tests (p. 39-114). Routledge. Wollack, J. A., & Fremer, J. J. (2013). Introduction: The test security threat. New York: Chapman & Hall. Wright, B. D., & Stone, M. H. (1979). Best test design. Wu, M., Adams, R., Wilson, M., & Haldane, S. (1998). Conquest: Generalized item re- sponse modeling software [computer software and manual]. Camberwell, Victoria: Australian Council for Educational Research. Yan, J. (2016). som: Self-organizing map. R package version 0.3-5.1. [Computer software 157 manual]. Retrieved from https://CRAN.R-project.org/package=som Yoss, R. E., Moyer, N. J., & Hollenhorst, R. W. (1970). Pupil size and spontaneous pupillary waves associated with alertness, drowsiness, and sleep. Neurology, 20, 545-545. Zopluoglu, C. (2016). Classification performance of answer-copying indices under dif- ferent types of IRT models. Applied Psychological Measurement, 40(8), 592-607. Zopluoglu, C., & Davenport, E. C. (2012). The empirical power and type i error rates of the GBT and ? indices in detecting answer copying on multiple-choice tests. Educational and Psychological Measurement, 72(6), 975-1000. Zubin, J., & Steinhauer, S. R. (1983). The metamorphosis of schizophrenia: from chronic- ity to vulnerability. Psychological Medicine, 13, 551-571. 158