ABSTRACT Title of Dissertation: AN EVALUATION OF CLUSTERING ALGORITHMS FOR MODELING GAME-BASED ASSESSMENT WORK PROCESSES W. Austin Fossey, Doctor of Philosophy, 2017 Dissertation directed by: Professor Laura Stapleton and Professor Tracy Sweet, Department of Human Development and Quantitative Methodology Game-based assessments (GBAs) use game design elements to make assessments more engaging for students and capture response data about work processes. GBA response data are often too complex to plan for every potential response pattern, so some researchers have turned to exploratory cluster analysis to classify students’ work processes. This paper identifies the design elements specific to GBAs and investigates how well k-means, self-organizing maps (SOM), and robust clustering using links (ROCK) clustering algorithms group response patterns in prototypical GBA response data. Results from a simulation study are discussed, and a tutorial is provided with recommendations of general considerations and best practices for analyzing GBA data with clustering algorithms. AN EVALUATION OF CLUSTERING ALGORITHMS FOR MODELING GAME- BASED ASSESSMENT WORK PROCESSES by William Austin Fossey Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2017 Advisory Committee: Professor Laura Stapleton, Chair Professor Jeffrey Harring Professor Kent Norman Dr. Andre Rupp Professor Tracy Sweet © Copyright by William Austin Fossey 2017 ii Table of Contents Table of Contents ................................................................................................................ ii List of Tables ...................................................................................................................... v List of Figures ................................................................................................................... vii Chapter 1: Introduction ....................................................................................................... 1 Evidence from Work Processes in GBAs ....................................................................... 4 The Challenge of Interpreting Work Process Data in GBAs .......................................... 6 Data Collection Considerations for Clustering GBA Data ............................................. 8 Study Purpose ............................................................................................................... 12 Chapter 2: Game-Based Assessment ................................................................................ 15 Architectural Game Design Elements ........................................................................... 15 Experiential Game Design Elements ............................................................................ 19 GBAs: Assessments with Construct-Irrelevant Relatedness Elements ........................ 25 The Use of GBAs in Education .................................................................................... 26 Challenges of Using Games for Assessment ................................................................ 30 Designing GBAs with Evidence-Centered Design ....................................................... 33 Measurement Models: Principled versus Empirical Approaches in GBAs .................. 35 Clustering Algorithms ................................................................................................... 41 Chapter 3: Feasibility Study.............................................................................................. 60 Simulated Data for the Feasibility Study ...................................................................... 60 iii GBA Data for the Feasibility Study .............................................................................. 62 Procedure ...................................................................................................................... 66 Performance of the Clustering Algorithms for Datasets S1 and S2.............................. 75 Performance of the Clustering Algorithms for Datasets M1 and M2 ........................... 82 Discussion of the Feasibility Study............................................................................... 87 Chapter 4: Methods for Full Study .................................................................................. 90 Simulation Design Summary ........................................................................................ 92 Data Generation ............................................................................................................ 93 Number of Clusters for the Simulation ......................................................................... 95 Number of Cases in the Simulated Clusters ................................................................. 96 Data Type and Response Probabilities in Simulated Clusters ...................................... 97 Amount of Overlap Between Clusters .......................................................................... 99 Number of Response Variables in the Simulation ...................................................... 102 Methods and Outcome Measures ................................................................................ 104 Tutorial Example ........................................................................................................ 108 Chapter 5: Simulation Study Results .............................................................................. 111 Research Question 1: Kappa Differences by Varying Individual Conditions ............ 114 Research Question 2: Conditions Under Which Kappa is Expected to Exceed 0.80 . 122 Research Question 3: Conditions In Which Algorithms Can Validate Each Other ... 125 Research Question 4: Finding the Number of Clusters with DBI .............................. 127 iv Simulation Study Results Summary ........................................................................... 132 Chapter 6: Tutorial Example Study ................................................................................ 134 Beanstalk Game .......................................................................................................... 135 Beanstalk Student Data ............................................................................................... 137 Step 1: Recode the Data and Select Variables ............................................................ 138 Step 2: Select Clustering Algorithms .......................................................................... 142 Step 3. Plan Cross Validation ..................................................................................... 143 Step 4: Run the Algorithms......................................................................................... 144 Step 5: Validating Across Algorithms ........................................................................ 150 Step 6: Validating Against Holdout Sample and External Criterion .......................... 153 Step 7: Interpreting, Labeling, and Grouping Clusters ............................................... 161 Step 8: Students’ Changes in Cluster Membership Over Time .................................. 164 Tutorial Example Summary ........................................................................................ 168 Chapter 7: Discussion ..................................................................................................... 170 Summary of Results .................................................................................................... 170 Study Limitations ........................................................................................................ 174 Implications................................................................................................................. 177 Future of GBA ............................................................................................................ 180 Appendix A: R Scripts for Full Study............................................................................. 185 References ....................................................................................................................... 196 v List of Tables Table Page 1 S1 Simulated Data Parameters……………………………………………... 61 2a S2 Simulated Data Parameters………………………………………….….. 61 2b Example of S2 Simulated Data (First Six Cases) …………………………. 61 3 Excerpts of Number Jumble Log File Data (First Five Moves)………...…. 64 4 Excerpt of M1 Data (First Six Cases) ……………………………………... 66 5 Excerpt of M2 Simulated Data (First Six Cases) ………………………….. 67 6 Example of a Pivot Table Showing Classification Agreement Between Two Clustering Solutions………………………………………………….. 73 7 S1: True Cluster Recovery by Algorithm…………………………….……. 76 8 S2: True Cluster Recovery by Algorithm………………………………….. 77 9 S1: Cluster Agreement Between Pairs of Algorithms……………………... 77 10 S2: Cluster Agreement Between Algorithms…………………………......... 79 11 DBI Results by Solution for S1 and S2……………………………………. 80 12 M1: Cluster Agreement Between Algorithms in a Two Cluster Solution…. 85 13 M2: Cluster Agreement Between Algorithms in a Two Cluster Solution…. 86 14 M2: Cluster Agreement Between Algorithms in a Three Cluster Solution... 87 15 DBI Results by Solution for M1 and M2…………………………………... 87 16 Total Number of Variables by Number of Clusters………………………... 105 17 Simulation Design of Four Varied Parameters Resulting in 48 Cells…………….. 109 18 Actual Number of Datasets Generated per Cell……………………………. 113 19 Correlations between RMSD and Kappa by Study Cell and Clustering Algorithm…………………………………………………………………... 121 vi 20 Percentage of Solutions with Kappa ≥ 0.80 by Study Cell and Clustering Algorithm…………………………………………………………………... 123 21a Algorithms’ Agreement (Kappa) by k (Level 2-11) ………………………. 152 21b Algorithms’ Fit (DBI) by k (Level 2-11) ………………………………….. 152 vii List of Figures Figure Page 1 Example of the Classic 15 Puzzle………………………………………….. 3 2 Example of a Time-Based Connection Element’s Influence on Game Performance……………………………………………………………….... 19 3 Conceptual Assessment Framework (CAF) Structure……………………... 35 4 Visualization of the Self-Organizing Map Learning Process………………. 55 5 Level Mul_1 from Number Jumble……………………………………….... 63 6 Example of Recoding Sequence of Two-Dimensional Cluster Solutions….. 72 7a Cluster Center Magnitudes by Variable and Clustering Solution in S1……. 81 7b Cluster Center Magnitudes by Variable and Clustering Solution in S2……. 81 8 Mean Kappa Rates by Algorithm for Different Numbers of Clusters k……. 116 9 Mean Kappa Rates by Algorithm for Two Different Average Cluster Sizes Nk…………………………………………..………………………………… 117 10 Mean Kappa Rates by Algorithm for Two Different Variances in the Data’s Cluster Sizes Nk.………………………………………….. ………………… 118 11 Mean Kappa Rates by Algorithm for Three Different Probabilities of Response in the Relevant Variables Defining Clusters……………………... 120 12 Kappa Distributions by Cluster Size, for Each Combination of Response Probability and k Clusters…………………………………………………… 124 13 Percentage of Datasets Correctly Identified, Overestimated, and Underestimated, as Determined by Davies-Bouldin Index (DBI)…………... 129 14 DBI Values for Three Clustering Algorithms when Varying the Number of Requested Clusters from k = 17-23 Clusters………………………………... 132 15 Screenshot from Beanstalk………………………………………………….. 136 16 Screenshot of Beanstalk Log File Dataset.………………………………….. 138 17 DBI Values by Clustering Algorithm and Number of Clusters (k) for Beanstalk Level 3-2…………………………………………………………. 153 viii 18 DBI Comparisons between Training Sample and Holdout Sample………… 156 19 Correct Outcome Rates (Rates of Beating the Level) for the Six Clusters Identified in Beanstalk Level 2-9…………………………………………… 158 20a Characterization of Flower Placement for Students in Cluster 4 on Beanstalk Level 2-9…………………………………………………………. 159 20b One Possible Correct Solution for Cluster 4 in Beanstalk Level 2-9……….. 159 20c The Other Possible Correct Solution for Cluster 4 in Beanstalk Level 2-9…. 159 21 Flower Placement Rates for Cluster 4 in Beanstalk Level 2-9……………… 160 22 Group Transition Map of Students’ Clusters Across 75 Levels of Beanstalk. 167 1 Chapter 1: Introduction While educational games are not new, the field has seen tremendous growth as technology has expanded the capabilities and availability of educational games. This growth mirrors a larger trend in gaming, where blockbuster video game franchises are being joined by “casual games” (games that are played for only a few minutes at a time, often on mobile devices), which are played by 40% of U.S. households (Schleiner, 2001). Though skepticism persists, proponents of educational games observe that games can model learning in an engaging, challenging, and entertaining medium that can complement or enhance traditional learning methods (Teknibaş, 2014), and there is growing research demonstrating the benefits of games for student learning and classroom teaching (e.g., Shapiro, Teknibaş, Schwartz, & Darvasi, 2014). Part of the appeal of educational games is that they provide an environment where students can apply learning in a relevant way, but they can also record data and adapt the experience so that it keeps students engaged. If the right data are collected, games may even provide a view into how students learn (Shapiro, 2014b). As the boundaries between games and learning continue to blur, another movement is gaining momentum in the assessment industry which blurs the lines between learning and assessment. Like educational games, this movement to use assessments as instructional devices exists at both a grassroots level in classrooms and in organized research initiatives like Educational Testing Services’ (ETS) Cognitively Based Assessments of, for, and as Learning (CBAL) project (Bennett, 2010). In this sphere, assessment designers are building assessments which not only measure the 2 outcomes of learning, but which also support instruction and serve as learning tools in their own right. Given the current trends of increasing use of game-based learning and assessments as learning tools, it is not surprising that there is growing interest in game- based assessments (GBAs). Understanding the intersection of games and assessments is important for understanding the context of the issues GBA designers face. An assessment is any instrument that collects evidence from a sample of students’ behavior and provides a quantitative value from which an inference about the students can be made (Crocker & Algina, 2008). The definition of a game is not quite as clear cut, and perspectives on games can vary based on how the games are designed and played (Mislevy et al., 2014; Salen & Zimmerman, 2004). A game can become an assessment if it can yield quantitative data to support valid inferences about students, but making the leap from collecting gameplay data and evaluating the student results can be daunting because even the simplest games require a complex set of actions, decisions, strategizing, and mistakes in gameplay. Consider the simple, classic 15 Puzzle, which is a puzzle where the player slides tiles around to get the numbered tiles into the correct order (see Figure 1). Only one open space is available at any given time, so a player has a maximum of four possible moves at each step of his or her gameplay work process. Although the game is simple, a good player can think strategically about how to slide tiles to position them such that the puzzle can be solved in fewer moves—a goal which served as the test case for early computer search algorithm research (e.g., Brüngger, Marzetta, Fukuda, & Nievergelt, 1999). 3 7 8 9 1 11 14 4 13 3 2 10 15 12 6 5 Figure 1. Example of the classic 15 Puzzle where players must correctly order the tiles. There are an astounding 16!/2 possible configurations for the 15 Puzzle, and the most efficient solution for the hardest initial board state requires 80 moves (Brüngger et al., 1999). If one were to record that gameplay work process, one might imagine a dataset where one simply records the tile that was moved, its original position, and its new position; however, one can go further and create variables recording the sequence of moves, the state of the puzzle board at each move, the time it took for the player to make his or her decision, etc. All of these variables may be useful evidence for evaluating what the player was thinking and how he or she played the game, and these data could serve to distinguish expert players from novices. To add to the complexity of the gameplay data themselves, there is the added question of whether the game (used as an assessment instrument) is designed to provide any evaluation of the gameplay work process. The 15 Puzzle can only inform the player if the puzzle has been solved, but there is no mechanism for reporting how efficiently the player solved that puzzle. Mislevy et al. (2014) noted that the GBA only needs to collect evidence that can be used for assessment inferences—scoring and evaluation can occur outside of the game. This distinction implies that GBAs can take several forms, as is shown in the following section. 4 Evidence from Work Processes in GBAs Mislevy et al. (2014) defined three GBA paradigms: assessment outside the game, assessment inside the game as pre-specified work products, and assessment inside the game using evidence from work processes. This third GBA paradigm uses assessment inside the game, but the data used for the assessment inferences come from data streams that reflect the work processes or gameplay of the students in the game (Shute, 2011). Students make many gameplay choices that manifest as patterns of responses and interactions with the game’s architecture, and those response patterns are collected as evidence for assessment, but it is not necessarily feasible to plan for every possible response in a complex game. To understand the potential of this third paradigm, it is worth considering what a work process represents. Newell and Simon (1972) described a work process arrangement as three parts of a problem: givens, goals, and operations. The givens are the information and resources provided to the student. The goals are the desired end state of the assessment, which may be described in instructions to the student. The operations are the work processes in which the student engages in their attempt to use the givens to get to the goals. In the 15 Puzzle example, the givens are the state of the puzzle board at each move, the goal is to put the numbered tiles on order, and the operations consist of both the observed movement of the tiles as well as the unobservable decision-making processes and strategizing that the player conducts before moving a tile. The entire work process consists of taking the givens, performing operations, and arriving at a goal state. 5 Gameplay certainly embodies the Newell and Simon’s (1972) definition of work processes, and definitions of gameplay echo their assertion that work processes consist of givens, operations, and goals. For example, Suits (2005) defined gameplay as: … [An] attempt to achieve a specific state of affairs (prelusory goal), using only means permitted by rules (lusory means), where the rules prohibit use of more efficient in favour of less efficient means (constitutive rules), and where the rules are accepted just because they make possible such activity (lusory attitude). (p.54- 55) What Suits’ definition reveals about gameplay is that players accept the imposition of game rules which make it more difficult to engage in certain work processes, but the assumption is that the added challenge and experience of the game rules is entertaining and engaging. For example, if the goal of the 15 Puzzle is to put the tiles in order, the most efficient way to do this is to crack the case and arrange the tiles by hand, but players accept the rules of the game that limit them to moving only one tile at a time into the one available empty space. But why do researchers care about work processes? Why is it valuable to identify how students take the givens and perform operations to reach a goal state? While many traditional assessments use the goal state (e.g., correct or incorrect) for assessment evidence, there is one common assessment use case for which the work process itself is valuable evidence: differentiating between novice and expert work processes. This application manifests frequently in literature relating to complex assessment designs, where researchers are either trying to discriminate between groups of students (e.g., Bruer, 1993; Stevens et al., 2012; Nash & Shaffer, 2011), or they are trying formatively 6 to help a novice student become an expert (e.g., Kerr, Chung, & Iseli, 2011; Stevens & Casillas, 2006). In the third GBA paradigm described above, the GBA provides an environment with some rules and constraints that govern the student’s actions. If the game environment becomes more complex and open, vast differences may exist in students’ work processes—both in terms of operations and goals. These GBAs also open up possibilities for interpreting sub-processes or sequences within a student’s work process as he or she performs gameplay operations to reach a goal (Mislevy et al., 2014). Multiplayer environments expand the possibilities to include gameplay where students interact together while working on individual goals or where students collaborate on a single group goal, possibly with competitive aspects of the play. Regardless of the scenario, the focus on work processes in this third GBA paradigm represents an exciting frontier in assessment design, as well as a challenge for process data collection and interpretation, and this GBA paradigm is the focus of this paper. The Challenge of Interpreting Work Process Data in GBAs Using work process data streams for assessment evidence is not a new concept. Assessment designers have successfully built complex performance assessments, simulations, and training systems for many years, especially in the certification and licensure fields (e.g., Braun, Bejar, & Williamson, 2006; Levy & Mislevy, 2004; Margolis & Clauser, 2006; Mislevy, 1996). Though these assessments are excellent examples of assessing work processes, they are not games. They are meant to assess and instruct, but they are not designed to be especially fun or engaging, and while complex, these assessments’ designers must impose some rigidity to ensure that their measurement 7 models can accommodate all possible responses. This rigidity cannot exist in a GBA, which means GBA designers must balance the need to be able to plan for the evidence that can be fed into a measurement model while also avoiding a monotonous game experience. From a game perspective, GBA designers seek to strike a balance between chaos, determinism, and opacity. Chaotic variation in the game experience can make the game seem more realistic, but it undermines standardization of students’ work processes, and a game with too much chaos can frustrate students (for example, a 15 Puzzle with a random starting board is fun and challenging, but a 15 Puzzle where the numbers randomly change on tile faces throughout the game would be maddening). Designers may instead focus on deterministic game experiences (similar to the design of simulation assessments), but these may need to be very complex if they are going to be engaging for the students. The designer may also choose to obfuscate certain game mechanics so that the students can make discoveries and learn about the game as they play, which is different from most assessments where the goal is to ensure that students understand exactly how to respond to an item or perform a task (Mislevy et al., 2014). Chaos, complex deterministic game elements, and opacity create variation in the game experience and the students’ work processes that can make it challenging (though certainly not impossible) to interpret students’ performance; however, GBA designers are now using data mining techniques to identify patterns in GBA response data while simultaneously loosening the constraints on the assessment design. In order to find patterns in subdimensions of these large, sparse datasets of GBA response data, researchers often turn to clustering algorithms to help group and interpret students’ GBA 8 results. Clustering algorithms are processes that use observed similarities or densities in complex data to identify patterns and group similar observations (Berkhin, 2006). Although the use of clustering algorithms to classify student response patterns may seem straightforward, researchers must still make decisions about which clustering algorithm to use and how to set any required parameters. Results may also be compared across multiple potential algorithms to help guide interpretation. Data Collection Considerations for Clustering GBA Data While there are existing studies that illustrate how to select proper clustering algorithms for general data mining work (e.g., Houle, Kriegel, Kröger, Schubert, & Zimek, 2010), there are additional considerations for researchers who are working with GBA data because the data must be linked by a theory to the construct of interest. A construct is a hypothetical concept that the test developers define based on theory, existing literature, and prior empirical work. A construct may be a skill (i.e., the capacity to do something under certain conditions), proficiency (i.e., competency within a domain), or knowledge (i.e., understanding of a subject) (Mislevy, Behrens, Dicerbo, & Levy, 2012). Assessment researchers must propose how a construct relates to a domain of observable behaviors and then design an assessment instrument to record those behaviors in a way to make an inference about the students in the context of the construct. As a simple example, a researcher will propose that the construct of math ability is related to correct answers on math questions, which one can observe directly. For GBAs, researchers must consider how gameplay and what aspects of gameplay relate to the construct of interest. 9 The researcher’s ability to use observable gameplay in a GBA to make inferences about a construct is dependent on how well the GBA elicits behavior related to the construct and collects relevant data. The researcher must have data at an appropriate level of granularity and organization to be able to identify relevant response patterns, and they must understand all of the variables in how the data are created and collected. Researchers interested in work processes may typically look at variables related to the architecture of the game (e.g., tasks or activities in which the student engaged, state of the game when an action occurred), but some analysis may extend to statistical variables as well (e.g., scores, normative comparisons with other students). Unlike inferences based solely on item responses or final work products, response patterns from process data (like gameplay) can include data representing actions, sequences, context, speed, timing— whole vectors of data representing multiple interacting or overlapping processes, only some of which may be of interest. For example, GBA data may have subsets of data that represent student response processes that are happening simultaneously, at different rates, or in different orders. If researchers want to have a chance at being able to untangle and identify these individual processes for analysis, they must understand how the GBA design impacts the collection and interpretation of gameplay data. Mislevy et al. (2014) discussed different types of architectural game elements and provided insight on how these elements relate to how evidence is collected and interpreted from students’ work processes or gameplay. Mislevy et al. observed that the fundamental elements of a game are the rules, where a reaction from the game is triggered when the student does an action at a given time and under certain conditions. There are several types of architectural elements in the game besides rules, but there are 10 two elements that relate to the rules and to how evidence is collected about the students’ work processes: connections and state. Connections define the relationships between objects and attributes in game design, which includes design elements like the rules of the game. Connections can also serve to define evidence from work processes in cases where sequences or timings matter. As an example, Mislevy et al. (2014) described a GBA where the sequence of students’ actions was an important factor in making an inference about how well the student understood and planned for the consequences of their decisions in the game because the rules of the game stated that a set of actions had to be completed in a certain order to achieve the desired outcome. In the 15 Puzzle example, connections can represent which tiles can be moved at a given point in time, and which direction they can move in. Connections also represent the order in which the tiles must be placed in order for the puzzle to be solved (e.g., tile 5 must appear before tile 6). Game state is another potential source of evidence for interpreting students’ work processes. Game state defines what actions the student can take and the possible consequences to those actions, and the game state is updated by rules in the game. Game state may influence the student’s actions as well. For example, a GBA with high autonomy may mean that some students do not get to explore the entire game environment—it is just too vast and complex—and if certain actions or processes are only possible under some game states, then it may be that those students never have an opportunity to demonstrate those processes. Mislevy et al. (2014) gave an example of a video game where the avatar’s possible movements at a given moment in time depend on whether or not she is walking on land or swimming in water. In the 15 Puzzle example, 11 game state defines which tiles can be moved, but it also defines the most efficient solution. Recall that Brüngger et al. (1999) solved difficult 15 Puzzle game states with 80 moves, but a puzzle where only one tile was out of place can be solved instantly. While game state can force or suppress certain actions, it can also be the source of context for applying a rule. For example, in sports, the team with the highest score only wins the game (application of a rule) when the timer runs out (a game state). The same is true in many simulation assessments, where complex logical scoring trees provide the same rule- based structures (e.g., Margolis & Clauser, 2006). Understanding how actions, sequences (connections), and game state may impact our understanding of a student’s work process is an important step toward deciding how to collect and analyze GBA data. Different analyses will be appropriate for different types of data and inferences, but for researchers seeking to identify and then interpret patterns of behavior in data, some researchers suggest using data mining techniques like clustering algorithms (e.g., Kerr, Chung, & Iseli, 2011; Romero, Gonzalez, Ventura, del Jesus, & Herrera, 2009). Clustering algorithms are exploratory processes that group cases together based on observed similarities in the data (Berkhin, 2006). For researchers who are using clustering algorithms to data mine from GBAs, the choice of variables to analyze can impact the interpretations of the clusters, so it is helpful to understand how the inclusion or exclusion of sequence and game state data impacts the identification of clusters representing different work processes. Sequence and game state data may or may not be relevant to interpretation, but including these data usually increases the size of the dataset, leading to a clustering problem for high-dimensional, sparse, and often nominal data. 12 Study Purpose To date, there has been excellent research into the design of GBAs (e.g., Kerr & Chung, 2012; Mislevy et al., 2014; Stevens & Casillas; 2006), and there is extensive guidance on the limitations and proper validation of clustering algorithm results (e.g., Kriegel, Kröger, & Zimek, 2009; Halkidi, Batistakis, & Vazirgiannis, 2001; Xu, Recker, Qi, Flann, & Ye, 2013). Unfortunately, little guidance is provided for the specific application of clustering algorithms to work process data (i.e., gameplay data) in GBAs. The result has been research papers that do little to explain the rationale for using a given clustering algorithm, and do less still to validate results with more than expert judgment. GBA researchers have made use of clustering algorithms and produced meaningful clusters of process data that can be interpreted, but as the field expands, GBA researchers’ efforts will be more focused and defensible if they understand how GBA design affects gameplay and how the resulting work process data structure may affect the utility of different clustering algorithms. Furthermore, although there is guidance on designing GBAs and creating a valid assessment (e.g., Mislevy et al., 2014), there is not yet a tutorial providing instructions on how to validate and report clustering algorithm results for those GBA researchers who choose to use these tools. There is an opportunity to merge lessons and examples from both GBA researchers and other clustering algorithm research to create a unified set of instructions. Given GBA design considerations and their potential impact on the interpretation of students’ work processes, this study accomplishes two goals: 1. Investigate how well different clustering algorithms identify clusters of response patterns in GBA response data in order to inform researchers about 13 clustering algorithm limitations and impacts of varying conditions in datasets that mimic GBA scenarios. 2. Provide researchers with a tutorial for how to select clustering algorithms, compare results, and validate a clustering solution in the context of GBA development to ensure that inferences made about students on the basis of a clustering algorithm have been rigorously investigated and can support a validity argument. Three clustering algorithms are compared in this study: k-means, self-organizing maps (SOM), and robust clustering using links (ROCK). The k-means algorithm was selected due to its general popularity and ease of use for education research. The SOM algorithm was selected because it has been used to analyze work process data in GBAs and simulation assessments (Stevens & Casillas, 2006; Stevens et al., 2012), and ROCK was chosen because it is specifically designed for binary data, which is a common way of tagging work process data. Chapter 3 discusses a small-scale feasibility simulation examining differences in these algorithms’ clustering solutions and their performance with GBA data. A large-scale tutorial example study using empirical GBA data and a simulation study are outlined in Chapter 4. Chapter 5 examines how the three clustering algorithms perform under the varying simulated GBA conditions discussed in Chapter 4. Chapter 6 offers a tutorial for how cluster validation can be conducted with competing algorithms, by way of the tutorial example study discussed in Chapter 4. It is important to acknowledge that this study does not generalize to all possible game designs or data conditions. Game designs and gameplay are enormously diverse, and it would be impossible to account for every possible scenario. Similarly, this study is 14 not expected to generalize to every possible clustering algorithm. Instead, this study serves to demonstrate that choices in clustering algorithms can impact the interpretability of solutions by comparing three specific algorithms. This study is intended to provide a framework for how to think about the use of clustering algorithms in GBA analysis rather than to yield claims or recommendations for the entire universe of games. For GBA researchers selecting clustering algorithms, the results illustrate how GBA design may create conditions that impact clustering algorithm results, as well as how to test, validate, and report clustering results in a GBA study. 15 Chapter 2: Game-Based Assessment The intersection of games and assessments can be hazy, but there are many assessments that are clearly not games, and many games that would be difficult to use for assessment. This section examines the similarities and differences between the two mediums and seeks to identify the characteristics that make an instrument a GBA, as well as the considerations that arise in GBA scoring and evaluation through the use of clustering algorithms. The criterion for when a game becomes an assessment is generally straightforward. A game can be used as an assessment instrument if the game is used to gather evidence from students’ behavior, and if it yields a quantitative value from which an inference about the students can be made (Crocker & Algina, 2008). There is less clarity around when an assessment becomes a game. Mislevy et al. (2014) provided a detailed overview of game design elements, but many of these concepts could just as easily describe an assessment that would not be considered a game, such as a simulation assessment, a performance assessment, or an intelligent tutoring system. The following sections provide an overview of Mislevy et al.’s (2014) design elements and seek to pinpoint the design choices that differentiate between an assessment and a GBA. Architectural Game Design Elements Mislevy et al. (2014) defined four types of architectural elements of GBAs: rules, connections, game state, and mechanics. These elements are used to engineer the functionality of the game as an instrument, and each element has its own aspects and facets that apply to different areas of game design. This section explains what these 16 elements mean in game design and demonstrates how the same concepts can be found in some non-game assessments. Rules. Rules define how the game reacts to students’ behavior given the game state and moment in time that the behavior occurred. This is true for any game. In soccer, the difference between kicking a goal and kicking the goal that wins the game is based completely on context (cumulative frequency of goals per team, time on the game clock); however, the same definition of rules exists in many non-game assessments. For example, a computer-adaptive test (CAT) reacts to a student’s response to an item by selecting the next item to present and by updating the estimate of the student’s location on a latent scale. These two reactions will depend on the state of the assessment (the student’s previously estimated location) and the moment in time on the assessment (how many items the student has already answered) (de Ayala, 2009). Connections. Connections define the relationships between game elements, and Mislevy et al. (2014) observed that a GBA may have two sets of connections: one that maps out relationships needed for gameplay, and one that maps out relationships for assessment. A game rule can be seen as a special type of connection that defines relationships between game reactions, students’ behaviors, and game state (Mislevy et al., 2014). Connections are the basis for the game’s reaction to sequential behavior as well (e.g., solving a problem with multiple steps in a certain order), but this again does not differentiate between general assessments and GBAs. A simple example is the use of sequential relationships in common item formats, such as drag-and-drop items. Game state. Game state defines what a student can do in a game at a specific moment in time, and game state can be an input for evaluating a rule in a game (Mislevy 17 et al., 2014). Game state can help to maintain a logical game environment for the user (e.g., when underwater, the student’s avatar can only swim, not run or jump), or it may be used as a component in the game narrative. A typical example is levels—students often cannot advance to higher levels until they have completed initial levels and the game state is updated. Although the term “game state” is used here, the state of an instrument and the context of a student’s action are also relevant in assessments. The rules described previously for CATs are clearly dependent on the state of the assessment at the time the rule is executed. Mechanics. The final type of game element defined by Mislevy et al. (2014) is a game mechanic. Mechanics define elements of the game that are used by the student during gameplay, and they allow the student to make actions that are registered and reacted to by the game. Mechanics are selected so that they guide actions that are relevant to the target domain. These mechanic devices exist in other assessment types—not just GBAs. Simulation assessments and intelligent tutoring systems require these types of mechanics to register a student's actions as evidence for the assessment. Even simple multiple choice items have a mechanic—specifically, a tool that allows a student to select a response or fill in a circle corresponding to their response. Thus, none of the architectural elements of games seem to differentiate games from assessments—all of these design elements are easy to find in traditional assessments—so if a differentiation does exist, it must lie in the experiential game design elements, discussed in the next section. Nevertheless, the architectural elements of a game are important to consider when researchers use clustering algorithms to analyze work process data from gameplay. 18 Understanding the architectural elements can help researchers make informed decisions about which variables to analyze and which cases to exclude. Understanding these elements may also help with interpretation of the clusters, especially if a game is designed to guide students toward a common response pattern, as may be the case in instructional games. Most importantly, the architectural elements will help researchers plan for the data distributions they can expect to observe, which in turn will help them select an appropriate clustering algorithm. For example, if an algorithm assumes the data are spherically distributed, but activation of one connection blocks the student from accessing certain parts of the game, thus creating dependence between variables and resulting in a non-spherical distribution, then researchers can think about restructuring their data or using a different algorithm with more relaxed assumptions. For a very simple example, consider a game where a section of the game becomes unreachable after a short period of time—perhaps a bridge is collapsing or a door is closing—and one must race to get through it. This change in connections keeps slower players from participating in the game section that has been cut off, which could put a ceiling on their achievement in the game, although they may still play in other areas of the game environment. Figure 2 below shows how this hypothetical relation between speed of play and game achievement might look in this simple example. 19 Figure 2. Hypothetical example of a time-based connection element that students can only experience if they play quick enough (e.g., get across a bridge before it collapses). This connection element puts a ceiling on slower players’ possible achievement in the game and impacts the distribution of achievement scores. Experiential Game Design Elements In addition to architectural elements, there are three types of experiential game elements: autonomy, competence, and relatedness (Mislevy et al., 2014). These elements define how the student experiences and interacts with the game, and they are intended to elicit emotions, learning, engagement, and cognitive reactions from the student. These elements may contain the characteristics of the instrument that distinguish between something that is only an assessment versus something that is a GBA; i.e., they may help answer the question, when does an assessment become fun? Autonomy. Autonomy refers to how much freedom a student has in his or her decisions about which actions to take in a game (Mislevy et al., 2014). Higher autonomy adds complexity which may make the game more engaging, but it also may expand the number of possible response patterns that can be observed, creating challenges for interpretation and analysis. There are four aspects of game design that impact autonomy: authorship, chaos, determinism, and opacity (Mislevy et al., 2014). Authorship refers to the constraints on 20 how a student interacts with a game, and it exists on a spectrum. If all students have the same identical experience, then the authorship is very restricted. If the game simply provides an environment and students can make up their own experience within that environment, then authorship is very open. The second autonomy factor, chaos, refers to elements in the game that are allowed to vary randomly, which impacts the number of decisions and actions that a student could potentially make. Some chaotic elements make games feel more realistic, but too many chaotic elements may overwhelm students. Conversely, deterministic elements are those that are created by the game designer and follow predetermined rules, but a significant amount of work is needed to create complex, challenging deterministic elements. The final factor in autonomy is opacity, which refers to how much of the underlying mechanics of the game are initially exposed to the user. A fully transparent game makes it clear to students how the game is working and what they are supposed to do, whereas a completely opaque game is nearly impossible to figure out and may be frustrating. Some opacity can be desirable for creating an interesting game experience where students can discover things within the game and learn about the game as they play. If autonomy is a distinguishing feature between GBAs and other assessment types, it is a subtle one. High varying degrees of authorship exist in both GBAs and other assessments. Chaos is less common in assessment instruments because chaos can make it difficult to compare responses between students (Mislevy et al., 2014), but it is not unheard of. While many assessment elements are deterministic, some formative assessments use chaotic elements like item templates that randomly generate values for predefined variables. For example, an assessment may have addition questions that 21 randomly select numbers (under some constraints) for the students to sum. Conversely, there are GBAs with very little chaos, game presentation is completely deterministic, and the same elements are applied for all students (e.g., Kerr et al., 2011). The final aspect of autonomy, opacity, also exists in simulation assessments. For example, there are assessments where the student does not initially know the complete scope of the task, and only through a process of discovery and learning through interaction with the assessment can the student arrive at a response (e.g., Margolis & Clauser, 2006; Mislevy, 1996). Although autonomy does not seem to be a concept that is specific to GBAs, there is an architectural element to consider when evaluating the authorship of an instrument’s design: rules. A designer can create an interface that supports many inputs from the students, thus creating high authorship, but if the instrument does not have rules to react to the student’s actions, then the extensive autonomy is not very engaging for the student, and it serves no purpose for assessment or gameplay. Autonomy in the absence of a supporting rules structure can flood a dataset with irrelevant response data, making it difficult to analyze or interpret response patterns with a clustering algorithm. For this reason, any researcher using a clustering algorithm with a high autonomy instrument like a complex GBA should be prepared for the presence of data that are not relevant to the construct or for distinguishing between work processes, and this scenario should also be addressed in any comparison of the performance of different clustering algorithms. Competence. Competence refers to the student’s ability to apply and demonstrate their skills or knowledge within a game. This means that the game should have goals or challenges that the student is trying to achieve. The game should also have some degree of complexity (which relates to opacity) where the student has the opportunity to learn 22 about and engage with the architectural elements of the game—some of which may not be immediately transparent. As part of competence, students must receive feedback, which can be as simple as earning points or completing levels (Mislevy et al., 2014). The final competence factor is adaptability, which means that the game should be designed to be challenging and engaging for all players, regardless of ability (Gee, 2007). Like autonomy elements, competence elements also create challenges for scoring and interpretation, especially with respect to adaptability. A well-designed game can provide experiences that are challenging to a variety of skill levels, but this can create issues for comparability. Depending on the game design, there may be very different game experiences for students with different abilities, increasing the sparseness of the dataset. There are many well-designed assessments that exhibit the elements of competence, and it is not unique to GBAs. Goals of traditional assessments are often communicated to students in the form of performance level descriptors, rubrics, test blueprints, and other test documents that outline the purpose of the assessment, how performance is evaluated, and how that performance will be measured. For students, this may boil down to a goal of answering as many questions correctly as possible, but that is nevertheless a goal. The ability to learn about architectural elements in an assessment is also common, and some assessments are designed to have students work through different steps of discovery before arriving at a solution (e.g., Margolis & Clauser, 2006; Mislevy, 1996). Feedback and adaptability are also common to many assessment designs. Competence elements are typically outcomes with feedback to the student, and the game designer expects most students to interact with each competence element 23 eventually; i.e., all students will beat a level, earn points, win the game, etc. if given enough time. In this sense, competence elements may not be an important consideration for including in a clustering algorithm that is seeking to describe work processes as opposed to outcomes; however, competence elements can serve to help interpret or validate clusters. Consider a case where there is one cluster representing a common work process for achieving a competence element goal but also a second cluster in which students also frequently—and perhaps unexpectedly—achieved the same goal. This would be evidence of an alternative solution to the game. Relatedness. If there is an aspect of instrument design that best highlights the differences between an assessment and a GBA, it is relatedness. Relatedness refers to elements that help the student to identify with or empathize with the game. These can include narratives, settings, or even collaboration with other players or simulated characters. Relatedness often increases students’ engagement and motivation to play and succeed in the game. SRI International, the organization responsible for independently studying the validity and effectiveness of GlassLab’s games, observed that GBAs are being designed as a method for overcoming students’ disengagement (SRI International, 2015). It may be that relatedness causes an assessment to make the leap to GBA when the design elements are added with the intent to help the student identify with the instrument without any theoretical connection to the measured construct or the task model for the assessment. SRI International (2015) made a similar observation, noting that GBAs stand out because they are activities in which both students are motivated to participate. This sentiment was echoed by a teacher profiled by Schwartz (2014b), who 24 argued that as instruments map more closely and directly to educational standards, the less “game-like” it becomes. To put it simply, in GBAs, something has been added to the design of the instrument with the sole purpose of making it more fun. D’Angelo et al. (2014) made the same distinction in an effort to distinguish between games and simulations. As part of their definition, they stipulated that simulations represented real-world or hypothetical scenarios; suggesting that a game might use scenarios that would be fantastical or even impossible in the real world as a form of relatedness elements. They added that games could include “non-learning-based goals” (i.e., relatedness elements) like levels, points, or other reward systems; whereas simulations focus the student’s experience onto the specific process or system that is being simulated and assessed. D’Angelo et al.’s distinctions between games and simulations echo Suits’ (2005) definition of gameplay, in which he noted that players accept the imposition of game rules which make it more difficult to achieve a work process goal than it would be under non-game conditions (e.g., it is easier to put a ball in a net if one does not have to play soccer to do so). Relatedness does not need complex story lines and interesting characters however. Simple games also have design elements that impact relatedness. As an example, Number Jumble presents students with multiplication tasks. The student can earn points by solving the problems correctly, and the student can earn bonuses by answering quickly and accurately while the bonus clock ticks down (BrainPOP, 2015). These simple additions support notions of rewards, suspense, and urgency without having anything to do with the student’s ability to multiply numbers, assuming one does not evaluate a student’s multiplication aptitude on the basis of speededness. 25 From a scoring and interpretation standpoint, relatedness elements present one of the biggest challenges for the use of clustering algorithms in GBAs: irrelevant data. Relatedness elements are present to increase empathy, not to elicit evidence about the construct, yet students interact with these elements and data are collected. Even if a relatedness element does not result in collected data (e.g., a soundtrack for the game), there is no guarantee that it is not having an impact on the student’s response patterns for those variables which are relevant to the construct. These relatedness elements will either produce their own variables which the researcher must choose to include or exclude, or they may influence the response patterns. From a cluster interpretation standpoint, it may be impossible to tell if a pattern of responses related to the work process that differentiates novices from experts or if it was just the student trying something for fun because of a well-designed relatedness element. GBAs: Assessments with Construct-Irrelevant Relatedness Elements The previous sections showed that all of the elements used in game design are found in regular, non-game assessments, but if an assessment is considered a GBA, it is due to the inclusion of construct-irrelevant relatedness elements: design elements that exist only to make the assessment fun. For the purposes of this discussion, we will consider an assessment instrument to be a GBA if a designer adds relatedness design elements to the assessment which have no theoretical connection to the task model and are intended only to help the student empathize with the game. This may include elements like narratives, navigation, animations, backgrounds, music, characters, avatars, timers, points, or other game elements. The addition of relatedness elements is the unique 26 differentiator between GBAs and other complex assessments or performance assessments. These non-assessment relatedness elements are both a strength and a weakness from an assessment standpoint. As a strength, they make the instrument more enjoyable for students. This encourages increased engagement, thus improving the instrument’s ability to be used as a learning tool as well as a tool for ongoing evidence collection for assessment. However, these relatedness elements also add superfluous elements to the task model. In traditional assessments, superfluous elements are surgically stripped from items to help focus the item on its content area and to avoid unnecessary complexity for students. Adding extraneous wording or stimuli to an item stem is often called “window dressing,” and item writers are instructed to avoid this practice, because it can negatively impact the reliability and validity of an item (Nitko & Brookhart, 2007; Haladyna & Downing, 1989). The same concerns exist when adding construct-irrelevant relatedness elements in a GBA design. Mislevy et al. (2014) warned that relatedness elements can increase construct-irrelevant variance. In electronic GBAs, relatedness elements are also risky from an evidence collection perspective because they may produce irrelevant data that muddy the water for researchers seeking to interpret students’ gameplay. This is why analysis of GBA data can be challenging for psychometricians who are interested only in identifying construct-relevant response patterns—it may not always be clear which data are relevant. The Use of GBAs in Education The use of games is already well established in U.S. classrooms. In a national survey of a stratified random sample of 648 K-8 teachers selected from a commercial 27 pool (a self-selected population of survey respondents maintained by VeraQuest), Takeuchi and Vaala (2014) found that 74% of the teachers surveyed use digital games, and 55% reported doing so on a weekly basis. In a different survey of 488 K-12 teachers, Fishman, Riconscente, Snider, Tsai, and Plass (2014) found similar results: 57% of teachers used games weekly, although they observed that that number tended to be higher for teachers working with younger students. With games already embedded in instruction, it is easy to see how the next logical step would be GBA, but Takeuchi and Vaala (2014) found that assessment was a minority use case for teachers who use digital games. Of the 648 K-8 teachers surveyed, only 29% reported using games for formative assessment of content related to educational standards, and only 17% reported using games for summative assessment as well. Furthermore, there is no distinction in Takeuchi and Vaala’s survey between the three aforementioned game paradigms described by Mislevy et al. (2014), so it may be that some of these teachers are using games that fall into the second paradigm, where traditional assessment tasks are woven into a game instrument. Of the 29% of teachers who said that they used games for assessments, only 31% of these (9% of the total sample) reported that they assess students’ abilities by observing or discussing how the students played a game, implying that the remaining teachers only used the outcome of the game for assessment. Although less than one-third of K-8 teachers reported using digital games for even formative assessment, three quarters of teachers reported using digital games on a weekly basis, suggesting that teachers at least perceive benefits of gameplay for learning (Takeuchi & Vaala, 2014), but whether those perceptions reflect reality has not been 28 rigorously studied. Granic, Lobel, and Engels (2014) argued that digital games may have cognitive, motivational, emotional, and social benefits for students, and they called for additional research to study these effect areas. Nevertheless, the potential benefits suggested by Granic et al. make sense from a game design perspective, because unlike other assessment instruments, games are designed to engage players emotionally or intellectually (Shapiro, Teknibaş, Schwartz, & Darvasi, 2014). Given the engagement design goals of games, Granic et al’s (2014) proposed benefits of gameplay for learning are certainly plausible, which suggests that the same benefits may be realized for educators who use GBAs as assessments for learning. Games (specifically GBAs) also have the added motivational benefit of not feeling like assessments. Many GBAs are examples of assessments as learning, meaning that they serve as learning devices as well as assessment instruments. By weaving the assessment features into the gameplay, students can learn continuously without having to pause to be evaluated. Students may also not feel as much anxiety about the assessment if it is embedded in their learning activities. This invisible integration of assessment and learning is sometimes called stealth assessment (Shute & Ke, 2012). From an emotional standpoint, Granic, Lobel, and Engels (2014) note that there is no clear theory of how positive emotional states benefit people beyond the fact that they just feel good. Nevertheless, negative emotions like test anxiety are something that assessment designers seek to minimize, as they can introduce construct-irrelevant variance in responses (Mandler & Sarason, 1952). Wrapping an assessment in a game that is associated with positive emotional states can improve students’ perception of 29 assessment and perhaps even improve measurement by minimizing anxiety. A review regarding the effect of anxiety on test performance can be found in Powers (1999). Regardless of the perceived student benefits of using games, one thing is certain: GBAs are already taking the place of traditional assessments as measurement instruments; so it is worth asking whether there are measurement benefits to using GBAs. Admittedly, some of these instruments are little more than gamified versions of traditional assessments (e.g., keep solving multiplication problems to keep your race car going fast), but recall in Chapter 1 the discussion of the shift in GBA design to a paradigm where assessment is occurs inside the game architecture (Mislevy et al., 2014). Using this paradigm of GBAs, the assessment domain is now represented by the work processes or gameplay of the students (Shute, 2011). Using work processes as domains creates an opportunity to measure constructs that may be difficult to represent in traditional assessments. For example, Shute and Wang (2016) explained that digital learning environments like digital GBAs are ideal tools for creating assessments that require the application of multiple competencies as part of the work process. They added that well-designed games are challenging (without being too difficult), provide feedback during gameplay, and reinforce students’ sense of control over the task—all of which, they argued, make GBAs useful for teaching and assessing hard-to-measure constructs. What are hard-to-measure constructs? Shute and Wang (2016) list creativity, problem solving, persistence, systems-thinking, gaming-the-system, and design thinking as examples. One might also add collaboration to this list, given the cooperative game designs of some GBAs already used in classrooms (e.g., Eseryel, Ge, Ifenthaler, & Law, 30 2011; Nash & Shaffer, 2011). Shute and Wang (2016) chose their list of hard-to-measure constructs from a set of skills that are currently seen as valuable competencies for students who are entering college or the workforce, but the list constructs that can be represented with work process domains could certainly be expanded. Challenges of Using Games for Assessment The previous section shows that educators and researchers perceive potential learning benefits from games, and the range of architectural and experiential game elements discussed above point towards new frontiers for the types of task models that can be used in assessments; however, the structure of games brings new challenges for those who seek to use games as measurement instruments. Researchers must consider how to evaluate students’ gameplay in light of construct-irrelevant relatedness elements, unrecorded student actions, response dependencies on sequence, and response dependencies on game state. Construct-irrelevant relatedness elements in games can cause work process response patterns to be a mixture of both construct-relevant actions and actions related to game skills. In complex gameplay, it can be difficult to determine when a student is exhibiting behavior related to the construct versus when the student’s performance is dependent on his or her ability to play the game. Sequence and game state may also be relevant factors in interpreting a student’s response. For instance, Shute (2014) explained that the game’s levels can become more difficult, not because the educational content has changed, but because the obstacles and goals in the game design become more difficult; i.e., the difficulty of the task is a construct-irrelevant relatedness element. As an example, Shute described a GBA where 31 students use knowledge of physics to create a system of simple machines (onscreen) to move a ball to a goal. The required knowledge of physics does not change throughout the game, but the levels still become more difficult because the designers add more obstacles around which the ball must be guided. Almond, Kim, Velasquez, and Shute (2014) expanded on this idea, noting that there is a difference between game difficulty due to complexity of the task and psychometric difficulty, which requires a higher level of the measured construct in order to succeed at the task. Almond et al. noted that these two concepts of difficulty are often but not always correlated. Keeping the distinction between task complexity and psychometric difficulty in mind, it becomes clear that a score or outcome in a GBA is only useful if the evidence for the work process is not drowned out by the enjoyable difficulties introduced as challenges and obstacles in the game’s design. Game design is not the only influence that might impact a researcher’s deductions when analyzing process data from GBAs. One must also recognize that a game may not be an enjoyable or accessible format for everyone. While games are touted as a way to boost engagement, not all games will be engaging for all students. Some students may not ever figure out how to play the game, just as some students do not understand how to respond to certain item types, which then results in response bias (Schmeiser & Welch, 2006). Students may also engage in behavior that has nothing to do with the construct of interest, yet also has little to do with scripted gameplay either. For example, open-world games are designed with such high autonomy and expansive game environments that students can spend time simply exploring (“free-roaming”) the game space. This behavior 32 reflects students learning about the game design, testing its limits, and seeing what is possible in the rules and mechanisms that define the game (Muncy, 2015). In open-world games, game designers are intentionally encouraging this behavior by providing complexity and details that reward free-roaming, but this exploratory behavior is not limited to complex games. Students may experiment with the limitations of any task, testing the range of acceptable behaviors and responses and discovering the mechanics that underlay the assignment. From the researcher’s perspective, the free-roaming work process data may be unintelligible. To further complicate analysis of work process data, it is possible that some student actions in a work process may not even be recorded. Data are not collected for a student’s work process unless that process includes interaction with the game interface. In their development of a GBA, Hoffman, John, and Makany (2014) noted much of the gameplay involved students observing and mentally interpreting the onscreen effects of an action in the game. The students then used those observations and interpretations to choose their next move onscreen, though no data were collected by the game about the cognitive decision-making processes that led up to that move. Given the difficulties of evaluation resulting from construct-irrelevant relatedness elements, unrecorded student actions, response dependencies on sequence, and response dependencies on game state, GBA designers must plan evidence collection strategies early in the game design process using assessment development frameworks like evidence-centered design. 33 Designing GBAs with Evidence-Centered Design An established framework for designing, delivering, and scoring assessments is evidence-centered design (ECD). ECD applies a broad range of assessment design resources (e.g., subject matter knowledge, software design, psychometrics, pedagogical knowledge) to the inferences that are drawn from the assessment scores, thus enabling interdisciplinary teams to use a common set of guiding concepts to reason through the design and implementation of an assessment (Mislevy, 2011). ECD has been used to develop many assessments across multiple disciplines, measurement models, and use cases, but Mislevy et al. (2014) specifically recommended the use of ECD for the development of GBAs, and there are several examples of GBAs designed with ECD (e.g., Hoffman et al., 2014; Stevens & Casillas, 2006). ECD provides guidance on how to create an operational assessment for a latent construct through multiple, iterative development steps. Each step helps to build an assessment argument to support the validity of the assessment’s inferences (Mislevy, 2011). In this sense, ECD is not just a guide for developing assessments, but also a conceptual framework linking the structure of an assessment to the inferences about the construct and intended applications of those inferences. The ECD component that relates to the design of the assessment is the conceptual assessment framework (CAF) (e.g., Mislevy et al., 2012). The CAF divides the operational elements of an assessment’s design into three parts: the student model, the task model, and the evidence model, which are discussed below. The student model in the CAF specifies the constructs and construct relationships that are being measured to describe the student’s knowledge or abilities (Mislevy et al., 34 2012). The goal of the assessment is to make inferences about a construct based on observed behavior in the assessment (Crocker & Algina, 2008). Assessments may measure a single construct, multiple constructs, or a hierarchy of constructs and sub- constructs at varying levels of granularity. Note that a student model does not necessarily describe all of the constructs that may be applied by the student when completing the assessment; it only describes those constructs for which we seek to make an inference. The task model defines the assumptions and specifications for what the student can do in the assessment (the task) and the features of the environment in which the task takes place. (Mislevy et al., 2012). The task model also defines the specifications for any work products the student may be expected to create during the assessment. Even in high- autonomy assessments where there may be substantial variability in students’ responses and work processes, the task model can act as an anchor for interpreting results. Regardless of the process a student executes, one can anticipate that most students are engaging in a process that relates to how they perceive and interpret the task model. This means that researchers may not know what the processes are, but there is a reasonable expectation that researchers know what the students are trying to accomplish because the researchers designed the task model. The evidence model connects the task models and the student model in the CAF. It has two parts: the evaluation / evidence identification component and the measurement model / evidence accumulation component. The evaluation component defines how evidence is identified and collected in the work products, and the measurement model component defines how to aggregate those data to make inferences about the student 35 (Mislevy et al., 2012). Figure 3 illustrates how the evidence model relates to the task model and student model. Figure 3. Relationship of the student model, evidence model, and task model in the CAF. The evidence model contains two parts: the evaluation component and the measurement model component. As discussed previously, identifying evidence and creating a measurement model can be especially challenging in GBAs due to construct-irrelevant relatedness elements, unrecorded student actions, response dependencies on sequence, and response dependencies on game state. In a multiple choice item, the evaluation component simply depends on whether or not the student selected the correct choice. In a GBA, the evaluation component must make sense of large volumes of data that may come across as raw, granular pieces of information, such as keystrokes, time stamps, or text entry. The evidence model may need to define how to reduce the data to a set of scorable data points (a process called “tagging”), identify tagged data to be used for scoring the student’s final output (goal state), and possibly identify tagged data to be used for scoring the student’s work process. Once relevant evidence data have been identified, the GBA designers must then design a measurement model that can combine the evidence into an actionable inference about the student. Measurement Models: Principled versus Empirical Approaches in GBAs There are a variety of strategies to analyze response data from GBAs. This section discusses different methods that have been used for the measurement model and the Student Model Task Model Evidence Model Evaluation Component Measurement Model Component 36 student model, as described by the ECD framework. Bejar and Braun (1994) identified two strategies for analyzing complex response data: a principled approach and an empirical approach. Simply put, one can think of these two approaches as a top-down versus bottom-up approach, where one can either work from a high-level theory to customize an instrument to collect evidence, or one can collect evidence and work up toward a theory that explains the observations, respectively. The following sections discuss these two approaches in further detail. Principled approaches. With a principled approach, measurement models and student models are defined a priori based on the findings of a domain analysis and subject matter experts’ understandings of how a construct will be manifested in the response data. This approach is appealing because the scoring rules are built on theory; however, there is a presumption that the test designers have considered every possible response pattern and how to score it in advance. This is usually not possible for high-autonomy task models, and measurement models and student models (as well as the tasks themselves) often have to be iteratively fine-tuned based on empirical approaches or usability studies of the GBA (Bejar & Braun, 1994). Many GBAs use a principled approach in their measurement model as a matter of practicality. Rather than collecting every single key stroke of a student and then mining these data to determine which data might relate to the desired inferences, inputted response data (e.g., keystrokes, mouse movements, eye tracking data, time stamps) are often tagged or scored at some granular level. This initial tagging can be done with scoring trees, which are simply logical decision trees that return a value based on the observed result data. For example, if a student executes some action under a specific set 37 of circumstances, a scoring tree might return a value of 1 (and any other action might return a 0). That scoring tree must be defined by a test designer’s understanding and interpretation of the student’s performance in the context of the game state, so scoring trees count as a principled approach. Item response theory (IRT) and Bayes nets are other examples of principled approaches used to evaluate GBA data, though in practice these methods rely on a foundational layer of scoring trees to simplify the response data (e.g., Quellmalz, Timms, Silberglitt, & Buckley, 2011; Mislevy, 1996; Shute, Ventura, & Kim, 2013). Conversely, exploratory approaches to GBA measurement models like clustering algorithms are often associated with empirical approaches for analyzing the complex response data, for which there may be only an inchoate theory about how response patterns will manifest as a result of the underlying construct. It is unlikely that these exploratory approaches would ever be used in a principled measurement approach because these methods are used to identify structures in multidimensional data that were previously unknown to the researcher. If the researcher has a theory for how a construct will manifest as a work process, then the assessment and measurement model can be built with a principled approach to capture and score evidence about the student’s work process—there would be no need for exploratory approaches. The following section explains how researchers may use exploratory methods in an empirical approach to analyzing complex data and evaluating student performance when a theory of response patterns does not exist a priori. Empirical approaches. With an empirical approach, subject matter experts analyze the results data and define a measurement model `that will relate the work 38 process observations to their student model describing the construct. As a simple example, a test designer might use a regression analysis to relate observed game-based assessment data (either raw or tagged) to experts’ ratings of the students’ performance. That regression model could then be used as the measurement model for future administrations (Bejar & Braun, 1994). Clustering algorithms are also an example of an empirical approach to measurement because one clusters students based on the collected work process data and then makes post hoc inferences about students based on the characteristics of the clusters. As an illustration, Kerr et al. (2011) took an empirical approach to analyzing GBA data with the use of fuzzy cluster analysis. Test designers found that the clusters corresponded to different solution strategies and common errors or misconceptions that manifested themselves in the response data. One benefit of their game was that different game levels had similar designs, and comparisons could be made between a student’s cluster membership across levels. Stevens and Casillas (2006) used the self-organizing map (SOM) clustering algorithm to analyze the paths students took through a game environment. The game tracked how many times the student visited each menu item in the game, the order of the visits, and the outcome of the game (right or wrong). The algorithm analyzed these paths and grouped the individual students’ paths into different clusters. The researchers were able to go back and map the characteristics of these clusters to descriptions of theoretical strategies they expected students to exhibit during gameplay. How principled and empirical approaches impact GBA design. It has been argued that the ideal GBA design is one where the test designers have tight control over 39 the assessment mechanisms without restricting the flow of the game (Mislevy et al., 2014). This is certainly the case for assessments designed with a framework such as ECD. Measurement models must be able to process the provided evidence data and yield results that are meaningful in the context of the student model. The test designers build the rules for what counts as evidence, which suggests a principled approach. With an empirical approach, test designers have to inspect responses that already exist and then make decisions regarding which data count as evidence and how they should be used in a measurement model. For example, both Kerr et al. (2011) and Stevens and Casillas (2006) mined data with clustering algorithms to identify patterns in response data which they then used to classify students. Recall also that Mislevy et al. (2014) used the term “autonomy” to refer to the amount of freedom or variation students have in their work processes. By relying on an empirical approach, test designers do not need to have quite the same control over the assessment mechanism. They can increase the autonomy of the GBA’s task model, allowing for more variation in responses— including unexpected response patterns that may be considered valid. These patterns can then be discovered and evaluated later with an empirical approach like a clustering algorithm. Even though a measurement model may use an exploratory empirical approach instead of a principled approach, one can still use the concepts and terminology of ECD to describe the development and design of a GBA. The researcher may not have a specific theory about precisely how the construct will manifest as a response pattern in the GBA until after the exploratory analysis, but the researcher still must base the design of the game environment (design elements) and task models on a theory for how the 40 game will elicit gameplay responses that will be evidence of the construct. The test designers may not know what responses will look like (which is why they may rely on tools like clustering algorithms), but they should have a theory that any response created and recorded within that environment could be used to inform the student model. Even if the measurement model relies on an exploratory approach, the model will presumably be locked down eventually. For example, clusters may be identified with a clustering algorithm, but then those cluster centers will likely remain static for evaluating future students’ responses. In this sense, an exploratory measurement model may be seen as another part of the ECD development process for the CAF. Unfortunately, even with careful design of a GBA environment using an approach like ECD, there is no guarantee that students will provide an interpretable or meaningful response. Students will often engage in response patterns in a high-autonomy environment that are difficult to interpret, even in relatively simple task models (Corrigan et al., 2014). The emergent behavior of students’ work processes means that it may not be obvious how to create a statistical model that relates the observed response data back to the construct (Bejar & Braun, 1994; Johnson, 2001). Exploratory approaches to measurement models may also obfuscate interpretations of students’ responses. For example, a researcher clustering student response patterns with a clustering algorithm may find it difficult to understand why a student has been assigned to a given cluster or how to interpret what an entire cluster represents with respect to the construct. For example, Kerr et al. (2011) used fuzzy cluster analysis to classify the strategies and errors that students made in Save Patch, but 26.4% of the students’ game results across 16 game levels had to be classified as 41 “unexplained errors.” Stevens and Casillas (2006) noted similar interpretive constraints, observing that there may be no way to understand how the SOM algorithm arrived at a prediction for a response pattern. The following sections explore empirical classification methods in more detail and discuss the perceived benefits of empirical approaches for work process data. These unclassifiable response patterns underscore the need for GBA researchers to use clustering algorithms or other methods that can handle outlier cases well. GBA data will often be polluted with outlier response patterns (e.g., false starts, unengaged students), and these ideally should not impact the identification of clusters of legitimate responses that relate to the construct. Furthermore, the clustering solution needs to be reasonably reproducible. If cases are not consistently assigned to the same clusters, it may be difficult to trust the interpretations being made about individual students based on their cluster membership. Nevertheless, accurate classification for all students may not be needed for GBAs in practice. The empirical approaches used by GBA researchers reflect and frame the reasons (discussed in Chapter 1) for why people use GBAs in the first place: to evaluate work processes to differentiate between novices and experts or to help novices become experts. GBAs are necessarily low-stakes, formative assessments, so accuracy at an individual level may not be as important as demonstrating the novice/expert distinction in work processes at an aggregate level. Clustering Algorithms The previous section discussed the utility of empirical methods for mining response data to identify assessment evidence when researchers cannot anticipate 42 precisely how students might respond in a task model, but if researchers cannot anticipate certain responses, it begs the question about whether or not clustering algorithms are even an appropriate tool for a measurement model, even in a low-stakes application. In their paper detailing the facets of data mining, Romero et al. (2009) explained that one must use a data mining strategy that is appropriate for the type of data one wishes to identify, such as data mining to identify patterns of behaviors. Their explanation is valuable for assessment researchers because it indicates that data mining can indeed be a method for identifying cognitive or behavioral processes (Berkhin, 2006). Kerr et al. (2011) explained that identification of processes within response patterns is typically done with clustering algorithms, which can be considered a type of data mining. Clustering algorithms are processes that use observed similarities or densities in data to identify patterns and group similar observations (Berkhin, 2006). Unlike global methods like Principal Component Analysis (PCA) which map the entire data space to a lower-dimensional space, clustering algorithms account for subspaces in the data. This means that some data dimensions or correlations may be relevant for defining a subset of clusters while remaining irrelevant to the definition of others. These types of subspace- specific cluster definitions are called local feature relevance and local feature correlation, respectively, and it implies that clusters can form in different subspaces of the data (Kriegel, Kröger, & Zimek, 2009). These subspace response patterns can be thought of as unscalable classes because observed responses on each variable do not form a scale (Goodman, 1975). For example, on a test where harder items are only answered correctly by students who are located high on a construct, item scores follow a logical scale where students are expected to answer items correctly until they become too 43 difficult—this is a scalable class (Braaten, 2009). When responses do not meet this scalable class criterion—as can be the case with work process data—the response patterns represent an unscalable class. Clustering algorithms are designed to work in these types of datasets, grouping data by similarities and correlation instead of ranking patterns based on response observations on more and more difficult (rarely observed) variables. There are dozens of clustering algorithms that are used for different research questions and data scenarios [see Kriegel et al. (2009) and Berkhin (2006) for overviews and taxonomies of many of these algorithms]. The clustering methods used in GBA evaluation are called soft-projected clustering algorithms, which is a subset of axis parallel algorithms. These algorithms make the assumption that the subspaces in which clusters might exist are restricted to the spaces which are parallel to subsets of dimensions defined by the global data and that the number of clusters (k) is known in advance, although there is no mechanism in the algorithm for confirming or verifying that k is correctly specified (Kriegel et al., 2009). To put it another way, the clusters are defined by cases which share similar values on a subset of variables in the data—each variable is an axis in the data space. If the members of a cluster were “projected” (plotted) onto those axes that define that cluster, one would see those cases bunched together. The space defined by a cluster is not rotated or transformed in any way (as would be the case with clusters defined by a localized PCA), which is why it is said to be “parallel” to the axes defined by the variables. “Soft-projected” means that all of the variables presented to the algorithm are considered with equal weight, and the algorithm 44 has no method for dropping variables that may be irrelevant for the definition of a given cluster. Unfortunately, the number of clusters is not known in advance in GBA research. Researchers often need to validate the number of clusters they select through a combination of fit statistics, validation against competing algorithms, and expert review of the interpretability of the results. It is also possible that multiple clusters, while representing different work processes, still share the same meaning in relation to the construct; e.g., groups of students who made different errors in a game which are all attributable to a single misconception or knowledge gape in the construct of interest. The validity of the clustering solution may also be influenced by the presence of irrelevant data and whether or not an appropriate clustering algorithm was selected (Berkhin, 2006; Kriegel et al., 2009; Xu et al., 2013). There is ongoing research for data-mining applications in which an algorithm is able to discover and extract patterns or clusters within a dataset without any human guidance on the relevancy of variables or the nature and number of the patterns. Fayyad, Piatetsky-Shapiro, and Smyth (1996) called this field knowledge discovery in databases (KDD), and they provided a concise literature review of the field, its applications, and the limitations of data mining in KDD. Fayyad et al. describe many KDD initiatives, but all are designed for specific industries (e.g., healthcare, astronomy), and they did not mention any KDD work in education, assessment, or games. Furthermore, while KDD algorithms sound like a panacea for identifying hidden clusters in vast datasets, the accuracy of KDD algorithms is hindered by the same problems as other clustering algorithms, including high dimensionality, over-fitting models with “noisy” data, missing 45 data, complex relations, and the potential for uninterpretable clusters (Fayyad et al., 1996; Kriegel et al., 2009). Dunietz (2016) echoed this reality check when discussing the limitations of deep neural networks that are able to identify their own high level data features, such as an algorithm that learns to detect curves in geometric images without having to be programmed to know which data represents curves. Dunietz explained that researchers still tweak these algorithms, selecting inputs, weights, and other parameters based on judgment and the interpretability of the results. In short, the challenges of selecting the number of clusters and the variables to include in the algorithm when analyzing GBA process data will not necessarily be solved through the use of more sophisticated data mining algorithms. Additionally, it is important to remember that the use case—namely, grouping students based on similarities in response patterns—can also be addressed with parametric methods. For example, latent class analysis (LCA) creates conditional probabilities of whether or not a student is likely to belong to a given cluster. Xu et al. (2013) gathered usage data from 661 teachers working with an instruction design program and clustered them using both LCA and the k-means clustering algorithm. They found that the LCA clusters were better fitting and had better correlations with relevant external criterion variables than the k-means clusters. They also showed that when increasing the number of clusters (k), LCA tended to create new, small groups out of existing clusters, whereas k-means tended to create completely new, unrelated cluster solutions. While these methods are certainly valuable and promising, the fact remains that researchers are already using non-parametric, exploratory clustering algorithms in GBA 46 research, and the focus of this paper is on the performance, appropriateness, and proper application of clustering algorithms in a GBA context. The following sections provide overviews of three axis parallel soft-projected clustering algorithms: 1. k-means – used in high dimensional datasets where data are expected to have roughly spherical distributions, clusters are similar sizes, and variables have equal variance (Robinson, 2017). For example, k-means would be appropriate for clustering people based on their responses on a set of Likert items. 2. SOM – used in high dimensional datasets, but with no assumptions about the data. SOM is used in a variety of applications, especially ones in which there is a need to inspect the similarities of clusters. 3. ROCK – used in high dimensional binary datasets, but with no assumptions about the data (beyond being binary). For example, ROCK can be used to group consumers into clusters based on the items they purchased, even though they may not buy exactly the same items. GBA researchers are often working with high-dimensional datasets with sparse, nominal data, so researchers should select an algorithm that is able to find clusters under these conditions. These sections briefly describe these three algorithms’ processes and their potential utility with respect to GBA evaluation. K-means. There are several versions of the k-means algorithm, but this paper focuses on the version defined by Hartigan and Wong (1979), which is generally accepted as the preferred k-means algorithm (Berkhin, 2006; R Core Team, 2014). In k- 47 means solutions, a cluster is defined by a center point which is the centroid for the members of that cluster. The algorithm begins with a set of k potential centers which can be defined by the researcher or by randomly selected k cases from the data to act as the initial centers of the clusters. The choice of starting centers is deterministic: given the same starting centers, k- means will always return the same clustering solution (e.g., Pelleg & Moore, 2000; Ishioka, 2005). Lattin, Carroll, and Green (2003) warned that the k-means solution is only as good as its initial starting centers and that the selection of the initial centers heavily impacts the clustering solution for small datasets. Once the centers are selected, the algorithm assigns all the cases to their closest centers and recalculates the new centers defined by these clusters. Distance is determined by a user-specified similarity measure—often Euclidean distance or Manhattan distance. The algorithm goes through multiple iterations of checking each case (e.g., student response pattern) to see if it should be moved to a different cluster based on the centers’ updated coordinates. If so, it changes the case’s cluster membership, updates the centers’ coordinates, and continues to the next iteration until it converges on a solution where no points are being switched between clusters. In the case of Hartigan and Wong’s (1979) algorithm, this process is handled by seven looping steps: 1. For each case xi, such as student i’s vector of response pattern data, find the closest and second closest cluster centers mc1i and mc2i. Assign each case xi to its cluster with the closest center, mc1i. Starting centers for the clusters can be specified by the researcher or from cases randomly selected from the data. As the algorithm progresses and updates the coordinates of cluster centers, each 48 case’s second closest cluster center (mc2i) is checked to see if it has become the closest cluster center to xi (see Step 7), in which case xi’s membership is changed and cluster centers are updated again. 2. Update all cluster centers mk by calculating the mean of each variable in vector x for all cases xi currently assigned to mk. 3. Select a case, xi and define the “live set” of clusters, L. a. If this is the initial cycle of the algorithm (i.e., the first time Step 3 has run), assign all k clusters to the live set of clusters, L. b. If the algorithm has already completed at least one cycle, and if cluster mk was updated in the previous quick-transfer stage (Step 6 below) or if cluster mk was updated in any of the previous N optimal-transfer stages (Step 4 below), where N is the number of cases xi currently assigned to mk, then include mk in the live set of clusters, L. 4. Run the optimal-transfer stage of the algorithm. a. If xi’s assigned cluster, mc1i is in L based on the criteria defined in Step 3, calculate: 𝑅2𝑘𝑖 = min[𝑁𝑘 × dist(𝒙𝑖, 𝒎𝑘) 2] /(𝑁𝑘 + 1) ∀ 𝒎𝑘 ≠ 𝒎𝑐1𝑖 (1) Otherwise, if xi’s assigned cluster, mc1 is not in L, calculate: 𝑅2𝑘𝑖 = min[𝑁𝑘 × dist(𝒙𝑖, 𝒎𝑘) 2] /(𝑁𝑘 + 1) ∀ 𝒎𝑘 ≠ 𝒎𝑐1𝑖 ∈ 𝐿 (2) where Nk is the number of cases assigned to the kth cluster and dist(xi, mk) is the distance between xi and the center of mk using the distance 49 measure chosen by the researcher (typically Euclidean distance by default). b. The cluster with the smallest R2ki becomes mc2i for xi. c. If 𝑅2𝑘𝑖 < [𝑁𝑐1𝑖 × dist(𝒙𝑖, 𝒎𝑐1𝑖) 2]/(𝑁𝑐1𝑖 − 1), where Nc1i is the number of cases in the cluster to which xi is currently assigned, then mc2i becomes mc1i, mc1i becomes mc2i, and xi is assigned to the cluster with the new mc1i center. Update the cluster centers based on the new case assignment after each case xi is evaluated. 5. If L is empty, stop the algorithm. 6. Run the quick-transfer stage of the algorithm. a. Consider all cases xi whose clusters for mc1i and mc2i have changed in the past N cycles. b. For these cases, calculate: 𝑅1 = [𝑁𝑐1𝑖 × dist(𝒙𝑖, 𝒎𝑐1𝑖) 2]/(𝑁𝑐1𝑖 − 1) (3) 𝑅2 = [𝑁𝑐2𝑖 × dist(𝒙𝑖, 𝒎𝑐2𝑖) 2]/(𝑁𝑐2𝑖 + 1) (4) c. If R2 ≥ R1, then mc2i becomes mc1i, mc1i becomes mc2i, and xi is assigned to the cluster with the new mc1i center. Update the cluster centers based on the new case assignment. 7. If no cases have changed cluster assignments in the past N cycles (i.e., the past N times these seven steps have been executed), then return to Step 4, otherwise return to Step 6. 50 The k-means algorithm is simple to understand, apply, and communicate, and it is likely the most widely-used clustering algorithm; however, it is very limiting. The algorithm can only be used for spherical interval/ratio data. Because the centroids are means, they can be influenced by outliers. The algorithm also does not scale well to high- dimensional datasets, especially when the data are sparsely populated. In fact, if data are not spherically distributed, variances within clusters are not equal, or some clusters are far larger than others, the k-means algorithm can fail, drawing cluster boundaries through true clusters in the data (Berkhin, 2006). The severity of such failures in practice will depend on the scenario and the application. Some researchers may still be able to interpret the clusters correctly, even if numerous cases were assigned by an incorrect boundary, but it is just as likely that the final k-means solution does not reflect the true clusters at all. If the researcher does not validate the findings, he or she may be missing the key interpretable structures of the data. These limitations make k-means a poor choice for many GBA datasets which can easily have high-dimensional, sparsely populated, binary data. Alternate algorithms which are similar to k-means (e.g., k-median) can be more appropriate for binary data and have been used in some GBA development work (e.g., Corrigan et al., 2014; Kerr et al., 2011). Nevertheless, it is possible that some researchers use k-means clustering algorithm in inappropriate scenarios due to their own familiarity of the algorithm and its easy availability (Kriegel et al., 2009). For example, in their review of nine education studies using clustering algorithms, Xu et al. (2013) observed that five of those studies used k- means, and in their chapter discussing clustering algorithms for multivariate analysis, Lattin et al. (2003) only discussed Hartigan and Wong’s (1979) k-means algorithm 51 without any mention of k-medians or discussion of other variants. Given its widespread use and easy availability despite the limitations discussed above, k-means is included in this study to see how it compares to other methods for the purposes of GBA classification. Self-Organizing Maps (SOM). The SOM algorithm (also known as a Kohonen Map) is an artificial neural network algorithm where multidimensional data to a set of k clusters (“nodes”). One of the primary reasons SOM is popular is that the clusters can be mapped to a two-dimensional grid that shows which clusters are similar to each other. This is possible because the SOM uses a two dimensional array of nodes with weights representing cluster centers. For each presentation of the dataset (and there can be many), each case updates the location of the closest cluster center, as well as all of the cluster centers of nodes in a neighborhood around the winning node. The cumulative effect of doing this neighborhood update for each case over multiple presentations of the dataset is that the nodes self-organize into cluster centers where similarity is inversely related to the distance to each other on the grid (Kohonen, 1990). The axes and distances of the grid are meaningless, but it creates an organized map of cluster centers with similar clusters near each other in a grid. This is a valuable tool of visualizing data and validating clusters (Berkhins, 2006). When mapping to a single set of clusters, the algorithm is very similar to the k- means algorithm. Like k-means, cases are mapped to the closest center, often determined by Euclidean distance. The researcher can specify a set of starting centers, or the algorithm will randomly select the centers from the data, and the researcher also has to specify how many clusters to identify (Bullinaria, 2004). As with the Hartigan and Wong 52 (1979) version of the k-means algorithm, SOM updates the centers after assigning a case to a center’s cluster and continues to do so until cluster memberships stop changing (Berkhins, 2006). The primary difference between k-means and SOM is that each time a case is assigned to a cluster, the center for that cluster is updated using a learning rate coefficient, but the centers for nearby clusters are also updated with the same learning rate coefficient to bring them closer to the winning cluster’s center. This adaptive step is desirable because it allows the researcher to map k clusters onto a grid or table with k cells based on the similarities between their centers’ coordinates; i.e., clusters can be displayed in a grid such that similar clusters are closer to each other. The axes and dimensions of the grid do not represent anything in of themselves (i.e., the researcher with 12 clusters could specify a 4 × 3 grid or a 6 × 2 grid), but the adaptive steps will assign each cluster to a space on the grid so that neighboring clusters are similar to each other. This tabular grid representation of the clusters cannot be achieved with k-means because it does not adjust nearby clusters based on a case’s newly assigned membership. The actual dissimilarity between clusters in multivariate space will of course vary with both algorithms—some cluster pairs may be more similar than others. The SOM grid is simply a useful tool for interpretation and summarization. The magnitude of the change of nearby clusters in the adaptive step depends on their proximity to the winning cluster center and the number of iterations (t) the algorithm has run. The SOM starts with a large learning rate coefficient (αt) which is used to shift the cluster centers (mk) for all clusters in a large neighborhood (Nc) surrounding the winning cluster’s center (mc). The initial radii of the neighborhoods can be set by the 53 researcher, but it diminishes as the algorithm runs. In other words, as time (iterations) progress, the neighborhood around each cluster shrinks to zero so that nearby clusters are not modified when a cluster is updated, and the clusters themselves are not changed as much by the presentation of new cases because the effect of these new cases is weighted by a decreasing learning algorithm. This is useful in cases where the researcher presents the same cases to the SOM network over and over again to achieve a more stable estimate of cluster centers. The initial cluster changes are large, with cluster centers being moved substantially by new cases and by changes in neighboring clusters. As the algorithm runs through its iterations, the learning rate coefficient and the size of the neighborhood shrink until eventually there are only minute, fine-tuning changes to the winning cluster’s center (Bullinaria, 2004). The size of the neighborhood and the rate of decrease can be set by the researcher. The rate of decrease may be linear or non-linear, and the neighborhood may exist for all of the SOM iterations, or it may be defined so that the neighborhood radius shrinks to zero after a set number of iterations have been completed. For example, in the default settings of the som package for R, the neighborhood’s radius is chosen to be larger than 2/3 of the unit-to-unit distances for all of the starting cluster centers (mk). The som package then linearly decreases the radius of the neighborhood over 1/3 of the iterations chosen by the researcher (Wehrens & Buydens, 2007). If 2/3 of the starting cluster centers are 100 Euclidean distance units away from each other, and the researcher specifies 300 iterations, then the radius of the neighborhood will decrease by one unit at each of the first 100 iterations, after which only the winning cluster’s center (mc) will be updated. Once the neighborhood radius diminishes to zero, clusters near the winning 54 cluster are no longer updated when cases are reassigned, and the SOM algorithm solution is then identical to the logic used by the k-means algorithm (Kohonen, 1990). 𝒎𝑘.𝑡+1 = { 𝒎𝑘.𝑡 + 𝛼𝑡(𝒙𝑖.𝑡 − 𝒎𝑘.𝑡) if 𝑖 ∈ 𝑁𝑐.𝑡 𝒎𝑘.𝑡 if 𝑖 ∉ 𝑁𝑐.𝑡 (5) Equation 5 shows how these updates occur at every iteration t. The SOM algorithm begins by creating k clusters with centers mk by randomly selecting k cases from the data. At each iteration t, the SOM algorithm selects a case xi from the data and finds the cluster whose center is closest. This closest cluster has center mc. The algorithm then finds all of the clusters in the neighborhood of the winning cluster. The size of the neighborhood Nc.t shrinks as iteration t increases. The centers mk of all of the clusters in Nc.t are then updated with the learning rate coefficient αt to be closer to xi.t. The algorithm then moves to the next iteration, selects a new case, decreases the size of the neighborhood radius and learning coefficient, and proceeds to update the cluster centers again (Kohonen, 1990). Note that a case’s cluster membership is defined by its closest cluster, but the closest cluster may change as the cluster centers are updated in each iteration. The SOM algorithm completes when the cases’ cluster membership stop changing. Figure 4 depicts the iterative cluster center updates in a two-dimensional space. 55 t = 1 t = 2 Figure 4. Visualization of the SOM learning process updating cluster centers across two iterations. Cluster centers within the winning cluster’s neighborhood are updated to be closer to the case’s data in each iteration. In addition to the appeal of displaying topological relationships of the cluster centers in a two-dimensional grid, SOM has other benefits for GBA applications. First, SOM does not rely on any assumptions about the distributions of the data and the solutions are not heavily influenced by outliers (Wehrens & Buydens, 2007). This is because, unlike k-means, SOM does not ever calculate a cluster center’s coordinates by taking the mean coordinates of all the cases assigned to the cluster. Instead, the cluster centers are moved incrementally depending on the case considered at each iteration. On the downside, SOM, like k-means, is dependent on the starting centers, but it is not deterministic. The final clustering solution will be a product of the initial starting centers, the order in which cases were added to update the cluster centers, and the size of the neighborhood and learning rate coefficient at each of those iterations. Furthermore, researchers may not always have a theory to guide the parameter settings for the size of the neighborhood, the learning rate coefficient, or the rate at which these parameters Nc.2 • m1.1 • m2.1 • m3.1 □ x1 • m1.2 • m2.2 • m3.2 □ x2 Nc.1 56 change, which might make the final solution somewhat arbitrary. Like k-means and other algorithms which rely on similarity measures, SOM may not work well for sparse datasets (Berkhins, 2006). Robust Clustering Using Links (ROCK). One problem with GBA data is that data are frequently tagged as binary variables with a 1 representing the observation of an action and 0 representing no observation. This measurement scale has two implications: the datasets become sparse (large number of variables with lots of 0 values), and that sparsity means that many cases look similar simply because they have so many 0 values in common. To handle these data scenarios, clustering algorithms like the Robust Clustering with Links (ROCK) algorithm have been developed which cluster cases based on the co- occurrence of categorical data. The ROCK algorithm also uses a similarity measure, and like k-means and SOM, the researcher can pick the metric, but the researcher must also define the threshold for labeling two cases as similar. If they are similar, they can be linked to each other and to other similar cases. Using this method, many links can be made between cases, but the ROCK algorithm is built around the concept that two similar cases are more likely to belong to the same cluster if they share a high number of links; i.e., they are similar to a lot of the same cases. ROCK uses a bottom-up hierarchical clustering to combine clusters until the algorithm finds k clusters (k being specified by the researcher). These clusters are defined by finding a clustering solution that maximizes the within-cluster links for all pairs of cases in the cluster (Guha, Rastogi, & Shim, 1999). The bottom-up hierarchical clustering steps are as follows: 57 1. Using a similarity measure (e.g., distance measure) specified by the researcher, assign all pairs of cases a similarity value between 0 and 1, where 1 represents identical cases and 0 represents maximum dissimilarity. 2. The researcher specifies a similarity threshold, θ. If a pair of cases’ similarity value is equal to or greater than θ, then those two cases are considered neighbors. 3. For each case, identify all of its neighbors. 4. For each pair of cases, calculate the number of links, which is the number of common neighbors they share; i.e., the number of overlapping neighbors for the pair of cases. The concept behind ROCK is that if two points share a lot of neighbors (high number of links), then they are more likely to belong to the same cluster, even if the two cases are not similar to each other. 5. Assign all cases to a cluster. At the first iteration, each case is its own cluster. 6. For a pair of clusters Ci and Cj, calculate the number of links between the cases xq in Ci and cases xr in Cj: 𝑙𝑖𝑛𝑘(𝐶𝑖 , 𝐶𝑗) = ∑ 𝑙𝑖𝑛𝑘(𝒙𝑞 , 𝒙𝑟) such that 𝒙𝑞 ∈ 𝐶𝑖 and 𝒙𝑟 ∈ 𝐶𝑗 (6) 7. To determine which two clusters should be merged next, calculate the goodness measure (g) for all pairs of clusters, where ni is the number of cases in Ci and nj is the number of cases in Cj: 𝑔(𝐶𝑖, 𝐶𝑗) = 𝑙𝑖𝑛𝑘(𝐶𝑖, 𝐶𝑗) (𝑛𝑖 + 𝑛𝑗) [1+2( 1−𝜃 1+𝜃 )] − 𝑛𝑖 [1+2( 1−𝜃 1+𝜃 )] − 𝑛𝑗 [1+2( 1−𝜃 1+𝜃 )] (7) 8. Merge the two clusters with the highest goodness measure and update all link calculations between clusters. Continue iteratively merging clusters with this 58 bottom-up hierarchical clustering method until the desired number of clusters, k, is obtained. ROCK is a good candidate for GBA data because of its ability to handle sparse datasets. It does not have any requirements for data distributions, and because of the method of identifying similarities and links, it essentially segregates and ignores outliers, which may be helpful in GBA process data if there are occasional, atypical response patterns. Like SOM, ROCK requires the researcher to specify the parameters, in this case the similarity measure to be used and the similarity threshold (θ) constraining the number of links in a case’s neighborhood, and researchers may not know how to select these values. To summarize, GBAs can be considered assessments that have construct- irrelevant game elements included to make the assessment more fun or engaging, but these elements may interact with relevant response patterns. Furthermore, high autonomy game environments increase the number of response patterns that may be observed, and as a result, GBA data often take the form of sparse, binary datasets. When designing the GBA, researchers may not be able to use a principled approach to plan out how to evaluate every possible response pattern. Instead, researchers must turn to empirical approaches to mine response data and identify interpretable patterns in the data. These empirical approaches typically employee clustering algorithms like k-means, SOM, and ROCK, but algorithms may be selected based on convenience, and researchers may not validate the results with multiple approaches. To demonstrate the potential differences and between these algorithms and the potential impact on a researcher’s interpretation of sparse, binary GBA data, this study 59 seeks to evaluate how well the SOM, k-means, and ROCK algorithms identify clusters in the presence of data that are irrelevant to the structure of the clustered data. Various data conditions (detailed in Chapter 4) are simulated to make these comparisons. In addition, the three clustering algorithms are used to identify clusters within real GBA process data from a math game, Beanstalk (Carnegie Mellon, 2013). These results are discussed in a tutorial study, examining the steps and choices that should be addressed when using clustering algorithms with GBA data. These inquiries illustrate how these algorithms can differ under varying data scenarios and provide guidance or cautions to researchers using empirical approaches to evaluate GBA performance. 60 Chapter 3: Feasibility Study A small-scale feasibility study was conducted to explore how to analyze and compare the success of the k-means, SOM, and ROCK algorithms in high-dimensional, sparse, binary datasets that are common in GBA research. The feasibility study did not seek to address any research questions directly, but instead served to demonstrate that these three algorithms could yield results that could be compared and contrasted. It also served as a learning exercise to help explore practical issues before moving to a full-scale study (e.g., how to standardized cluster labels so that comparisons can be made between the algorithms’ outputs). The feasibility study specifically focused on two small simulated datasets and a sample GBA dataset from a multiplication game. An artificial dataset was also appended to the GBA dataset to mimic a set of ten students who consistently used an incorrect process (addition instead of multiplication). Simulated Data for the Feasibility Study The first simulated dataset, S1, contained 50 cases and only three variables; these 50 cases were generated as coming from three clusters. The base probability of an observed binary response in any of the three variables, v1-v3, was Pr(vi=1) = 0.05, with exceptions for each cluster and variable shown in Table 1. Each case’s observed response to the three variables was randomly generated from a binomial distribution using the rbinom function in R (R Core Team, 2014), using the specified probabilities reported in Table 1. 61 Table 1 S1 Simulated Data Parameters Cluster n Pr(v1=1) Pr(v2=1) Pr(v3=1) 1 15 0.90 0.05 0.05 2 15 0.05 0.90 0.05 3 20 0.05 0.05 0.90 The second simulated dataset, S2, also contained 50 cases but had six variables, v1-v6. The probability of an observed response in any of the six variables was Pr(vi=1) = 0.05, with exceptions for each cluster and variable shown in Table 2a. As with S1, each case’s observed response to the six variables in S2 was randomly generated from a binomial distribution using the rbinom function in R (R Core Team, 2014), using the specified probabilities reported in Table 2a. An example of cases from this simulated dataset, S2, is shown in Table 2b below. Table 2a S2 Simulated Data Parameters Cluster N Pr(v1=1) Pr(v2=1) Pr(v3=1) Pr(v4=1) Pr(v5=1) Pr(v6=1) 1 15 0.90 0.50 0.05 0.05 0.05 0.05 2 15 0.05 0.05 0.90 0.50 0.05 0.05 3 20 0.05 0.05 0.05 0.05 0.90 0.50 Table 2b Example of S2 Simulated Data (First Six Cases) Case Cluster v1 v2 v3 v4 v5 v6 1 1 1 0 0 0 0 0 2 1 1 1 0 0 0 0 3 1 0 1 0 0 0 0 4 1 1 0 0 0 0 0 5 1 1 0 0 0 0 0 6 1 1 0 0 0 0 0 62 GBA Data for the Feasibility Study GBA data were provided by Derek Lomas (personal communication, July 4, 2014) from the game Number Jumble (BrainPOP, 2015). The data are anonymous, and information was not available about the conditions under which they were collected or the sample of students who played the game. Number Jumble is a multiplication game where students must move numbered tiles on a grid to coordinates so that the product of the coordinates equals the value on the tile (BrainPOP, 2015). For example, a student would move the 6 tile to the intersection of 2 and 3 because 2 × 3 = 6 (see Figure 5). In many ways, it is similar to the 15 Puzzle example discussed in Chapter 1, in that it has a grid of numbers, and there are ways to solve the Number Jumble grid more efficiently. Several grids with varying sizes and factors can be played in Number Jumble as different levels. Of the 328 response records provided for 14 levels, all but one of the levels had a sample size of 40 or fewer (mean = 16 results, s.d. = 14 results). The remaining level had 109 response records, and it also happened to be the simplest grid: a 3 × 3 grid with coordinates on both axes equal to [1, 2, 3], as is shown in Figure 5. Given the larger sample size and its simplicity, this remaining level was chosen for the demonstration and proof of concept in the feasibility study. 63 Figure 5. Level Mul_1 from Number Jumble (BrainPOP, 2015). The tiles in Number Jumble are randomly placed at the start of the level with the stipulation that tiles cannot be placed on their solution coordinates. The student must then use his or her mouse to drag tiles to their correct coordinates (the displaced tile will switch positions with the tile moved by the student). Bonus points are earned by working quickly. The student may also pick up a tile and then return it to its starting point (BrainPOP, 2015). Table 3 shows an example of log file data from Number Jumble that record every move made by the student. Recall in Chapter 1 how one could conceive of recording 15 Puzzle gameplay as tile movements, sequences, and move times—this is what Number Jumble has recorded in its work process data, along with additional information. The Number Jumble data represented individual moves made by students, and each move’s sequence and time stamp were recorded (see “Seq.” and “Time Stamp” columns in Table 3). Results were labeled with a student identification number (“SID”). The number of moves in a result ranged from 1 to 54, with an average of 8.7 moves (s.d. = 7.4). A player who made all correct moves would be able to complete this 3 × 3 level in eight moves (the ninth solution is automatic), assuming no added benefits like having a displaced tile also land in its solution coordinates. 64 Table 3 Excerpts of Number Jumble Log File Data (First Five Moves) Time Stamp SI D Seq. Pickup Numbe r Picku p X Picku p Y Dro p X Dro p Y result Level Name Reaction Time 4/13/2013 15:34 999 1 2 1 1 2 1 correct Mul_ 1 1.1 4/13/2013 15:35 999 2 6 1 1 3 2 correct Mul_ 1 1.3 4/13/2013 15:35 999 3 4 3 3 1 2 incorrec t Mul_ 1 0.6 4/13/2013 15:35 999 4 2 2 2 2 3 incorrec t Mul_ 1 1.0 4/13/2013 15:35 999 5 9 1 3 2 3 incorrec t Mul_ 1 1.1 These data were uploaded into the R statistical software (R Core Team, 2014) for analysis. The dataset of Number Jumble response cases from the game level shown in Figure 5 was labeled M1. A second dataset, M2, was created by adding a small set of simulated cases representing a hypothetical incorrect response patterns to the real responses in M1. The following paragraphs detail the construction of M1 and M2. M1: Responses. Each move was coded using the format ... This means that if a student moved the 1 tile to the coordinates (3, 2), the move was coded 1.3.2 (see Table 4). Data were coded binary, with 1 indicating the student made the move, and 0 indicating that the student did not make that move. No distinctions were made for students who made the same wrong move multiple times versus those who made a wrong move once, nor were distinctions were made between commutative factors. For example, a student must make the moves 6.3.2 and 6.2.3 in the game, but these were combined into two instances of the same multiplication problem and coded under the same variable: 6.3.2. Additionally, if a student picked up a tile and returned it to its starting position, the move was coded 999. 65 Table 4 shows an example of the data in M1. The first case, SID 13693303257821.1, shows that the student knew that 1 = 1 × 1 (i.e., 1.1.1 = 1), but also tried either 1 = 2 × 1 or 1 = 2 × 1 (i.e., 1.2.1 = 1). The student did not, however, try 1 = 2 × 2 (i.e., 1.2.2 = 0). For simplicity, no additional data about response order or game state are included in M1 for this initial exploration. Table 4 Excerpt of M1 Data (First Six Cases) .. SID 1.1.1 1.2.1 1.2.2 1.3.1 1.3.2 13693303257821.1 1 1 0 1 0 13693191284070.1 1 0 0 0 0 13693275705245.1 0 0 0 0 0 13693276443395.1 0 0 0 0 0 13693307246903.1 0 0 0 0 0 13693312091466.1 0 0 1 0 1 M2: Added simulated data. To help gauge the accuracy of clustering solutions with a set of responses from a known cluster, ten artificial cases were appended to the real data in the M1 dataset to create a second dataset, M2. These ten cases were copies of ten randomly selected cases in M1, but observations were changed from 0 to 1 for three variables: 2.1.1, 3.2.1, and 6.3.3. This pattern would be observed for a small cluster (i.e., these 10 results) where students mistakenly used addition instead of multiplication for three of their responses. For example, 2.1.1 = 1 represents that a student made a move representing 2 = 1 × 1, perhaps thinking instead that they were supposed to calculate 2 = 1 + 1. Similarly, the correct products were marked as unobserved for the ten simulated cases (e.g., 2.2.1 = 0). Since these were artificial cases purposefully created to share similarities in their incorrect response patterns, they are expected to all be mapped to the same cluster in any clustering solution. If they are split between clusters, then the 66 clustering algorithm has failed to recover this simulated set, which may suggest that its cluster assignments are not useful. Table 5 shows an example of the response data for 2.1.1 and 2.2.1 for some of these artificial cases. Table 5 Excerpt of M2 Simulated Data (First Six Cases) .. SID 2.1.1* 2.2.1* 2.2.2 2.3.1 2.3.2 adder1 1 0 0 0 0 adder2 1 0 0 0 0 adder3 1 0 0 0 0 adder4 1 0 0 0 1 adder5 1 0 0 0 1 adder6 1 0 0 0 0 *Manipulated by researcher to define simulated group. Procedure An exploratory analysis was used to compare clustering solutions for the simulated datasets, S1 and S2, and the Number Jumble datasets, M1 and M2 (which included the simulated Number Jumble cases). The k-means and SOM algorithms were run through 20,000 cycles to yield the most stable solution. For the k-means algorithm, this means that the k-means algorithm was run 20,000 times using a different random sample of starting cluster centers each time. The most stable solution of those 20,000 is the solution that minimizes the within sum-of-squares distances for the clusters. For the SOM algorithm, this means that the dataset was presented to the SOM network 20,000 times, with each presentation resulting in iterative updates to the cluster centers that decrease as the neighborhood and learning coefficient decay over each iteration. The ROCK algorithm was run only once because it always yields the same results for a given dataset and theta parameter. At a high level, the analysis procedures followed this pattern: 67 1. Run 20,000 cycles of k-means and SOM for each dataset, specifying three clusters for S1 and S2 and two clusters for M1 and M2. Three clusters were selected for S1 and S2 since these were known from the simulated data. Two clusters were selected for M1 and M2 based on exploratory analysis of fit for different numbers of clusters, which is discussed further in the following section. 2. Run ROCK algorithm once for each dataset, adjusting theta downward after each solution until the hierarchical algorithm converges on three clusters for S1 and S2 or two clusters for M1 and M2. 3. Compare clustering solutions by looking at the degree of agreement between two solutions, their recovery of known clusters from the simulated data, and how well the solutions fit the data. The following sections discuss the procedures developed around each of these steps, as well as the findings and decisions that were made through the process. Setting the number of clusters. As noted previously, cluster validation often hinges on interpretability (Hand, Mannila, & Smyth; 2001; Kriegel et al., 2009), so the most useful clustering solution might not be the best-fitting solution. Even objective methods of identifying the number of clusters may still rely on a person’s interpretation. For example, Hothorn and Everitt (2009) recommend plotting the within sum of squares for each clustering solution over a range of possible numbers of clusters. Several methods have been proposed for determining the optimal number of clusters based on the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC), and Pelleg and Moore (2000) developed an algorithm, x-means, which uses these information criteria to determine the optimal number of clusters. Extensions of this strategy have been 68 developed for non-spherical Gaussian distributions (Hamerly & Elkan, 2004) and for SOM algorithms (Wang, Yang, Mathee, & Narasimhan, 2005). Pelleg and Moore (2000) found that x-means found better clustering solutions than repeated k-means trials on simulated Gaussian distributions, suggesting that the number of clusters determined by x-means might be a stable solution, but in a simulation study, Ishioka (2005) found that x-means did not always return the same recommended number of clusters, and in cases where a cluster was split evenly, x-means might return more clusters than expected. The x-means algorithm also performs well with large datasets (Pelleg & Moore, 2000), but it may not return optimal solutions when the data are not scaled in spherical, multivariate normal distributions around centroids (Hamerly & Elkan, 2004). Given these constraints, x-means is not an acceptable method for selecting the number of clusters in sparse, binary GBA response data. Exploratory analysis of the Davies-Bouldin Index (Davies & Bouldin, 1979) for the clustering solutions for M1 suggest that there may not be any clusters (i.e., k = 1). This may reflect the simplicity of the Number Jumble level. Clustering algorithms will produce any number of clusters requested, even if it does not make sense to do so. For the purposes of this feasibility study, M1 can be analyzed with k = 2 clusters, even though there may not actually be two clusters in the data. This still provides a basis of comparison for when the clustering algorithms are run (also with k = 2) with the added simulated cases in M2. S1 and S2 are simulated to have k = 3 clusters. Running clustering algorithms. For each dataset, three different analyses were run: k-means, SOM, and ROCK. The k-means algorithm was run 20,000 times, with each 69 cycle beginning with a different set of k starting centers randomly selected from the cases in the dataset. Similarly, the SOM algorithm was run through 20,000 cycles, where the dataset was presented again to the SOM network at each cycle, allowing the estimates of the cluster centers to be fine-tuned. SOM and k-means clustering solutions are both dependent on the starting centers. The ROCK algorithm was run once per dataset because it does not require starting values and can only return one possible solution for a given dataset. This allowed for comparisons of agreement between the three algorithms’ results for each dataset, as well as comparisons of their results to the known cluster structures that were used to create datasets S1, S2, and M2. For the k-means analyses, R was used to run the kmeans function in the stats package using k clusters (R Core Team, 2014). The default value of one random starting set per analysis was used, meaning that for each of the 20,000 iterations of kmeans, it randomly selected k of the cases from the dataset as its starting points. Following the recommendations in the kmeans documentation, the default k-means algorithm from Hartigan and Wong (1979) was used. For the SOM analysis, R was used to run the som function in the kohonen package (Wehrens & Buydens, 2007). The grid for mapping the clusters was set to have dimensions of 1 × k so that it could accommodate any k suggested by x-means. The default values were used for the number of times the dataset would be presented to the algorithm (i.e., 20,000 presentation updates) and for the SOM learning rate (a linear declination from 0.05 to 0.01 over each of the 20,000 updates). The default neighborhood radius was also used, meaning the starting radius value was set to cover 2/3 of all unit-to- unit distances in the neighborhood of the potential cluster. A hexagonal grid was used (as 70 opposed to a rectangular grid) in a circular neighborhood, but there is no difference in visualizing the output when k = 2. For the ROCK analysis, R was used to run the rockCluster function in the cba package (Buchta, & Hahsler, 2014). The similarity index, theta, was set to 0.95, which means that results needed to be very close together to create within-cluster links. If the ROCK algorithm could not converge on k clusters, then theta was decreased by increments of 0.05 until the ROCK algorithm did converge. Lower theta values indicate that there needs to be less overlap between two response patterns for a link to be established. Note that ROCK does not use starting centers, but its solution is based on the value of theta. This means that it returns the same clustering solution every time, given the same theta, which is why only one application of the ROCK was needed for each dataset. Comparing clustering solutions. As Kriegel et al. (2009) observed, there is no universal method for comparing clustering solutions, and some clustering methods cannot be logically compared with others. Fortunately, k-means, SOM, and ROCK are very similar algorithms (i.e., they all are soft-projected clustering algorithms as described in Chapter 2, and they share the goal of segmenting data based on similarities), so one can examine the solutions to determine how well they agree and which solution fits best. For this feasibility study, comparisons were made between cases’ individual cluster assignments from k-means, SOM, and ROCK in order to determine the consistency between algorithms. Before comparisons could begin, the clustering solutions had to be recoded so that identical clusters (or similar clusters) were labeled the same way. For example, if one 71 solution put 75 cases in cluster 1 and a second solution put the same 75 cases into cluster 2, it would be helpful to change the labels in the second solution so that the cluster was labeled cluster 1 in both solutions. Unfortunately, clusters do not always match up perfectly, so a procedure was developed to recode the solutions using the following steps: 1. Take two clustering solutions, A and B, which show the k cluster memberships for n cases. 2. For each clustering solution, A and B, create a k × n membership matrix, where cell kinj = 1 if the j th result has been classified in the i th cluster of the solution and kinj = 0 otherwise. 3. For each cluster row in the membership matrix for clustering solution A, calculate the distance to each cluster row in membership matrix B in the cases’ space. 4. Find the cluster membership rows which are closest to each other from solutions A and B. These two clusters have the best overlap in terms of the cases. Give them the same cluster label. 5. Find the next pair of closest membership rows, omitting the first two rows. Give this new pair the same cluster label. 6. Repeat step 5 until there are no clusters left to match. Figure 6 provides an example of this recoding process in a two dimensional space. In Figure 6, the two clustering solutions have three clusters each (A1-A3 and B1-B3). Solutions B3 and A2 are closest, so they are recoded to have the same label. Solutions B1 and A3 are the next closest, so they too are recoded to have the same label. A1 and B2 are left and are given the same label. There were clustering solutions that were previously closer to A1 and B2, but they have already been paired with other clusters that were an 72 even better match. For the matrices described above, there would be n axes (one for each case), and each cluster solution would have a coordinate of either 1 or 0 on each axis. Figure 6. Example of recoding sequence of two-dimensional cluster solutions (A and B) with k = 3. The study data only had coordinate values of 0 or 1 and n dimensions. Once the cluster solutions have their cluster identifiers matched, metrics are needed to compare the similarities between two clustering solutions. One simple comparison is the percentage agreement between two clustering solutions; i.e., the percentage of cases that were classified the same way in both algorithms. This was done by creating a pivot table showing the frequency of cases classified under each clustering solution. Table 6 shows an example of a hypothetical pivot table for a k-means clustering solution and a SOM clustering solution. Note that 89/109 = 82% of the cases were classified in the same clusters by both algorithms in this comparison. A1 A3 B1 B2 B3 A2 -3 -2 -1 0 1 2 3 4 -3 -2 -1 0 1 2 3 4 5 1 st 3 rd 2 nd 73 Table 6 Example of a Pivot Table Showing Classification Agreement Between Two Clustering Solutions Cluster Solution X 1 2 Cluster Solution Y 1 39 16 2 4 50 Percentage comparison is a simple and practical measure for comparing the consistency between clustering solutions, but another common measure is Cohen’s Kappa, which is often used to describe agreement between human raters. Cohen’s Kappa is a measure of agreement that accounts for the amount of agreement that would have occurred by chance based on the individual raters’ (or clustering algorithms’) distribution of ratings. The unweighted kappa formula is used because the cluster categories are nominal (Cohen, 1960), and the formula for unweighted Kappa is: 𝜅 = Pr(𝑎𝑜) − Pr(𝑎𝑒) 1 − Pr(𝑎𝑒) (8) where ao is the observed rate of agreement and ae is the expected rate of agreement based only on chance, given how many times two algorithms each assigned cases to each cluster. In the example in Table 6, the kappa value for this pair is 0.63. While 82% agreement is very useful for interpretation, it is also helpful to know that that agreement is not just an artifact of the clustering solutions. For example, two clustering solutions that dumped most cases into a single large cluster would have a high percentage agreement, but one might be interested on whether they agreed on which remaining cases did not belong to the large cluster. For each cluster solution for each dataset in the study (S1, S2, M1, M2), the percentage agreement was calculated using R (R Core Team, 2014) and Cohen’s Kappa 74 was calculated using the cohen.kappa formula in the R psych package (Revelle, 2015). The same statistics were calculated to evaluate how well the clustering solutions matched the expected “true” clusters used to create S1, S2, and M2. Clustering solutions can also be evaluated and compared based on an internal criterion. One common method for evaluating a clustering solution is the Davies-Bouldin Index (Davies & Bouldin, 1979). The Davies-Bouldin Index (DBI) was originally proposed as a method for selecting the number of clusters, much as the x-means algorithm does, but unlike x-means results, the DBI can be used to compare multiple clustering solutions to find the one that provides the best result. The concept behind the DBI is that good clustering solutions should provide clusters that are not too similar to each other. If two clusters are close to each other or if clusters have a lot of within-cluster dispersion, then they are similar, but clustering solutions that are further apart and have little dispersion are more desirable. DBI is designed so that the clustering solution with the lowest index values is considered to be best (Davies & Bouldin, 1979). Another way to say this is that the best clustering solution is one where there is higher between-cluster variance and lower within-cluster variance. DBI is calculated with the following formula: DBI = 1 𝑘 ∑ max 𝑖≠𝑗 𝑆𝑖 + 𝑆𝑗 𝑀𝑖𝑗 𝑘 𝑖=1 (9) where k is the total number of clusters, Si is the average distance between the center of cluster i and its assigned cases (the same goes for Sj), Mij is the distance between the centers of clusters i and j. Cluster j is the cluster that maximizes this ratio for cluster i, meaning that it is the most conservative scenario—cluster j is the cluster that is closest to cluster i or has high within-cluster variance (or both) (Davies & Bouldin, 1979). The 75 distance measure used for Si, Sj, and Mij is typically Euclidean distance, but DBI can use other distance measures as well. From an application standpoint, this may not be the most useful solution. Interpretability may be more valuable than small differences in fit, and from the standpoint of GBA data, there is no reason that one would want to dismiss two similar yet different clusters of results. Recall that Amershi and Conati (2009) warned that clustering solutions might mask subtle, yet instructionally-relevant differences between student responses. One could imagine that this would be more likely if researchers blindly chose the output that yielded the biggest differences between clusters. With these cautions in mind, the Davies-Bouldin Index is still a suitable descriptive statistic for comparing the fit of different clustering solutions in this feasibility study, regardless of the algorithm. For each of the four datasets in the study, the Davies-Bouldin Index was calculated for three clustering solutions using the intCriteria function in the clusterCrit package for R (Desgraupes, 2014). These indices were saved, along with identifiers indicating to which clustering algorithm solution they corresponded. Performance of the Clustering Algorithms for Datasets S1 and S2 Before proceeding to the GBA datasets, M1 and M2, the clustering solutions were analyzed using the simulated data from S1 and S2. This provided insight for how these three algorithms compare with each other using simple, low-dimensional data. First, the clustering solutions were compared to the true clusters used to generate S1 and S2 in order to evaluate how well they were able to recover the true clusters. Table 7 shows the true cluster recovery for S1, and Table 8 shows the true cluster recovery for S2. Kappa values and percentage agreement are shown below each matrix. Landis and 76 Koch (1977) indicated that Kappa values between 0.81 and 1.00 represent near perfect agreement. Notice in both tables that k-means and SOM perform just as well. They each have a good match with the true clusters in S1 (Kappa = 0.73, 82% agreement with the true clusters) and near perfect recovery in S2 (Kappa = 0.91, 94% agreement with the true clusters). Surprisingly, Tables 7 and 8 show that the ROCK algorithm tended to lump the results into a single large group, so the ROCK solution did a poor job recovering the three true clusters in S1 and S2 (Kappa = 0.07 and 0.43, respectively). This poor performance is unexpected for an algorithm designed to accommodate binary data, but it may be a reflection of the small number of variables in these datasets. The ROCK algorithm may perform better with more variables in the data, and this was considered in the proposal for the final study. Consistency between clustering algorithms was investigated next. While perfect consistency might not be expected from different algorithms, it is worth seeing how well these three algorithms agree with each other if they are all potential candidates for evaluating the process data. Table 7 S1: True Cluster Recovery by Algorithm k-means ROCK SOM Cluster 1 2 3 1 2 3 1 2 3 T ru e 1 12 2 1 T ru e 1 0 0 15 T ru e 1 13 2 0 2 0 15 0 2 0 1 14 2 0 14 1 3 5 1 14 3 1 4 15 3 1 5 14 Kappa 0.73 0.07 0.73 % Agreement 82% 36% 82% 77 Table 8 S2: True Cluster Recovery by Algorithm k-means ROCK SOM Cluster 1 2 3 1 2 3 1 2 3 T ru e 1 18 2 0 T ru e 1 0 0 15 T ru e 1 19 1 0 2 1 14 0 2 0 13 2 2 2 13 0 3 0 0 15 3 1 1 19 3 0 0 15 Kappa 0.91 0.43 0.91 % Agreement 94% 64% 94% Table 9 illustrates the level of agreement between the three clustering algorithms for their solutions for S1. The three matrices in Table 9 show the number of results classified into each cluster by the algorithms. One observes that the k-means and SOM solutions have good, but not perfect agreement (Kappa = 0.67). The ROCK solution has very poor agreement with the other two solutions because the ROCK solution grouped 44 of the 50 results into a single cluster. The ROCK and k-means solutions have a paltry 0.04 Kappa value, and the ROCK and SOM comparison yields an abysmal -0.17 Kappa. In this first example, k-means and SOM yield somewhat similar solutions, whereas the ROCK algorithm forced most cases into a single large cluster. Table 9 S1: Cluster Agreement Between Pairs of Algorithms k-means ROCK SOM Cluster 1 2 3 1 2 3 1 2 3 S O M 1 11 1 2 k- m ea n s 1 1 5 11 R O C K 1 0 1 0 2 6 15 0 2 0 0 18 2 0 5 0 3 0 2 13 3 0 0 15 3 14 15 15 Kappa 0.67 0.04 -0.17 % Agreement 78% 38% 30% 78 Table 10 shows the same algorithm comparisons for S2. Recall that S2 has six binary response variables whereas S1 had only three. Note the immediate improvement in agreement across all three comparisons shown in Table 8. For S2, the k-means and SOM algorithms had nearly perfect agreement (Kappa = 0.94). The agreement with the ROCK algorithm improved over the S1 solutions, but not drastically. Following the same behavior observed in S1, the ROCK algorithm again produced one large cluster, this time with 36 cases. The ROCK solution’s agreement with the k-means solution was fair (Kappa = 0.47), and the agreement with the SOM solution was again the poorest (Kappa = 0.06). These results yield some insights about how these algorithms perform under the data conditions in S1 and S2. First, notice that the consistency between SOM and k- means and their ability to recover the true clusters increases in S2. This is attributable to the number of variables—specifically the irrelevant variables where the probability of observation for a given cluster was 0.05. Since these algorithms are soft projection algorithms, they consider all of the variables equally. This means that one discrepant observation in a vector of process data holds less weight for assigning the cluster when there are more variables. Remember too that the unobserved variables are just as important as the observed variables for defining cluster membership. 79 Table 10 S2: Cluster Agreement Between Algorithms k-means ROCK SOM Cluster 1 2 3 1 2 3 1 2 3 S O M 1 19 2 0 k- m ea n s 1 19 0 0 R O C K 1 21 13 2 2 0 14 0 2 15 1 0 2 0 1 0 3 0 0 15 3 2 0 13 3 0 0 13 Kappa 0.94 0.47 0.06 % Agreement 96% 66% 42% Conversely, the ROCK algorithm struggled under these conditions. Although the ROCK algorithm is designed to handle binary data, there may not have been enough variables for the ROCK algorithm to perform well. Although Guha et al. (1999) do not specify a lower limit for the number of variables to use with the ROCK algorithm, its successful applications in research tend to have at least 15 (and as many as several hundred) variables (e.g., Guha at al. 2014; Song & Li, 2006). For datasets S1 and S2, it is possible that with so few variables, more links are identified in the ROCK algorithm, ultimately forcing many results into a single cluster in order to achieve the desired solution with k = 3 clusters. Because of this tendency to create one large cluster, the ROCK solution did not have good agreement with the other two algorithms and did a poor job at recovering the true clusters in S1 and S2. SOM and k-means did an equally good job at recovering the true clusters in S1 and S2, but since they did not arrive at the same solution, which one was the best? According to the Davies-Bouldin Index (DBI), k-means performed slightly better. The DBI can be used to determine how well a given clustering solution creates well-defined clusters, as determined by small within cluster variance and larger between cluster 80 variance. Recall that lower values of DBI indicate better fit. Table 11 shows the DBI values for the three solutions for S1 and S2. Note that SOM had the best-fitting solution in S1, but k-means had a slightly better fitting solution in S2 (as indicated by having the lowest DBI). Given the high true cluster recovery rate of both SOM and k-means in S2, it is not surprising that their DBI indices are so close—they both returned results very similar to the true clusters. Table 11 DBI Results by Solution for S1 and S2 Algorithm DBI for S1 DBI for S2 SOM 0.51 1.02 k-means 0.61 0.96 ROCK 1.29 1.15 One can also inspect the true cluster recovery rate by looking at the cluster centers of the solutions. If they were a perfect recovery of the true clusters, these coordinates will be very close to the probabilities of observing variables in each of the clusters. Figures 7a and 7b show the magnitudes of the cluster centers for each solution and the true clusters in S1 and S2. Notice that for each cluster, the centers for the SOM and k-means solutions approximate the probabilities defining the true clusters, whereas the ROCK solution tended to have different cluster centers. 81 Figure 7a. Cluster center magnitudes (mean coordinates) by variable and clustering solution in S1. The magnitudes of the k-means and SOM centers roughly reflect the probabilities of observing a response in variables v1-v3 in all three clusters, showing that k-means and SOM recover the clusters better than the ROCK algorithm. Figure 7b. Cluster center magnitudes (mean coordinates) by variable and clustering solution in S2. The magnitudes of the k-means and SOM centers roughly reflect the probabilities of observing a response in variables v1-v6 in all three clusters, showing that k-means and SOM recover the clusters better than the ROCK algorithm. The exploration into S1 and S2 shows that SOM and k-means both performed better than the ROCK algorithm in terms of true cluster recovery, fit (as measured by 0% 25% 50% 75% 100% TRUE SOM k-means ROCK TRUE SOM k-means ROCK TRUE SOM k-means ROCK Cluster 1 Cluster 2 Cluster 3 Pr(V1=1) Pr(V2=1) Pr(V3=1) 0% 25% 50% 75% 100% TRUE SOM k-means ROCK TRUE SOM k-means ROCK TRUE SOM k-means ROCK Cluster 1 Cluster 2 Cluster 3 Pr(V1=1) Pr(V2=1) Pr(V3=1) Pr(V4=1) Pr(V5=1) Pr(V6=1) 82 DBI), and agreement with each other. SOM and k-means each recovered the true clusters equally well in both solutions, improving the true cluster recovery in S2 when more variables were presented. In S1, SOM found a better-fitting solution, whereas k-means found a slightly better solution in S2; however, the fit improvement was not substantial— the SOM solution would probably have the same practical suitability and utility for S2. In general, the improvement in true cluster recovery and fit from S1 to S2 across all three algorithms suggests that each will perform better with more variables. Performance of the Clustering Algorithms for Datasets M1 and M2 The data in S1 and S2 were created with a known number of clusters, but there is no information about how many clusters are in M1 (the real GBA data from the Number Jumble game). Before comparing how well the clustering algorithms recover the ten simulated Number Jumble responses added to M1 to create M2, one must determine how many clusters are in M1 to begin with. As discussed in Chapter 2, identifying the number of potential clusters is a challenge, and there is no agreement about the best way to identify the number of clusters in a dataset (Berkhin, 2006; Kriegel et al., 2009; Xu et al., 2013). One simple method is to compare the within sum-of-squares values for clustering solutions over varying numbers of k clusters. One can then plot these values and look for a sharp elbow in the chart, depicting the point at which additional clusters do not significantly lower the within sum-of-squares for the clustering solutions. Unfortunately, doing this with M1 yields no such elbow, suggesting that M1 may be a single cluster solution. The mclust package in R provides a method for estimating the number of clusters by reporting a Bayesian Information Criterion (BIC) value for different values of 83 k clusters (Fraley & Raftery, 2007). Unfortunately, the package only uses hierarchical clustering that relies on linkages between cases, not a density partitioning method like k- means or SOM, so its recommendations might not be accurate for the algorithms used in this feasibility study. Furthermore, the BIC values are estimated under the assumption that the data are from a multivariate Gaussian distribution (mclust varies parameter assumptions for these distributions), but it is safe to assume that the binary data in M1 would not follow a Gaussian distribution. Even so, using mclust with the M1 dataset returns results that consistently suggest a one cluster solution across all of their Gaussian distribution parameter combinations. For large sets of binary data , Newell, Cook, Hofmann, and Jannink (2013) recommend comparing Hubert’s gamma statistic across varying levels of k clusters, where the maximum gamma value indicates the number of clusters to seek out in the data (Halkidi, Batistakis, & Vazirgiannis, 2001). The fpc package in R calculates Hubert’s gamma; however, it does not evaluate whether or not there is a single cluster in the data (Hennig, 2015). Furthermore, the algorithm that calculates the gamma statistic does so by splitting the data into subsamples. Hennig and Liao (2011) recommended using subsamples with at least 200 cases in each. The sample size for M1 (n = 109) cannot support such large subsamples, and this becomes evident when trying to use the fpc package with the M1 dataset. The fpc package failed to reach convergence for the gamma statistic for k > 4 clusters, but showed declining gamma values for both SOM and k-means solutions as k increased from 2 to 4 clusters. In the absence of a gamma statistic at k=1, one can infer from these declining gamma values that M1 may have either 1 or 2 clusters, but not 3 or 4. 84 Given the results from comparing within sum-of-squares, BIC, and Hubert’s gamma over varying numbers of clusters, it seems that M1 may include either one or two clusters in the Number Jumble data. Classical test theory item difficulty statistics show that the percentage of correct responses for the individual multiplication products in the Number Jumble game ranges from 49%-67%, meaning that about 1/3 to 1/2 of the moves for a given tile in the game are placed incorrectly, but ultimately most people do correctly place most of the tiles in this Number Jumble level. A single cluster solution would be interpreted to mean that most students made some mistakes, but that those mistakes were not systematic across the sample, and nearly every student was able to correctly identify the majority (if not all) of the multiplication products in the game level. However, exploratory investigation of a two cluster solution using cluster assignments from both SOM and k-means shows interpretable results. In both solutions, students in cluster 2 identified the correct multiplication products at a higher rate than students in cluster 1, and they also had fewer “pick up and release” moves where they replace a tile at its starting point instead of picking a product from the grid, suggesting that students in Cluster 2 were somewhat more confident in their moves in the game. These differences may not provide enough discrimination to support a two cluster solution, but the interpretability is noteworthy, so a two cluster solution is at least worth exploring. As a result of these findings, M1 was analyzed with a two cluster solution across the three algorithms (SOM, k-means, ROCK). An analysis of a one cluster solution is not needed, since there is no comparison to be made—all three algorithms would assign all results to the single cluster, and DBI cannot be calculated with k = 1. Next, with its 85 additional ten simulated cases, M2 was analyzed with a two and three cluster solution to see if the ten simulated cases were recovered consistently in the same cluster. Tables 12 shows the consistency between algorithms in M1 when there are k = 2 clusters. As was the case in the the results from the simulated datasets, SOM and k-means solutions have higher agreement with each other than they do with the solution found with the ROCK algorithm. In fact, Table 12 shows that SOM and k-means have identical solutions in the M1 dataset (Kappa = 1.00). As was seen with the simulated data, the ROCK algorithm again has a tendency to group most of the cases into a single, large cluster. Table 12 M1: Cluster Agreement Between Algorithms in a Two Cluster Solution k-means ROCK SOM Cluster 1 2 1 2 1 2 S O M 1 66 0 k- m ea n s 1 65 1 R O C K 1 65 43 2 0 43 2 43 0 2 1 0 Kappa 1.00 -0.02 -0.02 % Agreement 100% 60% 60% Table 13 shows that this consistency between SOM and k-means dissipates in M2 if only two clusters are specified (percent agreement = 72%, Kappa = 0.33), and ROCK continues to group most cases into one large cluster. Unless the two different solutions are aligning with two different goals or interpretations, this lack of agreement between SOM and k-means when k = 2 could be a validity challenge in practice, perhaps suggesting that one or both solutions are statistically unstable. The SOM solution seems to suggest that there was primarily only one large cluster to begin with in M1, and the addition of the simulated data in M2 86 created a new cluster, to which three of the original cases were also assigned. The k- means solution presents a different interpretation, suggesting that there were indeed two clusters in M1, and the simulated cases are similar enough to 33 other cases that they should be included in one of the original clusters. Table 13 M2: Cluster Agreement Between Algorithms in a Two Cluster Solution k-means ROCK SOM Cluster 1 2 1 2 1 2 S O M 1 13 0 k- m ea n s 1 1 45 R O C K 1 0 1 2 33 73 2 0 73 2 13 105 Kappa 0.33 0.03 -0.02 % Agreement 72% 62% 88% Table 14 shows how the three algorithms compare in M2 when the number of clusters is increased to three. SOM and k-means snap back to the perfect alignment in clusters 1 and 2 that was seen when M1 was analyzed with a two cluster solution. All ten simulated cases are recovered in cluster 3, along with three of the real cases. Table 15 summarizes the DBI fit values for the three algorithms’ solutions for M1 and M2 for the varying value of k (recall that DBI cannot be calculated for k = 1). Surprisingly, ROCK was consistently the best-fitting solution. Recall that ROCK achieved this fit by grouping all but one response into a single cluster, thus exemplifying Kriegel et al’s (2009) warning that the best-fitting solution is not necessarily the most useful or interpretable. 87 Table 14 M2: Cluster Agreement Between Algorithms in a Three Cluster Solution k-means ROCK SOM Cluster 1 2 3 1 2 3 1 2 3 S O M 1 64 2 0 k- m ea n s 1 62 1 1 R O C K 1 62 42 13 2 0 42 0 2 42 0 0 2 1 0 0 3 0 0 13 3 13 0 0 3 1 0 0 Kappa 1.00 -0.02 -0.02 % Agreement 100% 52% 52% Table 15 also shows that SOM and k-means had the same DBI value (3.02) for M1 with k = 2 because they created the same solution, but SOM had better fit in M2 for k =2 and k = 3 (DBI = 1.86 and 1.82 respectively). Recall that when k = 3, cluster assignments for SOM and k-means in M2 were identical. The difference in DBI shows that SOM achieved this by selecting better fitting cluster centers, even though both algorithms yielded the same cluster assignments. Table 15 DBI Results by Solution for M1 and M2 Algorithm DBI for M1 DBI for M2 k = 1 k = 2 k = 2 k = 3 SOM ___ 3.02 1.86 1.82 k-means ___ 3.02 3.02 2.99 ROCK ___ 0.80 0.84 0.67 Discussion of the Feasibility Study The simple feasibility study offered a glimpse into how the SOM, k-means, and ROCK algorithms perform under different data scenarios, similar to those found in GBA response data. One can evaluate the findings by considering the algorithms’ abilities to 88 recover true clusters (as measured by Kappa) and the fit of their solutions (as measured by DBI). In S1 and S2, SOM and k-means were able to make good to near-perfect recoveries of the true clusters, as represented by their Kappa values (Landis & Koch, 1977). SOM and k-means performed similarly in S1 and S2 with the cleaner, completely simulated data. SOM and k-means created the best fitting solutions for S1 and S2, and since their solutions were so similar, there were not vast differences in their DBI values for the S1 and S2 solutions. ROCK tended to lump many cases together into one large cluster in S1 and S2 and yielded poorer DBI results for fit. For M1, SOM and k-means yielded identical clustering solutions when k = 2 clusters. When the simulated cases were added to create M2, SOM and k-means diverged, with k-means creating similar partitions as were found in M1 and SOM isolating the simulated cases (along with three others) at the expense of combining all other cases into a single cluster. When k was increased to 3 for M2, SOM and k-means realigned, matching their two clusters from M1, and correctly clustering the ten simulated cases along with three of the real cases, although SOM’s centers yielded a better DBI value. Inspection of those three cases showed that they each had observations for one or two of the three variables used to define the simulated cases: variables 2.1.1, 3.2.1, and 6.3.3. This suggests that these students may have made the addition instead of multiplication error that the simulated cases were meant to mimic. The results of the feasibility study show that k-means and SOM may yield similar solutions in some scenarios, in which case they can be used in tandem to validate each other’s solutions; however, there are situations like M2 where SOM outperforms k-means 89 in terms of fit (DBI). There may conceivably be cases where the reverse is true. The results from the Number Jumble data suggest that M1 and M2 may in fact have two and three clusters, respectively, and if the researcher specifies the incorrect number of clusters (e.g., k =2 for M2), SOM and k-means may arrive at drastically different results. ROCK performed very poorly in the feasibility study, but this may be a factor of the high sparseness of the data and the low number of clusters: with so many zeroes many cases share similarities, which then create a large number of links between all responses, thus amalgamating all of the results into a single cluster when a low number of clusters (two or three) is desired. ROCK did produce better DBI values in the M1 and M2 solutions, but this may be because the ROCK solution had clusters where n = 1, which would lead to zero within cluster variance for those clusters, which in turn might lower the DBI. Guha at al. (2014) and Song and Li (2006) have demonstrated that ROCK recovers known clusters well in higher dimensional binary datasets (16-548 variables), even when there are a low number of clusters, and Song and Li showed that ROCK’s clusters were as good as or better than k-means clusters when creating sets of news articles that had been marked with binary topic tags. Despite the findings of the feasibility study, ROCK may still be a suitable algorithm when working with more variables. 90 Chapter 4: Methods for Full Study The full study leverages the findings of the feasibility study to compare the performance of clustering algorithms in data that reflect common GBA response data; i.e., high-dimensional, sparse, binary data. The study seeks to evaluate how well SOM, k- means, and ROCK algorithms identify true latent clusters under varying conditions representing aspects of gameplay, including game state, autonomy, and relatedness. These conditions are explored by creating datasets which vary the numbers of clusters in the data, the number of cases in a cluster (cluster size), the probability of observing responses for some relevant variables in a cluster, and the amount of overlap between the variables that are relevant to defining the clusters. The study seeks to illustrate the conditions under which recovery of true clusters yields Kappa values greater than 0.80 for a variety of GBA data scenarios—0.80 Kappa representing a commonly used threshold for “near perfect agreement” (Landis & Koch, 1977). With this goal in mind, the study addresses the following research questions: 1. Given the correct number of true clusters, k, what are the magnitudes of the differences in mean Kappa values for the SOM, k-means, and ROCK clustering solutions across multiple datasets as: a. the number of clusters increases? b. the number of cases in a cluster increases? c. the number of cases in clusters are similar? d. the probability of observing responses for relevant variables in a cluster increases? 91 e. the amount of overlap between the variables that are relevant to defining the clusters increases? 2. Under what combinations of these conditions is the true cluster recovery poor (Kappa distributions generally below 0.80) for each algorithm, even when the researcher has correctly specified the number of clusters? 3. Under what combinations of these conditions are the SOM, k-means, and ROCK solutions expected to agree with each other (Kappa ≥ 0.80) so that they can be used to validate each other? 4. When varying the number of clusters specified for an algorithm, does the best- fitting solution for the SOM, k-means, and ROCK algorithms (as measured by DBI) match the correct number of clusters in the data? By answering these questions, this study helps researchers think about how to design GBAs to encourage response data that will be suitable for their desired accuracy of true cluster recovery, potentially by avoiding game designs that can result in data structures that yield poorly separated clusters of gameplay. For example, if overlapping relevant variables make it difficult to identify clusters, GBA designers may design the game with less autonomy to create less overlap in gameplay, or they may choose to omit variables that are observed in many students’ response data. For existing GBAs, this study provides guidance about the suitability of these algorithms for classifying student responses. Recall that these clustering algorithms do not help the researcher identify the number of clusters—a process that typically involves researchers comparing the interpretability of solutions across varying values of k—but this study demonstrates under which conditions these algorithms fail to recover true 92 clusters at an acceptable rate, even when the researchers have specified the correct number of clusters. The study also shows if better fitting solutions are found when k is misspecified and does not match the true number of clusters. This study requires simulated data because these conditions cannot be easily generated in a real GBA scenario. Given a real GBA, there would be no way to ensure students were creating a desired number of response clusters without actually coaching them through every step of their gameplay. Crossing all five conditions from the first research question would be nearly impossible, and there would be logistical issues of getting students to participate in all of these variations. The time commitment and cost of recruiting such a sample would be prohibitive. A simulation allows for targeted manipulation of the response data conditions needed to address the research questions. To help illustrate how the guidance from the study output relates to a real world application, the SOM, k-means, and ROCK algorithms have also been used to cluster data from the game Beanstalk (Carnegie Mellon, 2013). This example is intended to serve as a tutorial, discussing not just the specifics of the analyses, but also the step-by-step process of picking algorithms, coding log file data, comparing solutions, validating results, and interpreting the outcomes. The following sections discuss the design of the simulation, the study’s output variables, and the tutorial structure in more detail. Simulation Design Summary The study simulates responses from a GBA in which each cluster is differentiated by having a higher probability of observed responses for certain variables in the dataset. Data are generated and analyzed 400 times for each cell in the study, meaning that each cell has 400 replications, each with hundreds or thousands of simulated response patterns, 93 depending on the number of clusters and cluster sample sizes specified for the cell. Since each cell has 400 datasets, 400 values for Kappa and DBI can be calculated in a given cell for each clustering algorithm. This sample size ensures more than 80% power to detect small effects between the average Kappa values and average DBI values for any two algorithms’ solutions within a cell or between two cells. 400 datasets is also a multiple of 100, which helps simplify communication of the percentage of solutions that have certain characteristics (e.g., Kappa greater than 0.80). The study varies four aspects of the data. The first three aspects are directly manipulated, varying the number of clusters (k), the number of cases in a cluster (cluster size; Nk), and the probability of observing responses in a cluster. These conditions create a 48-cell design (4×4×3) such that each dataset will have k × Nk cases (observed work process response patterns) and k × 20 response variables. The fourth aspect, the amount of overlap between the relevant variables defining the clusters, also varies by allowing the 20 relevant variables for each cluster to be randomly selected, thus varying the between cluster variability. The following sections describe, in detail, the data generation, the parameters manipulated, the rationale for the specific parameter values, and the methods used in the simulation and the tutorial study using game data from Beanstalk (Carnegie Mellon, 2013; Koedinger et al., 2010). Data Generation Each of the 400 datasets in each cell of the study were generated using a latent class analysis (LCA) data simulator in the poLCA package in R (Linzer & Lewis, 2011). LCA models predict the probability of a student belonging to a given group, given the student’s response pattern and the probabilities of responses on each variable for a given 94 group (cluster). Similarly, when generating simulated data in the poLCA package, one must specify the number of clusters, the number of variables, the probability of being in a given cluster, the total sample size, and the probability of an observed response on each variable for each cluster. For this simulation, the poLCA data simulator was set to have k clusters, depending on the cell parameters. Each cluster is distinguished by simulated student responses on 20 variables in which there is a higher probability of a response for cases in that cluster. These 20 variables are considered “relevant” to the identification of the cluster, but some of these variables may also be used in the identification of multiple clusters (i.e., the same variable may be relevant in the identification of several clusters). To create these variables, an initial pool of k × 20 variables was created. Each cluster had 20 variables randomly selected from the complete set of variables (without replacement) as their relevant variables (e.g., for four clusters, there are 4 × 20 = 80 total variables in the dataset, and each cluster has 20 of those 80 variables randomly assigned to be the clusters’ relevant variables). The probability for a response on each cluster’s 20 relevant variables was set to 0.30, 0.60, or 0.90, depending on the cell. The remaining variables were set to have a probability of 0.05. Each cluster’s 20 variables were picked independently, meaning that clusters may (and are expected to) share some of the same relevant variables. Thus, two clusters may each have a high probability of a response on the same variable. The sample sizes for each of the k clusters was picked from a normal distribution with a mean of 50 or 100 and a variance of 9 or 100 using the rnorm function in R (R Core Team, 2014). The total sample size is simply the sum of the cluster sample sizes. 95 For the generation of the simulated data, the probability of a respondent belonging to a given cluster is simply the cluster’s sample size divided by the total sample size: Nk/N. For example, if a cluster has a sample size of 50, and there are 200 cases in the dataset, then the probability of a respondent being in that cluster is 0.25. Number of Clusters for the Simulation Having more clusters in the dataset represents more autonomy in the game; i.e., there is more variation in the possible response patterns. In theory, a higher autonomy game can yield a higher variety of response patterns, which could correspond to more clusters of interpretable work processes. For example, the earlier feasibility study of the Number Jumble data suggested that there may be only two clusters of work processes in the data, but Number Jumble is a low autonomy game. The game Save Patch has moderately higher autonomy than Number Jumble, and in Kerr et al.’s (2011) fuzzy cluster analysis of the Save Patch response data, they determined each Save Patch level had seven or fewer clusters. They determined the number of clusters by increasing k until a cluster existed that contained only correct response patterns (Kerr & Chung, 2012). In a biology game with higher autonomy than Save Patch, Stevens and Casillas (2006) identified 36 clusters of responses using the SOM algorithm. In a similar study of brain state patterns of submarine pilot teams in a simulation (not a GBA), Stevens et al. (2012) identified 25 clusters of team brain state patterns, although these were inputs into a larger student model, and the clusters were not used individually to support inferences. For the simulation conditions, there are four initial possibilities for the number of clusters (k) in each dataset: 4, 10, 20, and 35 clusters. These choices roughly mirror what has been observed in these other studies: ranging between a handful of distinct work 96 process clusters up to a few dozen. The highest number of clusters (k = 35) illustrates how these clustering algorithms might perform in an extremely high autonomy GBA. This is a reasonable inquiry, as some researchers have begun wrapping assessments around existing video games that have complex gameplay (e.g., Corrigan et al., 2014). Number of Cases in the Simulated Clusters In some GBAs, some work processes may be less common than others, and this will manifest in clusters with fewer cases than others. Smaller clusters may represent unexpected response patterns, and small clusters are a likely possibility in GBAs with high autonomy or where game state dependencies might limit the prevalence of certain response patterns. Of course, the number of cases in a cluster will also depend largely on how many students play the GBA, so for this simulation, the average cluster size and the variability in cluster size were both manipulated. Varying the average number of response patterns (cases) in the simulated clusters helps identify whether certain clustering algorithms perform better as cluster size increases (i.e., does an algorithm recover true clusters better when clusters are very large?). Allowing for variability in cluster size within a dataset also demonstrates whether certain algorithms perform better in scenarios where some work processes are less common than others. This is of particular interest because variability in cluster size is likely common in GBAs; however, the k-means algorithm assumes that all clusters are roughly the same size (SOM and ROCK do not have this assumption). Neither Kerr and Chung (2012) nor Stevens and Casillas (2006) reported how many cases were in each individual cluster, but an estimate can be made from the aggregations provided in their papers. For example, in a level with 900 total response 97 patterns, Kerr and Chung (2012) reported that three clusters represented 68% of those cases, thus one can infer that those three clusters had an average size of 204 response patterns. Similarly, given that Stevens and Casillas (2006) had 1,710 total responses and 36 clusters, the average cluster size was about 48 response patterns. These studies provided context for the choice of clusters sizes in this simulation study. Each cell of the study has either smaller clusters defined as having an average cluster size of 50 cases, or they have larger clusters with an average size of 100 cases. It is expected that these results will generalize to larger cluster sizes, as the clustering algorithms are expected to perform better when clusters have even more cases. The variance of the cluster sizes is also manipulated in the study. Clusters either have a similar size (σ2 = 9 cases) or have variability in cluster sizes (σ2 = 100 cases). This implies that when clusters are similarly sized, clusters will not differ in size by more than 18 cases, but when they vary in size, they may differ in size by as much as 60 cases. The study results are then able to show whether clustering algorithms’ recovery of true clusters changes as average cluster size goes up and when there is more variability between cluster sizes. Data Type and Response Probabilities in Simulated Clusters The work process data consist solely of binary data. For brevity, a value of 1 is called a response, but it could represent any dichotomous variable of interest in a game, such as an observed student action, a set of actions, the game state, or activation of a game connection. Binary data were used to represent structured-logging of data, meaning that game designers or researchers have tagged log file data with labels or codes that can be represented as binary indicators (Kerr & Chung, 2012). This is a common way to 98 organize GBA data to make it more manageable for analysis, which is why binary data were chosen for the simulation, and it is worth the reminder that the ROCK algorithm can only work with binary data. Nevertheless, it should be noted that GBA data do not need to be coded into binary data. Structurally logged data can also contain continuous measures (e.g., representing elapsed time, sequences of steps), and it is conceivable that some GBA work processes could be analyzed directly from the log file data without any additional tagging. The probability of observing responses within a cluster manipulates the within- cluster variability, which helps determine if certain algorithms are better at clustering results that have some dissimilarity within a cluster. This aspect of the data also relates to autonomy (freedom in gameplay), in that students may be engaging in the same general work process, but the game design may allow for some flexibility or alternatives, even though the overall approach of the work process is the same. Of the studies reviewed here, only Stevens and Casillas (2006) touched on the relevancy of certain variables in defining a pattern. In sample diagrams, they showed bar charts from two of their 36 clusters, illustrating the selection frequency for each variable. For example, one of their clusters showed that the response patterns nearly always included observed responses from five of the variables in the game and almost never had responses on many of the other variables. Similarly, for this simulation clusters were created by having a higher probability of response in a subset of 20 variables in each simulated dataset. To manipulate within-cluster variability, clusters were generated with low, moderate, and high probabilities of an observed response within their 20 relevant variables, as discussed in the Data Generation section. For the other variables not 99 manipulated in the definition of a cluster, the response probability was 0.05, meaning cases in a cluster have a very low (but not zero) probability of an observed response on the remaining non-relevant variables, thus creating a high probability of not observing responses in these variables and ensuring that they were largely dissimilar to the cluster’s 20 relevant variables. This also means that this 0.05 probability of a response on non- relevant variables was shared across many clusters, thus making any given variable extremely redundant and irrelevant for most clusters’ interpretations unless it is assigned to a cluster in which it has a high probability of a response. One limitation of the simulation study is that there is no theory supporting the levels of the probability of a correct response used to create the simulated data. Using values of 0.30, 0.60, and 0.90 provides a wide range of probabilities which helps show how the clustering algorithms perform when there is higher within-cluster variability, but there remains the question of whether a 0.30 response probability actually is relevant to interpreting a cluster in practice. Stevens and Casillas (2006) looked at response frequencies by variable when evaluating the interpretability of their clusters, suggesting—at least for them—that an inconsistent presence of responses within a cluster might be grounds for discarding the cluster solution. In practice, researchers may be more interested in clusters where cases have high similarity or where there are correlations between the responses across multiple variables in a cluster, but these variations have been included to show how the algorithms perform as within-cluster variability changes. Amount of Overlap Between Clusters The previous section described the probability of observed responses in the variables that were considered relevant for differentiating one cluster from the others. 100 Allowing clusters to share these defining variables decreases the between-cluster variability because multiple clusters had high probabilities of responses on a set of overlapping variables, which is a likely scenario in practice. Stevens and Casillas (2006) showed that response probabilities for some variables were very similar in some (but not all) clusters, suggesting that those overlapping variables were important for the definition and interpretation of those subsets of clusters; however, overlap decreases between- cluster variability, which can make it more difficult to identify clusters. Artificially creating overlap between clusters in a structured manner is a difficult task. One must consider whether a cluster shares relevant variables with multiple clusters, and if so, how many variables? Is a variable shared with just one other cluster, or many? Is the degree of overlap the same between all clusters, or is one cluster allowed to be completely unique? Rather than trying to control these many aspects of overlap, it is simpler to let the clusters share relevant variables randomly. Each cluster had 20 variables randomly selected to have a higher response probability, and there is the expectation that each cluster would likely share at least one relevant variable with each of the other clusters. This creates a degree of overlap or similarity between the clusters—the more overlap, the less between-cluster variance there could be. Additionally, overlapping relevant variables were more prominent when there were fewer clusters and thus fewer variables. This is an important condition to simulate because it allows for the fact that many interpretively- distinct response patterns may share some features in log data; i.e., students across clusters likely share some of the same gameplay steps. It also depicts that the potential for that overlap decreases in games where there is more autonomy and thus the potential for 101 more distinct clusters, as well as more variables that can be included in the analysis. Allowing the clusters to randomly share some relevant variables provides a view into whether clustering algorithms work better as between-cluster variance increases. Allowing clusters to overlap in their relevant variables helps illustrate how GBA data may be impacted by GBA design. When a subset of clusters has higher response probabilities in a shared set of variables, this can be representative of game state. For example, a specific initial work process or activation of a game element might be needed before a secondary work process can occur. If multiple clusters share that initial work process response pattern, then they might all be seen as co-dependent on those defining variables. Returning to the soccer example from the Rules section in Chapter 2, two clusters might share the defining variable of having the ball go into the goal, yet differ in the variables representing whether or not that occurred before or after the game clock ended (game state). Similar situations might occur in scenarios where a researcher tracks the order of actions; e.g. two clusters share the same variables representing actions in students’ work processes, but they have differences in the variables representing the sequence in which those actions were conducted. When all clusters have similar response probabilities in a shared set of variables, then this can be viewed as a representation of irrelevant variables, which might occur with the addition of relatedness elements in a GBA. For example, if bonus points are awarded to students who provide responses quicker in a game, even though speeded response is not a feature of the construct of interest, then the frequency of observed bonus points in the response data may not be a relevant variable in the definition of the clusters; i.e., in each cluster, some people got bonus points, and some did not. The bonus points 102 are irrelevant for interpretation of any cluster, which in turn decreases the between- cluster variability across all of the clusters. Number of Response Variables in the Simulation Interpretation of a student’s work process is largely dependent on how his or her response pattern differs from other students’ response patterns, so an interpretable cluster might have a unique probability of an observed response on a set of variables considered “relevant” to identifying that cluster. As mentioned above, each cluster had 20 relevant variables that have a higher probability of an observed response. The choice of 20 relevant variables per cluster is arbitrary and represents a limitation of the study. In practice, the number of variables that might be considered relevant to identifying a work process will likely vary between clusters and be heavily dependent on a GBA’s design and the researcher’s informed judgment. Furthermore, the number of variables can correspond to every action a student can make in the game, but it can be extended to include factors like the game state and sequence. The researcher makes decisions about which of these factors are relevant for clustering student responses. Nevertheless, having 20 variables that all have a higher probability of a response for a cluster can be considered a fairly complex game, and it is expected that the algorithms’ performance in this simulation would generalize (or perhaps even improve) for GBAs with even more relevant variables for each cluster. In the interest of keeping the number of varying simulation conditions to a minimum, the number of relevant variables is fixed at 20 variables for all clusters in the study. Despite this one limitation, 20 relevant variables for each cluster stills result in a large number of total variables because the number of clusters (k) varies, and the total 103 number of variables was set to k × 20. This creates enough total response variables so that there is a reasonable level of overlap in the 20 relevant variables in each cluster. Fewer total variables would increase the overlap, potentially to the point that there would be few distinguishing features of response patterns in different clusters and making many variables irrelevant due to having the same probability of a response across all clusters. Having more response variables would dilute the data with variables with equally low response rates across all clusters—essentially adding irrelevant variables. Table 16 shows the total number of GBA variables that exist in each simulated dataset by the number of clusters. Table 16 Total Number of Variables by Number of Clusters k Total Number of Variables 4 80 variables 10 200 variables 20 400 variables 35 700 variables The resulting range of total variables shown in Table 16 correspond to a reasonable range of the number of variables that might be used in a GBA cluster analysis. This in turn provides some justification for the choice of 20 relevant variables for each cluster. For example, Kerr and Chung (2012) clustered responses using only observed student actions from the Save Patch game and reported that their more complex, difficult game levels had 100-200 variables used in the cluster analysis. Kerr and Chung’s game with 200 variables may seem high, but higher dimensionality may exist for a deceivingly simple game. Stevens and Casillas’ (2006) biology game identified clusters out of 2,862 variables. Mislevy et al. (2014) noted that game mechanics go beyond just the tools and 104 buttons that appear onscreen. Game mechanics encompass every possible action that students can take using those tools, so it would not be uncommon for GBA researchers to be working with hundreds of variables. Given the large number of total variables that result from having 20 relevant variables for each cluster, there is an expectation for improved performance of the ROCK algorithm, as compared to its poor performance in the feasibility study. Recall too that game autonomy (the amount of freedom of actions a student has in the GBA) will increase the number of variables that might be available for analysis, but as autonomy increases, the results data will tend to be very sparse (Kerr & Chung, 2012). Stevens and Casillas’ (2006) game may have had 2,862 variables, but some of the most efficient strategies could solve the game with fewer than ten of these variables. Thus the varying number of clusters and the corresponding changes to the total number of sparse variables in each dataset can be seen as a realistic representation of varying game autonomy. Methods and Outcome Measures Table 17 shows how the 48 cells of the simulation study are organized using the conditions discussed in the preceding sections. Recall that 400 datasets were generated in each cell, and each dataset had roughly k × Nk cases representing simulated student responses. To address the first three research questions and discover how mean Kappa values differ between cells and when algorithms are expected to return Kappa values of 0.80 or higher, the 400 datasets in each of the 48 cells were evaluated with the SOM, k-means, and ROCK algorithms. The algorithms were instructed to identify k clusters in the data, where k is the known, true number of clusters in the data. Recall that researchers typically 105 choose the number of clusters based on the interpretability of clustering solutions that vary by k (e.g., Stevens & Casillas, 2006; Kriegel et al., 2009; Kerr & Chung, 2012), so by specifying the true k in the simulation study, these results demonstrate whether the algorithms perform poorly even when the researcher has the correct number of clusters. Table 17 Simulation Design of Four Varied Parameters Resulting in 48 Cells k = 4 clusters k = 10 clusters Cluster Sizes Nk Cluster Sizes Nk Pr(1) ~N(50,9) ~N(50,100) ~N(100,9) ~N(100,100) Pr(1) ~N(50,9) ~N(50,100) ~N(100,9) ~N(100,100) 0.30 0.30 0.60 0.60 0.90 0.90 k = 20 clusters k = 35 clusters Cluster Sizes Nk Cluster Sizes Nk Pr(1) ~N(50,9) ~N(50,100) ~N(100,9) ~N(100,100) Pr(1) ~N(50,9) ~N(50,100) ~N(100,9) ~N(100,100) 0.30 0.30 0.60 0.60 0.90 0.90 Note: Pr(1) is the probability of an observed response on the 20 relevant variables randomly assigned to each cluster. The k-means algorithm was set to run through 2,000 cycles, thus allowing it to pick a solution that minimizes the within sum-of-squares distances for the clusters. Similarly, each dataset was presented to the SOM algorithm 2,000 times, allowing the SOM algorithm to iteratively update the cluster centers with each presentation of the dataset. The ROCK algorithm was run with a theta value of 0.95, but if it could not 106 converge on k clusters, theta was lowered in 0.05 increments until the ROCK algorithm could produce exactly k clusters. The distance measure was used for all three algorithms was Euclidean distance. This was partly a necessity for comparisons—the SOM algorithm in the kohonen package in R currently only permits Euclidean distance (Wehrens & Buydens, 2007). The R packages for ROCK and k-means allow for other distance measures, but they default to Euclidean distance. When using binary data, Euclidean distance can be a problem for describing differences between cases because there is the potential for ties. For example, (0,1) and (1,0) are equidistant from (0,0) and (1,1). This becomes less of a concern when there are more variables in the data, which is one reason why it is beneficial that many of the cell conditions yield several hundred variables for each simulated dataset. For each cell of the simulation, Cohen’s Kappa was used to evaluate the true cluster recovery rate for all three algorithms, thus showing how well each algorithm was able to correctly group cases from each cluster. These statistics were used for comparisons of correct classification rates between the three algorithms under the varying conditions and parameters of the simulation. Since each cell will have 400 datasets, the distribution of Kappa values can be shown for each algorithm in each cell. Each dataset also has a between-cluster variability measure of overlap calculated using the root mean square deviation (RMSD) of the response probability vectors for each cluster. In other words, each cluster’s vector of response probabilities for all variables were compared to the overall average probability of someone having a response on a variable. This is comparable to measuring the deviation of clusters’ expected centers from the global mean. When the RMSD is smaller (within a cell), it indicates more 107 overlap between the clusters’ relevant variables; i.e., less between-cluster variability. Conversely, when the RMSD is larger within a cell, it indicates higher between-cluster variability. RMSD values cannot be meaningfully compared between cells since their magnitudes depend on the probabilities of responses and the numbers of clusters. RMSD is calculated using the observed cluster centers and the standard RMSD formula: √ ∑ (𝒎𝑐 − ?̅?)2 𝑘 𝑐=1 𝑘 (10) where mc is the center coordinates of cluster c and ?̅? is the global mean of all k clusters’ center coordinates. The first research question seeks to determine if there are differences in mean Kappa values for the three algorithms under the varying conditions of the simulations (when k is correctly specified). The first research question was investigated by plotting mean Kappa values across manipulated conditions in the simulation, and correlations between RMSD values and Kappa within a cell show if higher between-cluster variability makes it easier to recover true cluster assignments. The second research questions was investigated by plotting distributions of Kappa values for each cell to illustrate the conditions under which each algorithm is expected to achieve Kappa values of at least 0.80. These distributional plots also address the third research question by showing when two algorithms fail to consistently yield Kappa values of 0.80 in a cell, meaning that their solutions would likely not validate each other under similar conditions in practice. To address the fourth research question about whether the algorithms’ best fitting solutions match the true number of clusters, the algorithms were run with a variety of k values ranging from kt - 3 to kt + 3, where kt is the true number of clusters. The 108 distribution of DBI was plotted for each number of clusters for all three algorithms across the cells of the simulation study, categorized by whether the best-fitting solution overestimated, underestimates, or accurately identified the correct number of clusters. This illustrates whether the best-fitting solution for each algorithm reflects the true number of clusters used to simulate the 400 datasets in each cell. Kappa and percentage correct were not calculated for the clustering solutions when k was misspecified since the algorithms were not expected to correctly recover cluster memberships when the wrong number of clusters was provided. The goal of comparing the fit at varying specifications of k in the fourth research question is to see if any algorithms routinely produce better-fitting solutions with a misspecified value of k or if DBI consistently tends to overestimate or underestimate the true number of clusters. Note that there are multiple methods for determining the number of clusters beyond DBI, but it is outside the scope of this study to compare them or to see if any are suitable methods for identifying the true number of clusters. Furthermore, some GBA researchers appear to prefer selecting the number of clusters based on interpretability rather than fit (e.g., Kerr et al., 2011; Stevens & Casillas, 2006). That being said, DBI was selected as the fit measure for comparisons in the fourth research question because it has been shown to be one of the best algorithms for identifying the number of clusters in binary data (Dimitriadou, Dolnicar, & Weingessel, 2002). DBI also has the added benefit of allowing for fit comparisons between algorithms’ solutions. Tutorial Example The simulation study described previously serves to examine how the three algorithms perform under a variety for a variety of data scenarios, but the tutorial study is 109 intended to serve primarily as an example and tutorial for how GBA researchers should go about identifying and comparing clustering solutions as part of the validation of clustering algorithms as measurement models in the CFA. This tutorial includes examples of how to compare alignment between solutions from competing algorithms, select the potential number of clusters, compare fit, and evaluate differences in solutions in the context of the data and the algorithms themselves. For the tutorial study, the k-means, SOM, and ROCK algorithms were used to classify student response patterns from the different levels in a mathematics game called Beanstalk (Carnegie Mellon, 2013). The tutorial study uses anonymized students’ log file data from the 'Beanstalk_2013_05' dataset accessed via DataShop (Koedinger et al., 2010). The tutorial discusses the considerations in selecting algorithms and coding variables prior to analysis. For each Beanstalk game level, clustering solutions were calculated for varying numbers of k clusters, and DBI values were calculated for each solution. The retained clustering solution for each game level was selected based on both fit (as measured with DBI) and agreement between algorithms (as measured with Cohen’s Kappa). The solutions were validated against a holdout sample and an external criterion—in this case, indicators of whether or not the student solved the level correctly. Finally, the tutorial discusses the process of labeling and grouping clusters for interpretation, as well as using changes in cluster membership to visualize student progress across the game. To summarize, the full study consists of two parts: the simulation and the tutorial example. The simulation generated datasets with a known true number of clusters under varying conditions of the number of clusters, the size of the clusters, the probability of 110 observing responses in relevant variables in clusters, and the amount of overlap between clusters’ relevant variables. These datasets were analyzed with the k-means, SOM, and ROCK algorithms to see how often they are able to adequately recover the true clusters, as determined with Kappa measures of agreement between cluster solutions and true cluster identifiers. The simulation study then used the same conditions but then used DBI to evaluate the number of clusters associated with the best-fitting solution (e.g., if the true number of clusters is 10, is an algorithm’s best-fitting clustering solution also associated with k = 10, or does it over- or under-estimate the number of clusters needed?). The second part of the full study is the tutorial, which illustrates how one might implement a clustering algorithm in a GBA, walking through the steps and practical considerations from data identification and coding through cluster interpretation and reporting. 111 Chapter 5: Simulation Study Results The simulation was run in R (R Core Team, 2014), and the final study scripts are provided in Appendix A. The simulation ran for approximately three months, and during that time, some adjustments were made to the code based on performance observations. It is important to note these changes here, as they are mentioned in the discussion of the results; however, none of these changes had an impact on the methods and outcome measures proposed in Chapter 4. Early in the analysis, it was observed that the SOM algorithm occasionally did not return the correct number of clusters (k). For example, in the initial cells, the study was set to analyze k = 4 clusters, but the SOM algorithm occasionally returned only three clusters. This was occurring in approximately 1% of the results when the probability of a response in the relevant variables was 0.6 or 0.9. This phenomenon did not appear to be related to overlap between clusters’ relevant variables, as measured by RMSD. Further investigation suggested that this was occurring as a result of poor, randomly selected, starting centers for the SOM. Recall that (unless directed otherwise) the SOM algorithm selects its starting centers randomly from the observed data. If two of those centers are extremely similar, those clusters will be competing for the same cases, switching case assignments between the two clusters back and forth and eventually having nearly identical center coordinates. At the end of the data presentation iterations, the cases are all assigned to whichever center (node) was active when the algorithm stopped. Similar issues can occur in k-means, where poor starting centers can lead to bizarre clustering solutions, but recall that the k-means algorithm in R runs itself multiple 112 times and selects the outcome that minimizes the within sum-of-squares (R Core Team, 2014). The SOM algorithm, by comparison, presents the data to the model over and over to keep fine tuning the centers as the neighborhood radius decreases, but the model never goes back to try the process with new starting centers. Thus the SOM algorithm still has a risk of returning a poor cluster partition, and savvy researchers should run try the SOM algorithm multiple times with different starting points to compare solutions (e.g., Amershi & Conati, 2009). Despite this observation, the simulation was allowed to run as proposed, with the modification that if the SOM algorithm returned fewer clusters than the k clusters specified by the simulation cell, it was rerun with new starting centers. Recall that the researcher specifies the number of clusters required for analysis, and this was a necessary change to allow for proper calculation of Kappa when calculating the true cluster recovery. It also seemed like a reasonable correction to make, since in practice, a researcher who requested a k cluster solution would likely not accept a solution that produced fewer clusters. Another observed issue was that in some datasets, the ROCK algorithm could not converge on the specified number of clusters. This only occurred when k = 35, Nk = 50, and probability of a response on the relevant variables was 0.3. In other words, the ROCK algorithm had difficulty converging on the desired number of clusters when there were a lot of small clusters with low response rates in the relevant variables. When k = 35, Nk = 50, probability of a response on the relevant variables was 0.3, and cluster size variance was nine, 397 datasets were created, and the ROCK algorithm converged in 24 (6%) of them. Similarly, under the same data conditions but with cluster size variance set 113 to 100, 391 datasets were generated, and the ROCK algorithm converged in only 34 (9%) of them. The number of datasets mentioned in the preceding paragraph hints at the final issue, which was time. Despite running for three months, not all cells could be run with the desired number of datasets by the end date of the project. Datasets with more cases and more variables took more time to both simulate and analyze, and several cells did not have the 400 datasets desired. Nevertheless, the results are still interpretable and the trends are evident with the datasets that were completed. The study cells had the following final counts for simulated datasets as of the study’s completion (Table 18): Table 18 Actual Number of Datasets Generated per Cell Pr(1) k Size 0.3 0.6 0.9 4 ~N(50,9) 400 400 400 ~N(50,100) 400 400 400 ~N(100,9) 400 400 400 ~N(100,100) 400 400 400 10 ~N(50,9) 400 400 400 ~N(50,100) 400 400 400 ~N(100,9) 400 400 400 ~N(100,100) 400 400 400 20 ~N(50,9) 400 400 400 ~N(50,100) 400 400 400 ~N(100,9) 400 400 400 ~N(100,100) 400 400 400 35 ~N(50,9) 397 400 400 ~N(50,100) 391 400 230 ~N(100,9) 140 400 120 ~N(100,100) 140 400 130 114 Research Question 1: Kappa Differences by Varying Individual Conditions The first research question investigates the differences in mean Kappa values between the true clusters and the k-means, SOM, and ROCK algorithms when they are given the correct value of k under each of the individual conditions manipulated by the simulation study. It is critical to underscore that the first research question looks at individual aspects of the data manipulation in the simulation study, aggregating mean Kappa values that were calculated across many conditions. This is effectively looking at marginal effects, but this research question does not consider interaction effects. For example, the first research question investigates if mean Kappa values change as the number of clusters increases, without consideration for the size of the clusters or the probability of a response in the relevant variables. These combinations of conditions are explored in the second research question below, so the results of the first research question are intended to be exploratory, identifying the trends suggested by aggregated marginal effects before disaggregating and diving into the individual cell results in the second research question, which will illustrate the variability of the results. As an additional note, recall that the ROCK algorithm could not converge with many datasets where k = 35, Nk = 50, and probability of a response on the relevant variables was 0.3. In the cases where it did converge, Kappa values were near zero. The lack of the ROCK results in these study cells means that the average ROCK Kappa values are likely inflated in some of the following figures. Had the ROCK algorithm converged in more datasets in these cells, one likely would have observed more low Kappa values, which would lower the mean Kappa scores. Where this has occurred, the plot point is marked with an asterisk (*). 115 To explore the marginal effects, one can first examine the suggested impact on mean Kappa values when the number of clusters changes, aggregated over the other conditions in the study. Figure 8 shows the mean Kappa values for the algorithms’ true cluster recovery over the varying number of clusters in the study (4, 10, 20, and 35 clusters). Figure 8 suggests that the k-means and SOM algorithms can be expected to perform worse at recovering the true clusters when the number of clusters increases, with SOM performance declining faster than the k-means algorithm. Conversely, the average Kappa value for the ROCK algorithm’s recovery of true clusters suggests that ROCK’s true cluster recovery improves when the number of clusters increases, although recall that the ROCK algorithm often could not converge when there were 35 clusters with response probabilities of 0.3. When considering number of clusters alone, Figure 8 suggests that k- means may be the best algorithm (of the three compared here) for analyzing GBA data. When the game is low autonomy and there are minimal possible gameplay strategies (i.e., fewer clusters), both k-means and SOM are expected to perform well. When working with higher autonomy GBAs where many types of work processes may be observed, ROCK and k-means may be preferable. 116 Figure 8. Mean Kappa rates by algorithm for different numbers of clusters k. The ROCK plot marked with an asterisk (*) may be inflated due to the lack of convergence at 35 clusters (?̅?𝑘 = 50 cases) and probability of response = 0.3. Standard error ranges: k-means – 0.001-0.007, SOM – 0.002-0.009, ROCK – 0.006- 0.009. To further explore the marginal effects, one can next examine the suggested impact on mean Kappa values when the average size of the clusters changes, aggregated over the other conditions in the study. Recall that the study manipulated the average size of the clusters, specifying two levels: small cluster sizes (mean Nk = 50) and large cluster sizes (mean Nk = 100). Figure 9 suggests that both k-means and SOM are expected to perform well under both scenarios (discounting other data conditions), but that the mean Kappa values for recovering the true clusters often increased with larger sample sizes. The ROCK algorithm did not appear to perform as well as the other two algorithms, and ROCK’s mean Kappa value when Nk = 50 is likely inflated due to the lack of convergence in the datasets where k = 35 and probability of a response was 0.3. Regardless, k-means generally performed the best on average with both cluster size options, when discounting other conditions of the data. When considering the sheer * 0.00 0.20 0.40 0.60 0.80 1.00 0 10 20 30 40 M ea n K ap p a Number of Clusters (k) k-means SOM ROCK* 117 volume of GBA work process data alone, Figure 9 suggests that k-means and SOM may be preferable algorithms for working with large datasets in some scenarios. Figure 9. Mean Kappa rates by algorithm for two different average cluster sizes Nk. The ROCK plot marked with an asterisk (*) may be inflated due to the lack of convergence at 35 clusters (?̅?𝑘 = 50 cases) and probability of response = 0.3. Standard error ranges: k-means – 0.002-0.003, SOM – 0.002-0.003, ROCK – 0.004 for both. In addition to the general trends associated with the size of clusters, one can also look at marginal effects associated with the variance of the cluster sizes; i.e., if the average cluster had 50 cases in it, clusters could either be similar in size or vary in size. This condition was manipulated by letting the variance of the cluster sizes Nk be either 9 or 100 (e.g., if the mean cluster size was 50 cases, individual cluster sizes would be expected to range from 41-59 cases or from 20-80 cases). This was an important aspect of the simulation because equally sized clusters in GBA work process data are unlikely in practice and because Robinson (2017) claimed that the k-means algorithm will perform poorly when clusters vary in size, with k-means splitting up larger clusters and ignoring smaller clusters in an effort to minimize the sum-of-squared error in its solution. Figure 10, however, suggests that varying cluster sizes did not meaningfully impact the mean * 0.00 0.20 0.40 0.60 0.80 1.00 50 100 M ea n K ap p a Average Cluster Size (Nk) k-means SOM ROCK* 118 Kappa values in any of three algorithms’ recovery of the true clusters (ignoring, for the moment, other aspects of the data). In fact, k-means performed better than SOM and ROCK again. This was unexpected, given Robinson’s (2017) claims that k-means requires similar cluster sizes and the fact that k-means is not intended to be used with binary data. Robinson’s (2017) claim may hold true when one cluster is far smaller than another, as he demonstrated with an example, but this may cross into the realm of k- means simply doing a poor job with outliers (Berkhin, 2006). Under the aggregated conditions of this simulation study, k-means (as well as SOM and ROCK) were not notably impacted by increasing the cluster sample size variance from 9 to 100. Figure 10. Mean Kappa rates by algorithm for two different variances in the data’s cluster sizes Nk. The ROCK plots marked with an asterisk (*) may be inflated due to the lack of convergence at 35 clusters (?̅?𝑘 = 50 cases) and probability of response = 0.3. Standard errors are: k-means – 0.00 for both, SOM – 0.003 for both, ROCK – 0.004 for both. One can also explore the marginal effects on true cluster recovery over the different probabilities of response in the 20 relevant variables that defined each cluster, aggregating again across the other conditions in the study. Recall that all variables in each dataset had a default observation rate of 0.05, but then 20 variables were assigned to * * 0.00 0.20 0.40 0.60 0.80 1.00 9 100 M ea n K ap p a Average Cluster Size Variance k-means SOM ROCK* 119 each cluster that had probabilities of 0.3, 0.6, or 0.9 (selected randomly with replacement between clusters), depending on the specification of the simulation cell. Higher probabilities indicate that the cases in the cluster were all highly similar, as might occur in a very structured, low-autonomy GBA. Low probabilities represent clusters with more diversity in observed work processes, but where students in that cluster are all drawing from a common set of variables. When probabilities of observed responses are higher for clusters’ relevant variables, the response patterns within a cluster correlate well with each other, generally making it easier to identify a cluster with an axis parallel soft-projected clustering algorithm. When the probability of a response on a cluster’s relevant variables is lower, and the probability of a response is also very low on the remaining irrelevant variables in the dataset, then there is more similarity in all response patterns across clusters, and correlations with cases between clusters will be similar to cases within a cluster, thus making it more difficult to correctly identify clusters. Following these expected trends, Figure 11 suggests that, when aggregated across conditions, mean Kappa values of true cluster recovery may be expected to increase as the probability of observing a response increases, with all three algorithms generally recovering the true clusters nearly flawlessly when the probability is 0.9. The k-means algorithm performs nearly perfectly under many scenarios when probability is 0.6, and SOM performance is also often good under this condition. All three algorithms struggle on average when probability of response is 0.3, although the mean Kappa values for k- means and SOM may be good enough to justify the use of these algorithms under some conditions when evaluating a low stakes GBA. 120 Figure 11. Mean Kappa rates by algorithm for three different probabilities of response in the relevant variables defining clusters. The ROCK plot marked with an asterisk (*) may be inflated due to the lack of convergence at 35 clusters (?̅?𝑘 = 50 cases) and probability of response = 0.3. Standard error ranges: k-means – 0.001- 0.004, SOM – 0.002-0.004, ROCK – 0.002-0.004. The final condition of interest for the first research question was whether the amount of overlap between relative variables influenced the algorithms’ abilities to recover the true clusters; i.e., would the algorithms perform better if the clusters were well-separated? The degree of overlap (or similarity) was measured with the RMSD between clusters, and it was not directly manipulated in the simulation code. Instead, the 20 relevant variables defining each cluster were randomly selected (with replacement), meaning that in some datasets, clusters might share multiple defining, relevant variables, whereas in other datasets, that overlap might be smaller. The RMSD values were contingent on the number of total variables and probabilities of observed responses in clusters’ relevant variables in each cell , so it is most illuminating to consider each cell individually. Since RMSD is a continuous measure (rather than the discrete levels set in the previous conditions), one can look at the Pearson correlations between RMSD and the algorithms’ Kappa values for recovering the * 0.00 0.20 0.40 0.60 0.80 1.00 0.3 0.6 0.9 M ea n K ap p a Probability of Correct Response on Relevant Variables k-means SOM ROCK* 121 true clusters in each cell. Table 19 shows these correlations. Overall, Table 19 shows that increasing the between-cluster variance had little effect on improving algorithm performance, with most cells showing only low positive or negligible correlations between RMSD and algorithms’ Kappa values. The notable exceptions occurred for the k-means and SOM algorithms when there were only four clusters and the probability of a correct response was 0.3. In these cells, there were low to moderate positive correlations between RMSD and the k-means and SOM Kappa values, with higher correlations occurring when the clusters were larger. Note that lower correlations are associated with a lack of variation in Kappa values in the solutions, and correlations are omitted in cells where the correlation was not significant or the algorithm consistently returned Kappa values of 1.00 for all datasets. Table 19 Correlations between RMSD and Kappa by Study Cell and Clustering Algorithm k-means SOM ROCK Pr(1) Pr(1) Pr(1) k Size 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 4 ~N(50,9) 0.39 0.04 ___ 0.33 -0.05 -0.04 0.17 0.26 ___ ~N(50,100) 0.38 0.08 ___ 0.29 0.00 0.00 0.19 0.18 0.06 ~N(100,9) 0.51 0.16 ___ 0.49 0.02 0.07 0.17 0.17 ___ ~N(100,100) 0.48 0.18 ___ 0.48 -0.03 -0.04 0.18 0.16 0.06 10 ~N(50,9) 0.17 0.04 ___ 0.09 0.01 0.07 0.13 0.13 ___ ~N(50,100) 0.12 0.07 ___ 0.04 -0.01 -0.05 0.12 0.20 ___ ~N(100,9) 0.28 0.04 ___ 0.03 0.09 -0.01 0.18 0.17 ___ ~N(100,100) 0.28 0.01 ___ 0.07 0.12 -0.04 0.22 0.09 ___ 20 ~N(50,9) 0.04 0.05 ___ 0.09 0.09 0.09 0.11 -0.01 0.03 ~N(50,100) 0.13 0.00 ___ 0.07 -0.08 0.08 0.11 0.03 ___ ~N(100,9) 0.18 0.03 ___ 0.03 -0.04 0.01 0.18 0.00 ___ ~N(100,100) 0.12 0.04 ___ 0.11 -0.01 -0.05 0.16 0.06 0.05 35 ~N(50,9) ___ 0.00 0.04 ___ 0.07 0.05 ___ 0.06 0.05 ~N(50,100) ___ 0.01 ___ ___ -0.04 ___ ___ 0.11 ___ ~N(100,9) ___ -0.06 ___ ___ -0.08 ___ ___ 0.08 ___ ~N(100,100) ___ 0.04 ___ ___ -0.12 ___ ___ 0.08 ___ 122 Research Question 2: Conditions Under Which Kappa is Expected to Exceed 0.80 The second research question investigated the combinations of the study conditions that are expected to yield true cluster recovery Kappa values of at least 0.80. Recall that Landis and Koch (1977) suggested that Kappa values of 0.80 or higher represent near perfect agreement, or in this case, near perfect recovery of the true clusters. The 0.80 cutoff is admittedly arbitrary, but it serves as a useful reference point for comparing the performance of the algorithms across the different cells. With this in mind, a high level interpretation can be made by simply looking at the percentage of solutions in each cell that have Kappa values of 0.80 or higher. Table 20 shows the percentage of each algorithm’s solutions that exceeded Landis and Koch’s (1977) 0.80 Kappa threshold for near perfect true cluster recovery. Table 20 shows that the k-means and SOM algorithms recover the true clusters reasonably well in most cases where the probability of a response in the relevant variables is 0.6 or 0.9, although the k-means algorithm performs better on average. When the probability of a response is 0.3, the k-means and SOM algorithms perform worse when there are a large number of small clusters. The SOM algorithm begins to perform poorly when the response level is 0.3 and there are 20 clusters, even when those clusters are large. Conversely, the ROCK algorithm has a near perfect recovery rate when the probability of response is 0.9 and a total failure rate when response probability is 0.3. When the response probability is 0.6, the ROCK algorithm recovers the true clusters much better when there are 35 clusters. 123 Table 20 Percentage of Solutions with Kappa ≥ 0.80 by Study Cell and Clustering Algorithm k-means SOM ROCK Pr(1) Pr(1) Pr(1) k Size 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 4 ~N(50,9) 93% 100% 100% 90% 93% 96% 0% 35% 100% ~N(50,100) 81% 100% 100% 79% 91% 88% 0% 37% 100% ~N(100,9) 100% 100% 100% 100% 96% 97% 0% 23% 100% ~N(100,100) 100% 100% 100% 100% 92% 94% 0% 24% 100% 10 ~N(50,9) 95% 100% 100% 61% 85% 94% 0% 69% 100% ~N(50,100) 82% 100% 100% 41% 77% 88% 0% 72% 100% ~N(100,9) 100% 100% 100% 91% 89% 96% 0% 16% 100% ~N(100,100) 100% 100% 100% 90% 88% 95% 0% 17% 100% 20 ~N(50,9) 27% 100% 100% 0% 86% 89% 0% 63% 100% ~N(50,100) 12% 100% 100% 0% 72% 85% 0% 64% 100% ~N(100,9) 100% 100% 100% 47% 90% 88% 0% 82% 100% ~N(100,100) 99% 100% 100% 36% 86% 88% 0% 82% 100% 35 ~N(50,9) 0% 98% 81% 0% 89% 75% 0% 99% 100% ~N(50,100) 0% 98% 90% 0% 78% 75% 0% 99% 100% ~N(100,9) ___ 100% ___ ___ 100% ___ ___ 100% ___ ~N(100,100) ___ 90% ___ ___ 80% ___ ___ 100% ___ Visually, one can see these results by looking at boxplots of the Kappa values for each cell of the study. Figure 12 shows these distributions by cell, plotting boxplots with the same number of clusters and response probability next to each other, ordered by cluster size and cluster size variance. Figure 12 helps illustrate some of the effects for the interactions of the simulated data. For example, in the k-means and SOM algorithms, higher variability in cluster sizes does appear to lower Kappa values on true cluster 124 Nk Boxplot Legend (Left to Right): ■~N(50,9) ■~N(50,100) ■~N(100,9) ■~N(100,100) k k-means SOM ROCK 4 10 20 35 Figure 12. Kappa distributions by cluster size, for each combination of response probability and k clusters. 125 recovery when the cluster sizes are small (compare the two leftmost k-means and SOM boxplots when the cluster sizes are 50, comparing cluster size variance at 9 versus 100). One also observes a scattering of low Kappa outliers in the SOM solution, which, as discussed at the beginning of this chapter, appear to be a result of the random starting centers selected for these solutions, and not a direct effect from the cell parameters themselves. Figure 12 also makes it easy to see which algorithms recover true clusters nearly perfectly. For example, notice how the k-means Kappa values are consistently close to 1.00 when response probabilities are 0.6 and 0.9, but that this performance begins to deteriorate when the number of clusters is increased to k = 35. Notice too how well the ROCK algorithm performs when the response probability is 0.9 except for a few outliers. The ROCK boxplots when response probability is 0.6 in Figure 12 are also of interest, because one sees that the ROCK algorithm performs worse at recovering large clusters when k = 4 and k = 10. Based on these Kappa values, one might almost consider using the ROCK algorithm when response probability was near 0.6 and there were 10 clusters anticipated, but not if a large number of students were going to play the GBA. However, the ROCK algorithm begins to perform very well when response probability is 0.6 and k = 20 or k = 35, suggesting ROCK might be a great algorithm to use when there are high autonomy games and lots of observable strategies, with moderate or high consistency in the work process response patterns within clusters. Research Question 3: Conditions In Which Algorithms Can Validate Each Other By inspecting the boxplots in Figure 12, one can see that all three algorithms have high Kappa values when the probability of response in the relevant variables is 0.9. This 126 appears to be true regardless of the number of clusters, the size of the clusters, or the variation in cluster size in this study. When the response probability is 0.9, there is very little within-cluster variance, and all response patterns in a cluster are highly similar on a defining set of variables. This does not necessarily describe a low autonomy GBA, considering that one could have 35 clusters with response probabilities of 0.9, meaning that students are engaging in 35 different work processes; however, it does suggest that a GBA would be highly structured to minimize variation in work process patterns. These probabilities would not occur in a high autonomy game that was mixed with lots of chaos elements or relatedness elements to give the game a highly realistic, open-world feel. Similarly, all three algorithms recover true clusters reasonably well (in terms of observed Kappa values) when the response probability is 0.6 and there are 20 clusters. In this scenario, all three algorithms again could be used to validate each other. This is encouraging news from a GBA standpoint, as these parameters suggest that a moderate amount of autonomy can be designed into the GBA and still support the use of these clustering algorithms. A GBA with 20 clusters and a 0.6 probability of a response on each cluster’s relevant variables is one in which there is diversity in work processes (20 clusters) and some degree of diversity within those work processes—a response probability of 0.6 increases within-cluster variance compared to a response probability of 0.9. Only the k-means and SOM algorithms have acceptable expected Kappa values when the response probability is 0.3, and even then, that only occurs when there is a small number of clusters: k = 10 or 4 clusters. The k-means algorithm produces reasonable Kappa values when k = 20, but the SOM algorithm does not, so they could not 127 validate each other. When there are only a small number of clusters (k = 4 or k = 10), the SOM and k-means algorithms both perform well at each of the three probabilities of observed response. This suggests that these two algorithms may be used to validate each other under these data conditions, and that these two algorithms may be preferable to the ROCK algorithm when analyzing binary GBA data in lower autonomy games. Perhaps the most important observation overall is that there are very few conditions where Kappa values greater than 0.80 occurred in all datasets. The SOM algorithm tended to have outlier solutions with low Kappa values in almost every cell. As noted previously, these are likely caused by poor starting centers, and these failures could be obviated by running multiple SOM analyses with different starting centers, as the k- means algorithm does. In terms of informing practice, these results could be seen as evidence that researchers should analyze their GBA data with more than two clustering algorithms, since the algorithms have a susceptibility to converge on a poor solution in most of the simulation cells. Using three algorithms when analyzing GBA work process data gives the researcher a better chance of seeing at least two algorithms create similar solutions for validating the use of the cluster analysis, as is shown in the tutorial in the next chapter. Research Question 4: Finding the Number of Clusters with DBI The fourth research question was investigated by simulating datasets in each cell as before, but then asking the algorithm to try different values of k for each dataset. Because the data were simulated, the true number of clusters was known, so the algorithms were instructed to run through a range of k values from three below the true k to three above (two below when the true number of clusters was k = 4, since the 128 algorithms cannot create a single cluster solution where k = 1). This means that each algorithm created seven cluster solutions for each simulated dataset, each with a different number of clusters. The DBI values were calculated for each of these solutions to see which solution had the best fit (lowest DBI). This allowed for an investigation into whether the algorithms’ best-fitting solution tended to overestimate, underestimate, or match exactly the true number of clusters. Of course, over 400 datasets in each cell, some solutions might overestimate the number of clusters and some solutions might underestimate the number of clusters. It is helpful to consider whether an algorithm tends to overestimate or underestimate the number of clusters, using DBI as the decision criterion. Figure 13 represents these proclivities visually. The grey segments of the chart represent the percentage of times the lowest DBI corresponded to the correct number of clusters. The black and white segments represent the percentage of datasets where the lowest DBI value over- or underrepresented the true number of clusters, respectively. Figure 13 shows that the algorithms tend to find better fitting solutions when they are instructed to pick more than the true number of clusters (with a few exceptions in the SOM algorithm where k = 4 and probability of an observed response is 0.3). This can be seen from the number of charts where the black segment (DBI overestimating the number of clusters) is the largest segment. Figure 13 shows two algorithm-level trends: first, that the ROCK algorithm’s best-fitting solution never matches the true number of clusters, and second, that the k-means algorithm is the most likely to identify the correct number of clusters in each cell of the study when using DBI as the selection criterion (although it still tends to overestimate the number of clusters). 129 k-means SOM ROCK Pr(response=1) Pr(response=1) Pr(response=1) k Nk 0.3 0.6 0.9 0.3 0.6 0.9 0.3 0.6 0.9 4 ~ N (5 0 ,9 ) ~ N (5 0 ,1 0 0 ) ~ N (1 0 0 ,9 ) ~ N (1 0 0 ,1 0 0 ) 1 0 ~ N (5 0 ,9 ) ~ N (5 0 ,1 0 0 ) ~ N (1 0 0 ,9 ) ~ N (1 0 0 ,1 0 0 ) 2 0 ~ N (5 0 ,9 ) ~ N (1 0 0 ,9 ) Figure 13. Pie charts depicting the percentage of datasets was correctly identified (■ grey segments), overestimated (■ black segments), and underestimated (□ white segments), as determined by the clustering solution with the lowest DBI. Due to the computational load of these comparisons, results are provided for the first 30 cells of the study. 130 The tendency to overestimate the number of clusters may be attributable to the use of DBI as the decision criterion, and not a feature of the algorithms themselves. Recall that DBI is lowest when there is higher between-cluster variance and lower within-cluster variance, but this may be achieved by creating more small clusters, tailor- fitted to a few similar cases. Recall in the feasibility study in Chapter 3 that the ROCK algorithm produced superior DBI values compared to the SOM and k-means algorithms by lumping most cases into a single large cluster and creating a few small clusters for a the remaining outlying cases. The low within-cluster variance in these smaller clusters in those ROCK solutions helped yield a much lower DBI, even though the solution itself would not be a very useful partition of the cases. Baker et al. (2017b) observed that adding more clusters will often lead to better coverage of the dataspace and lower within-cluster variance, and they recommend using a method like AIC or BIC to penalize models with more clusters. Recall in the discussion from Chapter 3 that Pelleg and Moore (2000) developed an algorithm for using these information criterion measures to select the number of clusters for k-means, and Wang et al. (2005) created a similar method for SOM; however, Ishioka (2005) showed that Pelleg and Moore’s (2000) algorithm still could overestimate the true number of clusters. Recall too that Dimitriadou et al. (2002) claimed that DBI was the best measure for identifying the number of clusters in binary data, such as those created in these simulated GBA datasets, but their study showed very similar results, with DBI correctly identifying the number of clusters in only 35% of their binary datasets (analyzed using k-means and a hard competitive learning clustering algorithm). Dimitriadou et al. also observed a tendency to overestimate the true number of clusters when using DBI, and it is worth 131 noting that they did not evaluate the performance of AIC or BIC for selecting the number of clusters. Figure 14 illustrates the challenges of using DBI for deciding the number of clusters. Figure 14 shows DBI values for the k-means, SOM, and ROCK algorithms across varying numbers of clusters (k) for a dataset that was simulated with 20 clusters, average cluster sizes of 50 cases (with cluster size variance set to 9), and a 0.6 probability of an observed response in the clusters’ relevant variables. The DBI plots do not monotonically decrease like factor analysis scree plots, and there are clearly instances of local minima. Davies and Bouldin (1979) recommended using the global minimum DBI as the indicator of the best-fitting solution, but the plots in Figure 14 show that for each algorithm, many DBI plots are similar to the lowest value, and there is no DBI distribution that can be used to determine if these are statistically different. What Figure 14 does show is that for this simulated dataset, the use of the lowest DBI value for determining the number of clusters led to an overestimation of the number of clusters in all three algorithms. 132 Figure 14. Plot of DBI values for three clustering algorithms when varying the number of requested clusters from k = 17-23 clusters. The dataset used for this plot was simulated to have 20 clusters. The DBI plots have local minima, and the global minimum in this dataset overestimates the number of true clusters that were used to generate the data. Simulation Study Results Summary This study reviewed the performance of the k-means, SOM, and ROCK algorithms under a variety of data scenarios representing common aspects of GBA work processes, including the presence of construct-irrelevant variables, differences in the probabilities of responses, overlapping processes between clusters, varying cluster sizes, and differences in autonomy (i.e., different numbers of clusters). Under the simulated conditions, the k-means algorithm tended to be the safest bet, outperforming the SOM and ROCK algorithms’ recovery of the true clusters in many cells. The SOM algorithm tended to perform nearly as well as the k-means algorithm, but it frequently had outlying simulated datasets where it converged on a solution that did not recover the true clusters well. This seems to be primarily attributable to the poor selection of starting centers, and 133 allowing the SOM algorithm to run through many sets of starting centers and then to choose its best-fitting solution (as the k-means algorithm does) would likely obviate these anomalies. The ROCK algorithm performed poorly unless the response probabilities were set to 0.9 for the relevant variables in each cluster, suggesting that the ROCK algorithm is best suited for data in which clusters are clearly distinguished from each other and where each clusters cases are almost guaranteed to have observations in a shared set of relevant variables. The ROCK algorithm’s recovery of true clusters generally improved as the number of clusters increased, although it never performed well (or in some cases, converged at all) when the response probability was 0.3. What this shows is that all three of these algorithms may yield misleading results under certain conditions, thus underscoring the need for GBA researchers to justify their selection of clustering algorithms and cross-validate across multiple algorithms. One cannot assume that one preferred clustering algorithm can work as a measurement model in all GBA designs or work process datasets. The practical application of the use of multiple algorithms is demonstrated in the following chapter (Chapter 6), and the implications for researchers are explored further in the Discussion chapter (Chapter 7). 134 Chapter 6: Tutorial Example Study This tutorial is intended to provide examples and guidance on the steps and decisions one should consider when using clustering algorithms to evaluate students in a GBA. The steps here are not necessarily exhaustive, nor can one expect them to generalize to every possible game design. Moreover, the concepts discussed here are not original. These steps are, however, an amalgamation of best-practice recommendations or observed applications pooled from multiple sources, and by combining them here, this tutorial provides value by exemplifying these ideas in one place and explicitly stating how they relate to GBA research using clustering algorithms. The underlying purpose of these steps and decisions is for the researcher to be able to document and defend assessment inferences made on the basis of a clustering algorithm, essentially building supporting evidence and obviating rebuttals for a validity argument about the interpretation of students’ cluster assignments. With this goal in mind, the tutorial will walk through the following eight steps: 1. Coding and selecting variables from the GBA log file. 2. Selecting clustering algorithms. 3. Creating a holdout sample for cross-validation. 4. Running the algorithms. 5. Validating across algorithms to pick the best clustering solution. 6. Validating against the holdout sample and an external criterion. 7. Interpreting, labeling, or grouping clusters for reporting. 8. Investigating changes in students’ clusters over time or the course of the game. 135 Beanstalk Game Beanstalk is a mathematics game designed to teach fulcrum principles, specifically the sum of cross products rule that determines whether a scale with a fulcrum (e.g., a see-saw) will balance, given the weights on the scale and their distance to the fulcrum. Carnegie Mellon (2013) explained that the game is intended to teach principles of balance, scientific inquiry, and social-emotional skills, which they aligned to K-3 objectives in the National Research Council’s Framework for K-12 Science Education and Pennsylvania’s Academic Standards. Although the game is aligned to K-3 objectives, the designers reported that the game is intended for children between ages 5 to 11 years old. The designers also noted that the levels in the game are intended to increase in difficulty as the mental models become more complex. The designers explained that there are four progressively complex mental models in the game, and that these are based on the research of Siegler (1976) and Siegler and Chen (2002). Carnegie Mellon (2013) reports that these mental models are:  Paying attention to weight, but not the distances on the scale.  Considering distance when weight is equal on both sides of the scale.  Considering both weight and distance while being provided with cues for congruity.  Using the sum of cross products rule. The game storyline centers around a monster who lives on another planet and drops his teddy bear while he sleeps. The teddy bear lands in the room of the avatar controlled by the student, and a magic beanstalk begins to grow. Teetering atop the beanstalk is a platform, which the student must balance to make the beanstalk grow 136 further. The student is aided by a talking chicken and a talking crow that provide hints, interspersed with cheerful banter. The scale is tipped to one side with the weight of a collection of ladybugs. The student must grow flowers or use eggs from the talking chicken to balance the scale and make the beanstalk grow. The goal is to grow the beanstalk up into space so that the teddy bear can be returned to the extraterrestrial monster. The monster explains the solution to the scale if the student uses up the hints. The student’s actions in the game primarily consist of picking where to place flowers or eggs and clicking on the birds to get a hint. The student can also undo a choice to rearrange the positions of flowers and eggs on the scale. Figure 15 shows a screenshot from the game (Carnegie Mellon, 2013). Figure 15. Screenshot from Beanstalk. The student must grow flowers on the left side of this scale to balance the weight of the ladybugs on the right side. 137 Beanstalk employs many of the game design elements defined by Mislevy et al. (2014). Rules, connections, game state, and mechanics all define what the student can do and what the outcomes will be. The game has low autonomy—there is relatively little variability in possible student actions, and there are no chaos elements—but that may be appropriate for the target age range. Competence elements, however, are featured prominently. The bulk of the game design is around creating a hierarchy of challenges and teaching the student to achieve those goals through experimentation and observation. All of this is saturated with relatedness elements (arguably the distinguishing factor between games and other instrument) such as music, background animations, a fantastical plot, avatars, quirky characters, and impossible scenarios (Carnegie Mellon, 2013). Although Beanstalk is clearly a game, it is not an assessment, so in this sense, it is not strictly a GBA. The game tracks students anonymously, and there is no reported output or evaluation of the student at the end of the game. This, however, is the hypothetical scenario that is being explored by this tutorial: were one to begin using Beanstalk as an assessment, how could one create a measurement model with a clustering algorithm to evaluate students? Beanstalk Student Data The log file data in the 'Beanstalk_2013_05' dataset was accessed via DataShop (Koedinger et al., 2010). This dataset consists of 42,802 recorded student actions collected from 177 students in 371 gameplay sessions at two Pennsylvania elementary charter schools on May 8-9 and May 29-30, 2013. Beanstalk does not collect any identifying information, and anonymous IDs are used to represent students in the dataset (Carnegie Mellon, 2013). As such, there is no information about the ages or 138 demographics of the students in this dataset. The dataset is a log file format, with each row representing a recorded student action, tracking key data like the anonymous student ID, timestamps, attempt numbers, locations of items placed on the scale, the types of hints provided, and outcomes. Figures 16 shows a screenshot of excerpted columns from the 'Beanstalk_2013_05' dataset. Figure 16. Screenshot of Beanstalk log file dataset. The log file rows represent individual actions made by the students, such as asking for a hint and placing an item on the scale. Step 1: Recode the Data and Select Variables To create a numerical dataset that can be used by a clustering algorithm from log file data, one must first consider the unit of analysis. For this example, the unit of analysis is a student’s attempt at a single level, with each game level representing an unbalanced scale scenario in the Beanstalk game. This means that if a student tried the same scenario twice, there would be two rows in the dataset used for analysis. This allows one to create clusters representing strategies at each level, recognizing that the student may come back and try a new strategy. In short, every attempt at a level gets a row in the dataset. The final recoded dataset for analysis is made up of 21,018 cases representing attempts at game levels in Beanstalk. Kerr and Chung (2012) and Stevens and Casillas (2006) also used the attempts as the unit of analysis in their GBAs. This has the benefit of showing how individual 139 students change strategies across multiple attempts at a level, as is discussed in Step 8 below. Recall that one of the reasons to use a game as an assessment is that it should be fun and challenging enough that students will fail certain levels and keep trying them until they succeed, so most GBA research should focus on the level attempts as the unit of analysis. This is not, however, the only way to structure the dataset. One could set up the dataset so that instead of attempts at each level, the rows represented a student, and the student’s repeated attempts at a level would be documented in the columns. One could also look at the student’s performance across levels in a single row, examining gameplay processes that span across the designer-defined segments of the game. If data were available, one could also analyze at the level of classrooms or schools, if desired. The data could also be turned around so that each row represented a level in the game, with student gameplay represented in columns. This could be used to examine which levels yield similar gameplay patterns, perhaps as an investigation into the effects of certain design decisions. The unit of analysis represented by rows in the dataset depends on the research question, and since this tutorial is interested in examining a student’s work process, each row represents a single response pattern for a given level. Next, one must decide which variables to include in the columns. These decisions should be guided by the researcher’s theory or game design choices that are meant to be used as evidence in the CAF model. For this analysis, the dataset includes variables that depict:  The game level (i.e., the specific scale scenario being solved).  Where the student placed flowers or eggs on the scale to balance it. 140  How many flowers or eggs the student placed at each location on the scale.  Whether the student got hints or answers from the talking chicken, crow, or monster. What is being left out? Student response times, attempt number, whether the item was answered correctly, and other data included in the log file. These data may still need to be retained in the dataset for data management or other reporting, but they were not used in the clustering algorithms. This is not to say that they could not be used as assessment evidence, but for the inference being made in this tutorial, the student’s work process is only evaluated on where the student placed items on the scale and whether he or she asked for help from the game characters. Remember that the purpose of using GBAs should be to evaluate process, not outcome, since everyone should be able to eventually win a game, but the evidence of interest is how they do so over repeated gameplay. Finally, the variables that were selected to be used in the clustering algorithms may have to be recoded. For each response pattern for a given item scenario in the game, binary variables were created that corresponded to each position on the scale, as well as the number of times the student used that position. For example, the variable for placing one flower at the third position to the right of the fulcrum was labeled x3.1. If the student placed a second flower on top of the first, this would be recorded in x3.2. The binary coding of this method allows one to use the ROCK algorithm, which only analyzes binary data; however, one could just as easily create a single variable with an integer representing the number of flowers the student placed. 141 The use of the hints had to be parsed from the Feedback Text variable shown on the right hand column in the log file data depicted in Figure 16 above. Not all Feedback Text was related to hints. For example, the characters would say non-hint statements when the student completed or failed a level. The characters would also make non-hint comments if the student clicked on the character immediately after having received a hint, presumably to provide some delay between hints and discourage hint abuse. If the student exhausted the hints or failed, the monster would provide guidance on the solution. Thus, additional binary variables were created to identify the number of hints the student requested and whether he or she got the answer from the monster. With eight scale positions, each with up to four flower placements, plus four hints and the monster option, 37 variables were coded for use in the study. From a GBA standpoint, this is a small number of variables. This reflects the low autonomy of the game, as well as the fact that each level of the game is identical in terms of possible actions the student can take. If desired, and if theory supported it, one could add more variables representing things like sequence of actions, response times, and game state, but this also increases the risk of adding irrelevant variables that may impact clustering algorithms’ solutions and their interpretations. Note also that this is a very simple example for the purpose of illustration, but many games will not have the same gameplay variables or even the same game design between scenarios in a game. The researcher needs to consider the structure of the game to know reasonable dividing points that may require new variable coding and separate clustering solutions, such as changes in levels or sections of a game. 142 Step 2: Select Clustering Algorithms When selecting clustering algorithms, it is important that one justifies not just the use of a clustering algorithm as a measurement model, but also the choice of the specific algorithms that will be used in the analysis. When possible, one might frame this rationale in the context of the anticipated data structure, the assumptions of the algorithms, the desired structure of the output, and the research question itself. For example, if identifying outliers is critical for the research, then one should use a clustering algorithm that can identify and isolate outliers from the definition of clusters. As with any research, it is also helpful to consider whether other people have used the algorithm for similar applications. For this tutorial, the k-means, SOM, and ROCK algorithms continued to be used, and the results of the simulation study in Chapter 5 can help support this decision. First, as investigated by the third research question, the distributions of Kappa values for true cluster recovery can be high in some data scenarios, but there are often cases where all three algorithms have a possibility of converging on a solution that does not represent the true clusters well. For this reason, it is recommended that all three algorithms be used in order to cast a broad net and to increase the likelihood of finding two solutions which validate each other. Second, the data coding from Step 1 is all binary. Both the ROCK and SOM algorithms work with binary data, and despite being designed for numerical data, the simulation showed that k-means often outperformed both ROCK and SOM in terms of true cluster recovery and correct identification of the number of clusters, as determined 143 by DBI. As such, all three algorithms may be considered appropriate for analyzing the binary data created in Step 1 from the Beanstalk log file. Third, Beanstalk is a low autonomy game. The student is limited in the actions he or she can make, and the coding from Step 1 only yields 37 variables—many of which were not active in a given game level (for example, the student can only place flowers on one side of the scale in any given level). Given this knowledge of the gameplay and game design, it was reasonable to expect that there would be small numbers of clusters of work processes in each Beanstalk level, and the probability of observing responses in the relevant variables would likely not be low for many variables. The results of the algorithms’ true cluster recovery rates when k = 4 and k = 10 in Table 20 (reported above in Chapter 5) suggest that both k-means and SOM would perform well under the expected conditions of the Beanstalk data, with the expectation that the k-means algorithm would perform best. The ROCK algorithm is also expected to perform well if the work process data is very similar within a cluster (i.e., students in a cluster have nearly identical work process response data), so it may be a useful algorithm for cluster identification or validation in simple levels of the Beanstalk game. Step 3. Plan Cross Validation When using a clustering algorithm to evaluate students’ work processes in a GBA, there is a danger that one might overfit the model to the data, and that the clusters may not generalize well to new student responses. One way to defend against this possibility is to cross-validate a clustering solution with subsets from the original dataset to see if the clustering algorithms fits the new data as well as it fit the training data. 144 The simplest way to do this is to simply split the dataset and keep a subsample as a holdout sample for testing the clustering solution from the remaining training data; however, Baker, Wang, and Andres (2017c) cautioned that, when analyzing repeated observations like multiple attempts at a game, one should make sure that the same students do not have data in both the training sample and the holdout sample, otherwise, instead of testing generalizability to new students, one may in fact be testing the generalizability of new data from the same students. For example, one would not want to have Student A’s responses to Level 1 in the training data and his or her responses to Level 2 in the holdout data. To create the cross-validation holdout sample in the Beanstalk data, one-third of the student IDs were randomly selected from the data, and a new binary variable was added to denote that these students’ response rows all belonged to the holdout sample. This resulted in 59 students being identified for the holdout sample, and these 59 students were associated with 7,309 rows of response data across the different levels of Beanstalk. This left 118 students with 13,709 rows of response data for the training data, which would be used to create clustering solutions for each level in Beanstalk. Step 4: Run the Algorithms Running the algorithms sometimes requires some decisions and data manipulations on the part of the researcher, specifically around variables to exclude under certain conditions, parameters for the algorithms, values of k to try for the number of clusters, and stopping rules. This section breaks down these decisions below. Excluding variables. If students’ data within a variable do not vary much, the variable may not be meaningful for defining clusters that differentiate students’ work 145 processes. If this happens across the entire game, the variable may not be worth including in the dataset at all, but it is often common to have a variable that varies a lot under some conditions and not others. This is the case in Beanstalk, where students would only place flowers and eggs on the side of the scale opposite of the bugs, which the student was trying to balance. If the bugs are on the left side of the scale on a given game level, the student could only place items on the right side of the scale, thus making any gameplay variables associated with the left side of the scale irrelevant for analysis of that level. The previous example is a variable made irrelevant by the rules of the game, but variables may have low variation for other reasons. Consider a game where one option is so clearly wrong that nearly no one uses it, or alternatively, an obvious game move that nearly everyone uses. These variables may provide more discrimination in other levels of the game that are being analyzed separately, but they may not be distinguishing features for every game level. For example, Kerr and Chung (2012) omitted variables from the clustering analysis if the variable was only activated in five or fewer attempts at a level. Given the similar sample sizes (Kerr and Chung had 209 students, the Beanstalk data has 177 students), the same approach was used in the analysis of the Beanstalk data. The one difference was that Beanstalk gameplay variables were excluded from the clustering analyses if five or fewer attempts had that variable activated or if all but five or fewer attempts used that variable. For example, if the hint button was used by only five students on a given Beanstalk level, that variable was omitted from the clustering analysis of that level. Conversely, if all the students except for five used the hint button, that variable would still be omitted from analysis. In both examples, the hint button likely would not 146 provide enough discrimination between students to be valuable in defining clusters of their response patterns. Parameters for the algorithms. The simulation study reported in Chapter 5 showed the importance of experimenting with different parameters and starting values when using clustering algorithms for analysis. Recall that the k-means algorithm, as implemented in R, tries numerous starting centers, randomly selected from the observed data. The k-means algorithm then selects the clustering solution that minimized the within sum-of-squares distances for the clusters (R Core Team, 2014). Conversely, the SOM algorithm takes a random set of cases as starting centers, and then adjusts cluster center coordinates based on repeated presentations of the data to the SOM network (Wehrens & Buydens, 2007). The simulation study in Chapter 5 showed that these repeated presentations of data do not overcome the influence of poor starting centers, given the default parameters used in the kohonen package, and inclusion of an outlying case as a starting center can drastically change the convergent outcome of the SOM model. For this reason, researchers may wish to follow the k-means example, trying multiple random starting centers with the SOM algorithm, then comparing fit to choose the final solution. The ROCK algorithm produces a single solution for a dataset, given a similarity threshold theta value. The feasibility study in Chapter 3 and the results of the simulation study in Chapter 5 show that the ROCK algorithm often cannot converge on a specific number of k clusters under higher ranges of theta values. Researchers who desire a certain number of clusters need to try multiple theta values when creating a ROCK solution. ROCK’s ability to identify potential outliers is useful for some applications, but 147 it is important to have a set number of k clusters when validating the ROCK solution against solutions from other algorithms that have the same number of clusters. For this reason, researchers should be prepared to try multiple theta values in the ROCK algorithm to achieve their desired number of clusters, although as shown in the simulation results in Chapter 5, the ROCK algorithm may not be able to converge even at low values of theta if there are a large number of expected clusters with high within- cluster variability. As with the simulation study in Chapter 5, the ROCK algorithm was set to start at a theta value of 0.95 and decrease in 0.05 increments until the specified number of clusters, k, was attained in the Beanstalk data. Exploring various numbers of clusters. As discussed in Chapter 3, interpretability of the results is often the key criterion for selecting the number of clusters (Hand et al., 2001; Kriegel et al., 2009; Stevens & Casillas, 2006); however, choosing a clustering solution—or at least narrowing one’s options—is considered a good practice, and many researchers have promoted this strategy or built automated clustering algorithms around this premise (e.g., Baker et al., 2017b; Hamerly & Elkan, 2004; Hothorn & Everitt, 2009; Pelleg & Moore, 2000; Wang et al., 2005). Following this guidance, and in the absence of any subject-matter expertise to guide interpretation of the results, the response data from each level in the Beanstalk dataset’s training data was analyzed with the k-means, SOM, and ROCK algorithms, with the number of clusters ranging from k = 2 through k = 10. DBI values were recorded for each algorithm’s solution at each value of k. The upper limit was set to 10 clusters because of the low number of variables used in analysis (37 variables) and the size of the training sample (177 students). If one 148 assumes that students do not keep trying the same strategy within a level over repeated attempts, then using an upper limit of k = 10 implies that the average cluster is expected to contain attempt data from 18 students. Increasing the number of clusters in such cases arguably leads to a level of granularity that would not be useful for analysis of the Beanstalk data, although a researcher may be interested if there is reason to expect large variety in work process response patterns, as might be found in a high autonomy game. A lower limit of 2 clusters also seemed reasonable, since it was known in advance that some students successfully beat levels in the Beanstalk game and others failed them (at least on their initial attempt). This suggested that at least two strategies would be at work in any given level of Beanstalk. Recall that each case is an attempt at a level, so a single student may be represented multiple times (once for each attempt at the level); however, if the student changes his or her work process, each attempt would be assigned to a different cluster. Stopping rules. The analysis of the Beanstalk data was automated with a “for” loop in R to go through each level of the Beanstalk data, identify the attempts from the students in the training sample, drop variables in that level with low variance in responses (which includes dropping variables for which there is no variance because they were not used in that level), and analyze the remaining response variables with the k-means, SOM, and ROCK algorithms for k = 2-10 clusters. The script was instructed to stop analysis and go to the next game level under three scenarios. First, if the number of attempts was less than 25 for a given level, that level was skipped. The cutoff sample size of 25 was chosen based on an inspection of the response frequencies in the data. Of the 93 levels in the Beanstalk data, 88 (95%) had 32 responses or more from the 177 students in the training 149 sample. There was then a gap in the number of observed responses, with the remaining 5 levels having 13 or fewer responses from the 177 students in the training sample. These samples were deemed to be too small to yield meaningful clusters, and any clusters derived from these small datasets would likely not generalize well to new students. The researcher should decide when or if a variable’s observation is too rare to be of interest. Variables that are rarely observed may be of particular interest for identifying nuances in gameplay, but they will not help to define clusters. When in doubt, one can run clustering algorithms with and without the variable in question to see if its presence appreciably changes the clustering solution. Secondly, if the number of variables selected for the analysis was less than three, then the level was skipped in the analysis. Recall that variables that were activated in 5 or fewer attempts or which were activated in all but 5 or fewer attempts were omitted from analysis because they did not have enough variability in the data to be useful for cluster identification. In some levels, such as the easier initial levels in the game, removing these variables left only one or two variables for analysis. This scenario did not seem to warrant use of clustering algorithms, due to the extremely low dimensionality of the remaining data. Lastly, the script was instructed to proceed to the next level when k was greater than half the number of unique response patterns. For example, consider a level where eight binary variables have been selected for analysis. Theoretically, one might observe 2 8 = 256 possible response patterns, but in practice only 10 unique patterns of responses are observed. This step was added after several solutions showed that the clustering algorithms were overfitting the data and creating clusters for one or two outlying 150 responses, which in turn created a very low DBI value. This choice was made based on the observations in this dataset, but it is not necessary. There will certainly be cases (such as extremely low autonomy GBAs) where a small number of unique response patterns correspond to large, meaningful clusters of gameplay. Step 5: Validating Across Algorithms Xu et al. (2013) recommended validating clustering solutions across two or more algorithms. This is good advice, but only if the algorithms are expected to perform similarly under the data conditions. Kerr and Chung (2012), for example, took the opposite approach, taking data from one game level and showing how a hard clustering algorithm did not yield the same interpretable results as the fuzzy clustering algorithm they used in their analysis. Their approach is good for building evidence supporting their choice of the fuzzy clustering algorithm; however, the argument could be stronger still had they demonstrated that another, similar algorithm produced a comparable clustering solution. Just as the holdout sample helps support the generalizability of a clustering solution to new students, one may want to build an argument that the clustering solution also generalizes to new clustering algorithms, effectively demonstrating that two different approaches have identified the same outcome. The counterargument to Xu at al.’s (2013) recommendation is that it is possible that the best, most interpretable solution can only be found with one, specialized algorithm. This is a possibility, and if that is true, then it puts more pressure on the researcher to test a wide variety of algorithms and defend their choice of the best one for their data. While this is certainly a possibility, this example tutorial follows Xu et al.’s recommendation that two algorithms (both appropriate for the data and yielding 151 comparable cluster structures) should have relatively high agreement in whichever clustering solution is chosen for evaluating the students’ response patterns. For this study, the k-means, SOM, and ROCK algorithms were run for varying values of k for each level of Beanstalk. The selected clustering solution for each game level was the solution with the lowest DBI value which also had a Cohen’s Kappa value of 0.80 or higher with one of the remaining two algorithms’ clustering solutions at a given value of k. This process is illustrated by example below. An example from Beanstalk Level 2-11. Tables 21a and 21b shows an example from Beanstalk level 2-11. Notice that the k-means solution has a slightly better fitting solution (DBI = 0.13) when k = 5, but it does match well to the other two algorithm’s solutions (Kappa for k-means and SOM = 0.47, for k-means and ROCK = 0.69). The next best-fitting solution occurs at k = 4, where the ROCK algorithm produced a solution with DBI = 0.20, and in this case, ROCK and k-means had high agreement: Kappa = 0.95. In this case, the ROCK solution at k = 4 was chosen as the clustering solution for Beanstalk level 2-11 (note that in practice, one may wish to compare both solutions with subject matter experts to ascertain which solution is more interpretable in the context of the GBA’s intended purpose). Notice that the decision process used for selecting a clustering solution in Tables 21a and 21b does not require the researcher to use the same algorithm for all levels of the game. This option may be overlooked by many GBA researchers, but there is no reason to use the same clustering algorithm for all levels of a game. After all, gameplay (and the associated response data) may be very different across different levels or scenarios in a game. Just as one might use different IRT models for different item types, one should be 152 open to using different clustering algorithms for different sections of a GBA. In the example in Table 21, the ROCK solution for k = 4 is chosen and it agrees closely with the corresponding k-means solution, and the SOM solution agrees fairly well too. There are other cases in the data where a different algorithm performed well, agreed with one other algorithm, but the remaining algorithm did not agree with the first two. There was no theoretical basis for prohibiting the selection of whichever algorithm could produce the best-fitting solution at each level of Beanstalk. Table 21a Table 21b Algorithms’ Agreement (Kappa) by k (Level 2-11) Algorithms’ Fit (DBI) by k (Level 2-11) Cohen’s Kappa DBI k k-means to SOM k-means to ROCK SOM to ROCK k k-means SOM ROCK 2 1.00 -0.23 -0.23 2 0.94 0.94 0.77 3 0.37 0.22 -0.31 3 0.81 0.60 0.70 4 0.78 *0.95 0.74 4 0.29 0.53 *0.20 5 0.47 0.69 0.85 5 **0.13 0.40 0.23 *Best-fitting solution with Kappa ≥ 0.80 for two algorithms. **Best-fitting solution overall, but Kappa < 0.80 with other algorithms. Note: Analysis of six gameplay variables, n = 213 response patterns. Consider also the plot of number of clusters (k) and DBI values for each clustering algorithm in level 3-2, shown in Figure 17 below. The clustering solution for this level was the SOM clustering solution at k = 4, but if one looked solely at DBI fit and not cross-algorithm agreement with Cohen’s Kappa, then one might come to a different decision. The lowest DBI value occurs for the ROCK solution at k = 5, but as has been seen in the feasibility study (Chapter 3) and the simulation study (Chapter 5), these low DBI values can be achieved by creating some small clusters and grouping the other cases into a large cluster, which was the case here. One might pick k = 8 because all three algorithms have similar, low DBI values, but although they all fit fairly well for those 153 numbers of clusters, they do not agree with each other (Kappa is less than 0.80). They are each telling different stories about the data, and without expert input, it would be difficult to say whether any of those solutions was useful or interpretable. Figure 17. DBI values by clustering algorithm and number of clusters (k) for level 3-2, which analyzed 12 gameplay variables and had 156 attempts in the training data. The SOM solution for k = 4 was selected as the final clustering solution because of its low DBI and its high Kappa value (0.86) when compared with the k-means solution . In 51 (68%) of the 75 game levels analyzed, the selected clustering solution was also the lowest DBI observed at all levels of k for that level, meaning that the absolute best-fitting solution had a Kappa value of 0.80 or higher with another algorithm’s solution approximately two-thirds of the time. The k-means solution was selected in 44 (59%) of the game levels, the SOM solution was selected in 26 (35%) of the levels, and the ROCK solution was selected in 5 (7%) of the levels. Step 6: Validating Against Holdout Sample and External Criterion Once the number of clusters and best-fitting clustering solution (that agrees well with another model) has been selected for each level in the game, one must examine the 154 validity of those solutions. Ultimately, each clustering solution should be interpretable by subject matter experts; however, validating with the holdout sample builds evidence of the clustering solutions’ generalizability to new students, and validating against a criterion measure builds evidence of the solutions’ accuracy and utility. For the Beanstalk data, each level’s clustering solution from the training sample was used to predict cluster membership for each response record in the holdout sample. The k-means, SOM, and ROCK functions in R each have the ability to predict cluster membership for new data, given a solution from training data (Buchta, & Hahsler, 2014; R Core Team, 2014; Wehrens & Buydens, 2007). The DBI values for the training data’s clusters were compared to the DBI values from the clusters assigned to response data in the holdout samples. Figure 18 shows the DBI comparisons for the training data and the holdout sample for each of the 75 game levels in Beanstalk, with game levels listed in descending order by the training data’s DBI. Notice that for most levels, the DBI value of the holdout sample’s clusters was close to or lower than the DBI from the training sample. Exceptions occur at levels 2-7 and 1-5, where the DBI value for the holdout sample is noticeably higher than the DBI from the training sample. This suggests that these clustering solutions are overfitting the training data. Very low DBI values in the holdout sample are also noteworthy, but not necessarily cause for alarm. When the DBI value in the holdout sample is much lower, it indicates that certain clusters were not as prevalent or did not appear at all in the holdout sample; i.e., there were work processes observed in the training data that were not observed often in the holdout sample. If the clustering solutions have similar fits to both the training and holdout samples, then one can validate the clusters against an external criterion; i.e., a related 155 measure that was not included in the clustering algorithms’ analyses (e.g., Stevens & Casillas, 2006). This can be any meaningful criterion measure, such as scores from other tests, results from cognitive interviews, student engagement observations, or personality measures. Correlations between the criterion measures and the clusters help with interpretation and to validate the clusters as assessment outcomes. A very simple validation criterion is simply the outcome of the game levels (i.e., win/loss, correct/incorrect). Stevens and Casillas (2006) looked at percentage correct as a criterion validation for their SOM analysis of a biology GBA. Kerr and Chung (2012) took a similar approach, but instead, they adjusted the number of clusters until clusters appeared that had only correct results stemming from the work processes. Kerr and Chung’s approach does not count as validation because they specifically calibrated their cluster solution to match the correct/incorrect result outcomes, but the underlying concept is the same as Stevens and Casillas’ approach: if students in a cluster are engaging in the same work process, then the GBA should give those students the same outcome. 156 Figure 18. DBI comparisons between training sample and holdout sample. When the holdout sample’s DBI is much higher than the training sample, it indicates that the clustering solution is overfitting the training data and a new clustering solution should be identified. When the holdout sample’s DBI is much lower than the training sample, it suggests that some work processes observed in the training data were not observed as frequently (or at all) in the holdout sample. 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 4 -5 2 -1 2 5 -1 0 3 -6 6 -8 4 -2 4 -1 3 1 -2 5 -1 5 -1 4 6 -1 0 3 -7 2 -9 2 -1 0 3 -2 5 -2 1 -4 3 -9 1 -9 5 -1 3 6 -1 2 7 -9 4 -8 5 -5 3 -4 2 -1 4 5 -1 2 3 -1 0 6 -1 1 4 -4 1 -7 7 -7 3 -5 3 -1 5 -9 3 -8 5 -8 2 -8 7 -1 0 6 -2 5 -4 4 -1 0 5 -6 5 -7 2 -7 7 -1 1 6 -9 5 -3 5 -1 5 5 -1 1 1 -8 6 -7 6 -5 4 -7 1 -3 2 -5 7 -8 2 -6 1 -1 0 4 -9 6 -4 6 -3 7 -4 7 -6 4 -1 1 1 -1 1 7 -5 2 -1 1 2 -3 2 -4 1 -5 7 -2 1 -6 7 -1 3 -3 D B I Game Level, Ordered by Training Data DBI Training DBI Holdout DBI 157 Descriptive statistics were run for each cluster at each level using all cases in the dataset (both the training sample and the holdout sample). This provided a view of the average response patterns in each cluster as well as the percentage correct attained by each cluster. Across all 75 analyzed Beanstalk levels, 358 clusters were created, with each level having between 2 and 10 clusters. Out of all of the clusters, 297 (83%) reflected strategies that generally led to an incorrect or correct outcome: these clusters’ response patterns yielded correct outcomes more than 85% of the time or less than 15% of the time. In inspecting the clusters where the correct or incorrect outcome happened less consistently, response patterns show students figuring out some of the pieces of the correct solution, then trying different ideas to finish the level, with mixed results. An example from Beanstalk Level 2-9. To understand how one might validate against a criterion like whether the student beat a game level, consider Beanstalk level 2- 9, which used a six cluster SOM solution (DBI = 0.82, Kappa with k-means solution = 1.00). Figure 19 shows the correct outcome rate for each of the six clusters in Level 2-9. Each cluster is given an arbitrary ID, thus labeling the clusters as Clusters 1 – 6. Notice that Clusters 5 and 6 have nearly perfect correct outcome rates—students in these clusters nearly always beat the level (the difference being that everyone in Cluster 6 got the answer from the Beanstalk monster after having an incorrect outcome on their previous attempt). Cluster 4, however, is not so clear-cut. In Cluster 4, 38% of the students using this work process beat the level, but 62% did not. How can the same work process beat the level for some cases, but not others? 158 Figure 19. Correct outcome rates (rates of beating the level) for the six clusters identified in level Beanstalk 2-9. The answer lies in the response variables themselves. In Cluster 4, most students got part of the solution and then experimented to find the correct response. Beanstalk allows students to “trim” their flowers, effectively letting them undo a move and try something different when they see that the scale is not balancing well. Some of these students were able to eventually figure out the solution by observing the feedback onscreen (the scale balancing) and using that to inform their experimentation. There are several ways to solve Beanstalk level 2-9, but all solutions require that the cross product of the weights and locations of the flowers sums to 11. The defining feature of the cases in Cluster 4 for Beanstalk level 2-9 is that nearly every student placed two flowers at the end of the scale, four units from the fulcrum (see Figure 20a below). This flower placement yields a cross product of 4 × 2 = 8. In this level, the student’s final solution could only have four flowers in it, so for students who placed two flowers four units from the fulcrum, the remaining 3 points of the sum of the cross products had to come from the placement of one or two other flowers. Given the placement of the initial two flowers, there are only two solutions to the level that fit those criteria, shown in Figures 20b and 20c. 9% 19% 11% 38% 98% 96% 1 2 3 4 5 6 Cluster ID (Level 2-9) 159 Figure 20a. Cases in Cluster 4 in level 2-9 were characterized by students who placed at least two flowers at the end of the scale, four units from the fulcrum. From this position, there are two ways to balance the scale (Figures 20b and 20c), but only 38% of students with cases in Cluster 4 identified a correct solution from this position. Figure 20b.One possible correct solution for Cluster 4 in level 2-9 for students who have placed two flowers at the end of the scale, as depicted in Figure 20a. Note that the student is provided with four flowers, but this solution only requires three. Figure 20c.The other possible correct solution for Cluster 4 in level 2-9 for students who have placed two flowers at the end of the scale, as depicted in Figure 20a. Note that in this solution, the student is simply mirroring the positions of the ladybugs. Given that only 38% of cases in Cluster 4 (on level 2-9) were correct, one can infer that the remaining 62% of cases placed two flowers on the end of the scale and then deviated into another final placement. Figure 21 shows the distribution of flower placements for Cluster 4 in Beanstalk level 2-9, which could be stacked on top of each other up to a height of four flowers. Students with incorrect solutions continued stacking 160 higher, especially at units 3 and 4, which mirrors solutions in preceding levels. This explains how all these cases could be in the same cluster (a shared response pattern on two flower placements) even though only a minority of cases were correct—the differentiating factor is the placement of the remaining two flowers. Figure 21. Flower placement rates for Cluster 4 in level 2-9. Nearly all responses in this cluster have two flowers placed 4 units away from the fulcrum (depicted on the right as a triangle). The sum of cross products must equal 11 to beat this level. Figures 20b and 20c above show the possible correct placements, given the placement of two flowers on the left of the scale. This phenomenon was observed across many clusters where students did not have direct, concise solutions. Students in these clusters shared the key gameplay attributes that defined the cluster strategies, yet a minority of these students were able to turn that strategy into a correct solution. Whether that was luck or intelligent self-correction is impossible to say without looking at the students’ gameplay across levels using a method like Bayesian Knowledge Tracing (e.g., Baker, Wang, & Andres, 2017a). This already points to a weakness in using the clustering algorithms as the measurement model to evaluate the Beanstalk data, namely that, based on gameplay 4 3 2 1 P er ce n ta g es o f ca se s w it h f lo w er s p la ce d i n t h es e p o si ti o n s. ← Distance from fulcrum. 161 alone, the clusters do not provide information about the work processes that is instructionally valuable; the clustering algorithms do not do a good job at differentiating between students who are self-correcting and getting to the correct answer versus students who are simply lost. This has everything to do with the game design’s feedback. The scale balances in real time as more flowers are placed, and students have the opportunity to undo their moves and change course. In this example, validation with an external criterion (the correct/incorrect outcome of the level) has shown that the clusters align well with performance and do a good job at differentiating between students who efficiently solve the level versus those who cast about a bit. In this sense, the clusters are doing a good job at differentiating between novice and expert gameplay, which was proposed in Chapter 1 as one of the main reasons someone would want to evaluate work process data in the first place. The external criterion, however, also shows that there are degrees of separation within the novices, and some novices are able to use the game feedback better than others. The clusters do not differentiate between these novices well. If one were interested in identifying such novices, one would likely need to include variables for sequence of response and game state to show the order of actions and how the student moved toward a correct response after incorporating the feedback data of the game state. Step 7: Interpreting, Labeling, and Grouping Clusters In a GBA, this step is best executed by subject matter experts who interpret the clusters in the context of the assessment’s domain model. Both Kerr and Chung (2012) and Stevens and Casillas (2006) did this, linking the clusters back to behaviors that they associated with certain skills or misconceptions related to their games’ measured 162 constructs. In a long game with high autonomy, this may be a tedious process, but there is no way around it—a human being must interpret the clusters and assign them meaning in the context of the construct. In this tutorial example, however, such experts are not available. For the purposes of illustration, this section simply focuses on the gameplay itself. The work processes represented by each cluster can be interpreted through the variables that have high or low rates of occurrence in each cluster. For the Beanstalk data, recall that the clustering algorithms analyzed coded variables representing game actions (placement of flowers on the scale), hints, and use of the monster for getting the correct answer. These three groups of variables can serve as a rough basis for interpreting, labeling, and grouping clusters. In this example, clusters were given one of four possible labels: 1. Correct – the student beat the level immediately with few missteps and hints. 2. Correct with Giveaway – the student beat the level, but only with help from the monster, who provides guidance on a solution. 3. Incorrect with Use of Hints – the student often did not beat the level, but made use of the hints in his or her efforts. 4. Incorrect – the student typically did not beat the level and made several missteps, but the student did not make use of the hints or have help from the monster. Using common labels like these for the clusters is useful for analyzing gameplay across levels. One can see if students jump between cluster types or if certain clusters are 163 likely to result in another cluster type later in the game. Stevens and Casillas (2006) called this “strategic transition,” and mapped clusters to four different strategy types. They analyzed students longitudinally to demonstrate that students’ clusters advanced to different strategies as they played the game, and they used these findings as evidence that the GBA itself was helping students learn. Recall from Chapter 2 that one of the benefits of using a game as an assessment is that it encourages students to keep playing, even if they fail at first. Shute and Wang (2016) explained that games should be challenging in order to be fun, and students expect to fail sometimes in a game. Repeated gameplay allows the researcher to map the evolution of a student’s gameplay as a series of failures and successes over time. This series becomes more meaningful and easier to interpret when one creates a unifying method for labeling clusters by strategy type across the entire GBA, even if there are multiple levels and scenarios. To put it another way, each level should not be interpreted in isolation. One should interpret the student’s path from the start of the game to the end. This necessitates cluster labels that can provide meaning about a student’s transition from one cluster to another, both within and between levels. In the Beanstalk analysis, a cluster was labeled as “Correct” if at least 85% of the cases beat the level and there was no use of the monster for guidance. Clusters were labeled “Correct with Giveaway” using the same criteria, except that the attempts utilized the monster to find out the answer (either by exhausting the hints or following a failure on a previous attempt at the level). A cluster was labeled as “Incorrect with Use of Hints” if the correct outcome was less than 85% and at least one hint was used. All remaining clusters were labeled as “Incorrect.” 164 Step 8: Students’ Changes in Cluster Membership Over Time Once the clusters have been labeled and grouped into similar types, the students’ pattern of clusters can be investigated, but this is optional. For the purposes of assessment, one could simply keep a student’s cluster assignments for each level or scenario in the GBA as the assessment outcomes, but many games are not designed this way. Recall in the previous section that game levels are rarely designed to function in isolation. Instead, the levels in the game act as a structured path to move the student through the overall game arc, often moving through different concepts or getting more difficult as the game progresses in order to keep the gameplay challenging and fresh. Beanstalk is no different. In Beanstalk, the game levels progress through Siegler’s (1976) and Siegler & Chen’s (2002) mental models relating to the scale balancing rules, and the game levels get progressively more difficult in terms of both psychometric difficulty and game difficulty as the student progresses. Although this step is optional, it is clear that ignoring the entire pattern of gameplay across the game is leaving some potentially valuable information on the table. For example, the pattern of clusters within and across game levels can show if the student is learning, if the student is adapting strategies, or if the student is gaming the system. If the value of looking at work processes is to differentiate between novices and experts, and if the value of using a game as an assessment is to engage students so that they keep playing and improve their performance (as is argued in Chapters 1 and 2), then the researcher should feel obligated to examine whether students are moving from novice to expert work processes over the course of gameplay. 165 A simple way to investigate this is to plot students’ paths through the clusters across all the levels of the game. Stevens and Casillas (2006) called this a group transition map. Figure 22 plots students’ paths between the four types of clusters as they progress through the game, with game levels increasing along the horizontal axis. The group transition map in Figure 22 shows high level patterns of student gameplay. The vertical axis represents the four types of clusters identified in the previous section: Correct, Correct with Giveaway, Incorrect with Use of Hints, and Incorrect. The horizontal axis represents the game levels of Beanstalk. The following gameplay features are noted with callouts on the figure: A. Within each horizontal band, clusters are ordered in terms of size, with larger clusters appearing at the top of a band and smaller clusters appearing at the bottom of the band. Thus, students’ lines converging in a group on the bottom side of the Correct band represent students who used a rarer, perhaps less-obvious solution. B. In this game level, there are suddenly more lines in the Correct cluster band. More students are catching on and are easily beating this level. C. Early in the game, there are a variety of Incorrect clusters where students are not making use of the hints. These clusters have fewer students as the levels progress, suggesting students are learning to use the hint features or are solving the levels. D. At this point, something changed in the game and the strategies that had been working in earlier levels no longer apply. Many students drop down into the Incorrect with Hints and Incorrect clusters, although a few students are still present in the Correct clusters. 166 E. These are clusters of students using the monster to get the correct answer. There are fewer and fewer students in these clusters as the levels progress, suggesting that many students do not keep relying on the monster to beat the levels. F. This cluster provides a clear look at the effectiveness of the monster’s giveaway. This is a large Correct with Giveaway cluster where the student’s got the correct answer from the monster. The students in this cluster go in three different directions after this level. Some students learn from seeing the solution and move up to the Correct clusters. Some students did not quite learn the solution strategy, and drop back down into the Incorrect clusters. Some students continue on and use the monster again in the next level. The paths shown in Figure 22 are not just useful for investigating aggregate gameplay patterns. They can also be used for individual student reporting. In a simple game like Beanstalk where one has grouped clusters into four standard categories (as is the case above), little needs to be done to support individual student reporting. A researcher can report a single student’s path through the game, tying level concepts to clusters and identifying where students struggled, where they changed processes, and where they ultimately succeeded. 167 A B C D E Figure 22. Group transition map of students’ clusters across 75 levels of Beanstalk. F Clusters are grouped by type and ordered by cluster size for plotting. As levels progress, more students are in the “Correct” cluster groups. There is a change in the game, and many students suddenly stop exhibiting “Correct” cluster work processes. Students use the “Giveaway” to get the correct answer, and then split into three cluster groups in the next game level, depending on whether the hint helped them learn or not. The number of students in “Incorrect” clusters diminishes as gameplay progresses. Diminishing number of students relying on the “Giveaway” from the monster. 168 Tutorial Example Summary The tutorial provides an overview of the types of decisions, validations, analyses, and aggregations that must be made to begin to treat a game as an assessment. Even with a simple game like Beanstalk, there are many choices that must be made about how to select and code variables, choose the algorithms, choose the number of clusters, pick a final clustering solution, and interpret the results. Following an argument-based validity approach, one should be prepared to defend each of these decisions in the context of the inferences and uses of the game as an assessment. The steps demonstrated here will not generalize to every GBA design or clustering algorithm, but they frame the way one must think about using a clustering algorithm as a measurement model in a GBA, and from this perspective, these steps can serve as a rough guide to GBA researchers. The tutorial showed that the first step, recoding and selecting variables, requires human guidance, which should imply that there is a theory or rationale at work from the start. Although the clustering algorithms are designed to uncover hidden patterns in large datasets, an assessment researcher is not absolved of his or her responsibility to link evidence to theories about constructs. The data fed to clustering algorithms will affect the quality and utility of the cluster outputs, so this is a critical step and should be treated as part of the evaluation component of the evidence model. The tutorial also showed the potential value in validating results across different samples of students as well as different algorithms. Cross validating across groups of students helps build a case for the generalizability of the clustering solution to new students, whereas validating across cluster algorithms helps ensure that the result would generalize to other similar, appropriate clustering algorithms. The tutorial demonstrated 169 that using multiple algorithms for validation helps cast a broader net of solutions from which the researcher can choose, and it yields the possibility of switching between algorithms for different levels or sections of a GBA. The tutorial also showed a critical weakness of clustering algorithms as measurement models in GBAs, namely that the clustering aggregation can mask meaningful, subtle differences between students’ work processes. This possibility was underscored in Amershi and Conati’s (2009) cautions against using clustering algorithms to evaluate students. The distribution of responses in Figure 22 in Step 6 above showed that the single cluster consisted of students who shared many aspects of a common response pattern, yet only a few of them were able to determine the correct solution. What made these students different, and were they perceptive or just lucky? The cluster obfuscates these potentially meaningful differences, yielding a coarse and perhaps misleading label to all of the work processes assigned to it. These lessons illustrate why a clustering algorithm may be a poor measurement model for evaluating a student work process: they are simply too dull an instrument to provide defensible evaluation of a student work process in an assessment context. This does not, however, preclude their use for assessments. The inferences made from mapping changes in students’ clusters over time (such as those discussed in Step 8 above) may still provide meaningful insights into a student’s performance across multiple attempts or progressions of gameplay. These collections of clusters representing changes in work processes over time can provide a far richer picture of a student’s gameplay in a GBA and can arguably serve to better differentiate between students with respect to the construct of interest. 170 Chapter 7: Discussion This research sought to explore whether clustering algorithms could serve as measurement models in GBAs, and if so, the practical steps researchers should take to support the algorithms’ use and to validate their results. Games are appealing as educational instruments because they are engaging, and their design encourages repeated gameplay that helps educators or researcher observe changes in work processes over time. The observations of work process data are useful for differentiating between novices and experts or to illustrate a student’s path from novice to expert. By making repeated processes into a game, the student may be more engaged and more willing to participate for a longer period of time. Using a game as an assessment, however, presents challenges, because the assessment now includes construct-irrelevant relatedness elements that might impact performance without necessarily representing anything about the construct of interest. Performance on GBAs may also be dependent on both psychometric difficulty and game difficulty, and there may be no way to parse these concepts out when evaluating student performance. Nevertheless, GBAs may be valuable educational tools when used in low-stakes scenarios, and careful game design, evidence identification, and selection of measurement models can yield informative and valid GBAs. Summary of Results The findings of the simulation study underscore the need for selecting appropriate clustering algorithms for the GBA data and cross validating over multiple algorithms and datasets. The use of clustering algorithms as an assessment measurement model can be risky if the GBA inference occurs at the student level. For example, one might assign 171 students to a cluster based on the algorithm’s analysis of the work process response data, but the inference may not be valid for outliers. If student-level reporting is needed, researchers may be able to provide a more valid and meaningful evaluation of a student by considering the student’s clusters over multiple instances of gameplay rather than placing too much emphasis on the cluster membership of a single instance of a work process. When considering the individual algorithms, the k-means algorithm performed unexpectedly well. In Chapter 2, the limitations of the k-means algorithm were discussed, and based on Berkhin’s (2006) and Robinson’s (2017) warnings, it was thought that k- means would perform poorly when applied to the sparse, binary datasets in the simulation and the tutorial. This was not the case. The k-means algorithm had been included principally with the intention of showing how this popular and easy to access algorithm could yield misleading results, but instead, the k-means algorithm often outperformed the more complicated SOM algorithm and the binary-specific ROCK algorithm. In the Beanstalk tutorial, 59% of the clustering solutions selected were generated by the k- means algorithm. In the simulation study, k-means was the most versatile algorithm, consistently recovering the true clusters in the most cells and identifying the correct number of clusters (using DBI) more often than the SOM and ROCK algorithms. Why did the k-means algorithm perform so well when Berkhin (2006) and Robinson (2017) predicted it would perform poorly with binary data in non-spherical, differently-sized clusters? The simple explanation is the k-means software’s design of trying multiple starting center choices, creating multiple solutions from them, and then returning the solution that minimized the within cluster sum-of-squares (R Core Team, 172 2014). To put it another way, k-means has the advantage over algorithms like SOM because it is actually making multiple attempts and cherry-picking the best one. It may also be that the conditions under which Robinson (2017) warns of k-means failure must also be vastly more extreme than what was created in this simulation study. Nevertheless, k-means’ use and comparisons of multiple starting center choices provide two pieces of guidance: 1. When clustering algorithms’ solutions are heavily based on the starting centers (e.g., k-means, SOM), researchers should try multiple starting center choices and systematically compare the fit of the competing solutions. 2. Extending that idea, when a clustering solution is heavily dependent on parameters for the algorithm (e.g., SOM’s neighborhood radius and learning rate, ROCK’s theta value for neighbor similarity), researchers should also systematically experiment with different parameter values and compare the fit of competing solutions. Despite Robinson’s (2017) warnings against the k-means algorithm for binary data, and given the results of the simulation study and that the k-means algorithm is simple to access, use, and understand, researchers should consider using it (or one of its derivations) as one of their clustering algorithms when analyzing binary GBA work process data. The SOM algorithm performed well in many cases, but was susceptible to creating outlier solutions that did not recover true clusters well in the simulation. This was likely caused by a poor selection in starting centers, and the SOM algorithm performance could be improved by trying multiple starting centers and selecting the best- 173 fitting solution, as is the strategy used by the k-means algorithm in R (R Core Team, 2014). Researchers using SOM may also want to consider experimenting with different values for the SOM parameters, such as learning rate, neighborhood radius, grid dimensions, toroidal structure of the grid, and the number of data presentation iterations. The complexity of this algorithm, however, may make it difficult to use for some researchers, and when the GBA data scenario is similar to those in the simulation study, researchers may find that fine-tuning an SOM model does not offer noticeable benefits over the simple-to-use k-means algorithm,. Nevertheless, the SOM algorithm may be better-suited for more complex data scenarios and can be a useful second algorithm for validating results from the k-means algorithm. Despite being designed for sparse, binary datasets, the ROCK algorithm is probably the least useful of the three algorithms studied for general GBA analyses. Only 7% of the solutions used in the tutorial were generated by the ROCK algorithm. The ROCK algorithm did, however, show excellent true cluster recovery in the simulation data when response probability was 0.9, and the ROCK algorithm also performed the best when there were a large number of clusters (k = 35). The ROCK algorithm may have also been at a disadvantage, since it is well-suited for identifying outliers and isolating them into many small clusters. When forced to converge on a smaller number of clusters, the ROCK algorithm must increase its threshold for similarity, which can result in many cases being assigned to a handful of clusters. There are certainly GBA scenarios where the ROCK algorithm may perform very well; however, researchers should be familiar with the limitations of ROCK and apply it only when it seems suitable for their data conditions or when identifying outliers is of interest. 174 Study Limitations The study suffers from several limitations. First, it assumes that each cluster is defined by a high probability of a response in 20 variables and that that probability of a response is the same level for all clusters. In real world GBA data, clusters will likely have more variability, with clusters having different numbers of relevant variables and varying levels of response probability for those relevant variables. While this study’s simulated data may not reflect that kind of diversity within a dataset, the controlled cluster structure has the benefit of allowing for investigation into the impact the other conditions have on the accuracy of the clustering algorithms’ solutions. Second, the fact that fit and interpretability are not synonymous emphasizes the limitation of using DBI as an evaluation of the clustering solutions. Just because a clustering algorithm can return the best-fitting solution does not mean that it will always be the most useful algorithm. While this study focused on the best-fitting solution, recall that k-means and SOM can arrive at different solutions based on the random cases that are chosen for the starting centers. If one were to collect multiple solutions from these two algorithms, one might observe groups of alternative solutions that may not be the globally best-fitting but which still reflect a local minimum in the DBI measures across solutions. These solutions might represent alternative interpretations which may or may not be useful for application. The study also only represents data that have been coded with structured logging (Kerr & Chung, 2012); i.e., the simulated data and the tutorial data do not represent every mouse click and key stroke a student made. Binary datasets with these numbers of dimensions are far more likely to reflect logging in which a subject matter expert has 175 made a decision about which actions should be recorded, perhaps even combining variables into a new variable before the analysis. This is fine for representing many simple GBAs, but it may not be a realistic depiction of the vast datasets that could be generated in complex, high autonomy GBA environments like flight simulators. Similarly, this study is limited to only binary data, but some GBA researchers may be working with other values that better represent sequences or magnitudes of actions or game state. Another limitation is the dependence between the number of clusters and the total number of variables in a dataset. With more clusters, more variables must be added to create new relevant variables for each cluster, but this means that the datasets with more clusters have more variables than those with fewer clusters. Kriegel et al. (2009) discuss the difficulties of identifying clusters when there are a lot of variables, although they were specifically discussing datasets with thousands or hundreds of thousands of variables (the simulation datasets never have more than 1,000 variables). Nevertheless, it is difficult to ascertain whether differences in algorithms’ performance are due to changes in dimensionality (total number of variables) or differences in the number of clusters and their overlap. It is anticipated that the former has been the clear driver, but the differences in the number of variables may still limit interpretations. Despite this limitation, there are practical benefits to the decision to let the total number of variables vary: clusters in the study all have the same number of relevant variables, which supports comparisons, and smaller datasets were not padded with extra irrelevant variables, which would have hindered the identification of clusters. 176 Although the study manipulates several conditions of the simulated data, one must also consider that there are other aspects of the analysis that could be manipulated, particularly in the algorithms themselves. The SOM algorithm might return different results based on the starting values selected for the cluster centers, as well as the learning rate and neighborhood size. A researcher implementing the SOM algorithm might be well-advised to manipulate some of these variables to see how they impact the clustering solution. Additionally, the study only looked at three axis parallel soft-projected clustering algorithms. There are many other algorithms that could be compared, as well as methods that use probabilistic assignments to clusters, such as fuzzy clustering and latent class analysis. Finally, the use of DBI as a measure of fit appears to have some drawbacks. The DBI values can be misleadingly low if the clustering algorithm creates a small cluster around a few outliers, and when selecting the number of clusters, DBI often results in an overestimation of the number of clusters that should be selected. Baker et al. (2017b) recommend using AIC or BIC for selecting the number of clusters, and algorithms like those developed by Pelleg and Moore (2000) and Wang et al. (2005) may be good options, especially if the researcher does not have a subject matter expert’s help in interpreting clusters. Validating across several appropriate clustering algorithms and cross validating with other samples of data can also help avoid selecting a clustering solution with too many clusters that over-fits the training data. For the purposes of assessment, interpretability is the ultimate goal, and researchers should be able to defend the theoretical implications of their final clustering solution, without relying solely on measures of fit. 177 Implications This study provides operational and analytical guidance for researchers pursuing empirical approaches to evaluating response patterns. While the context is primarily the use of GBAs, these findings also inform similar measurement efforts in other assessment types like simulations, automatically scored essays, or intelligent tutoring systems. Like GBAs, these complex assessment designs can yield high-dimensional response data that may need to be evaluated and categorized with an exploratory approach if there is high autonomy in these instruments. If meaningful interpretations are theorized to exist in subspaces of the response variables, clustering algorithms may be useful tools for these other instruments, just as they can be for GBAs (Kriegel et al., 2009). When using an exploratory empirical approach—such as a clustering algorithm— to evaluate students’ knowledge, skills, or abilities, it is recommended that the clusters be validated with several algorithms (Xu et al., 2013). If reasonable agreement does not exist between the clustering solutions, then one cannot be confident in individual classifications of students, and these methods may only serve to describe aggregate patterns. This study demonstrates how researchers should compare clustering solutions to validate their results, and it does so using three clustering algorithms (SOM, k-means, and ROCK) that are well-suited to the sparse, binary datasets common to GBAs. At a minimum, this study exposes the relative strengths or weakness of these three algorithms for clustering sparse, binary datasets that would be typical for a GBA. The varying levels of the simulated data also point to cluster structures that are dependably identified between these three clustering algorithms. These findings will help inform GBA researchers’ decisions about when to use these clustering algorithms, especially 178 with regard to their sensitivity to varying sized clusters and clusters that overlap due to the sparseness of the data—common scenarios in GBA data. These findings will also help inform GBA design, perhaps guiding the design of adaptive rules to help focus response patterns into well-separated clusters. From a practical perspective, these inquiries have not necessarily cleared a path for GBA use or the use of clustering algorithms in GBAs. The use of a game as an assessment creates many opportunities by encouraging repeated gameplay and adapting difficulty to student abilities, but the introduction of construct-irrelevant relatedness elements and the potential tangling of psychometric difficulty versus game difficulty almost certainly preclude a game from any high stakes application. There is also the question of whether a game is appropriate for a high stakes decision, or whether high stakes assessments should be constrained to only serious, sterile assessment instruments. The practical aspect of clustering algorithms and GBAs may also be scrutinized simply on the basis of the complexity of using a clustering algorithm. Although some clustering algorithms like k-means are easy to implement, the steps for using a clustering algorithm as a measurement model (Chapter 6) are far more complex—perhaps more than an assessment researcher would be willing to take on in practice. It may be that clustering algorithms should be relegated to a purely investigative, exploratory role in GBA design, but that the measurement model itself should retain a principled approach, which was method used by Corrigan et al. (2014). In short, even if one can build a strong validity case for a clustering solution for evaluating student work processes, the complexity required may simply be too daunting. 179 One can also argue that clustering algorithms are able to serve other roles as evaluation components in an evidence model, rather than measurement models themselves. Clustering solutions are valuable tools for evaluating the evidence from the work process, and the cluster IDs can feed into a non-clustering measurement model. The group transition map in Chapter 6 and the discussion of evaluating students’ changes in cluster membership over time sets the stage for systematically evaluating a student’s string of clusters as its own work process. Based on the results of the simulation study (Chapter 5), future research needs to focus on fit-based methods for selecting the number of clusters. DBI appears to trend toward overestimating the number of clusters needed. Comparisons of DBI to AIC and BIC for selecting the number of clusters would be invaluable, and practitioners will benefit from examples of how to implement the findings. Just as this research recommended that researchers choose clustering algorithms that are appropriate for their data, researchers will find that they also need to be able to defend their choice in fit measures, especially if the fit measure is being used to pick the number of clusters or to compare competing cluster solutions from different algorithms. It is also important for future research to investigate other conditions for the simulation study, such as clusters with different numbers of relevant variables or datasets with dozens of more clusters. This is in part because the direction of the research must lay the groundwork for vastly more complex games and technology. The GBAs discussed here are relatively simple, but educational game designers continue to develop more complex and (arguably) interesting games. Harnessing such games for assessment requires measurement research to stay abreast of game technology. If there is a desire to 180 make meaningful evaluations of students based on gameplay, researchers must prepare for measurement challenges in large games with many types of work processes and games that do not end (e.g., virtual worlds that students join and participate in as they wish). For those who seek to interpret gameplay as an assessment inference, the expansion of game capabilities will increase the complexity around foundational assessment practices like identifying the boundaries of domains and quantifying reliability. These obstacles are not insurmountable, and the most valuable contributions will be made by researchers sharing their rationales, methodologies, and toolsets with each other. The inputs to the simulation study (Chapter 5) and many of the recommendations in the tutorial (Chapter 6) were culled from researchers who shared their process behind creating a clustering algorithm measurement model for an educational game, in effect creating a GBA. By continuing this kind of research and experience-sharing with more complex games, the lessons learned will help to inform future tutorials and best practices. Documentation that discusses both the GBA design and the development of a measurement model will be the most important, as ECD shows that these processes are dependent on each other for creating validity arguments. Future of GBA The use of games as assessment instruments and of machine learning techniques like clustering algorithms as assessment measurement models is exciting and represents a new frontier of educational assessment; however, one must not lose sight of assessment researchers’ responsibilities to support validity and reliability of inferences. Machine learning in non-assessment applications (e.g., targeted online advertising) may have more 181 flexibility to be wrong or to improve slowly over time. For example, if using machine learning to recommend products to people browsing a website, the data scientist only needs to get a handful of suggestions right for that application to be useful, and there is little risk should the customer get an irrelevant recommendation. This is not the case in assessments. Researchers need to be able to make accurate inferences in the majority cases, and widespread inaccuracy in a machine learning output in a GBA precludes its use for assessment. Assessment researchers who are excited and tantalized about the power and flexibility of clustering algorithms will do well to balance that enthusiasm with a sobering reminder about assessment developers’ responsibilities to their stakeholders to support valid inferences and uses of the results. Similarly, assessment researchers using machine learning algorithms do not have the same freedom of data use as most data scientists. In many non-assessment applications, prediction or classification accuracy is prized, and any data that improve prediction and classification can be included in the machine learning model. Conversely, assessment researchers still need to be able to build a validity argument, which will include (among other things) a logical rationale for why certain data were included in the evidence accumulation for the measurement model. In short, assessment researchers cannot throw in variables to improve the model without also being able to explain how they link back to the domain of behavior associated with the construct. Researchers must also consider the applications of GBA. For example, could GBA ever be used for a high stakes application? The simulation results in Chapter 5 showed that under some favorable conditions, clustering algorithms are able to recover true clusters with very high accuracy, but one must wonder if such results could be 182 expected in the field. Both Kerr and Chung (2012) and Stevens and Casillas (2006) reported identifying clusters that could not be easily interpreted in the context of their constructs, meaning one might reliably identify a cluster, but have no idea what it means. The relatedness elements designed to captivate students may actually turn off some students and cause them to disengage, or students may engage in off-task behavior like free-roaming the game. Admittedly, such behavior exists in traditional assessments, like when a student fills in an answer sheet with a pattern or draws a picture on an essay question, but it seems reasonable to expect that the lack of seriousness and potential higher autonomy in a game may invite a broader range of responses that may not interpretable in the context of the construct of interest. The potential for uninterpretable or irrelevant gameplay may be enough to preclude GBA from high stakes purposes, but one must also question whether stakeholders (e.g., students, parents, educators, employers) would accept the results of a game for a high stakes decision. If a high stakes outcome is dependent on a game, how fun can that game really be? Most high stakes assessment decisions (e.g., educational interventions, grade promotion, graduation, job hiring, licensure to practice) are not games—people’s lives are dependent on these decisions—and it may be inappropriate to make a game out of things that should be taken seriously. If one can accept that GBAs may be relegated to the low stakes, formative assessment world for the time being, what can their future be? This paper has argued that the power of GBAs is in their ability to capture how students change gameplay over time, which in turn may suggest that lengthier games may be the direction researchers should take. This goes against the trend of traditional assessments, which have sought to be 183 shorter and leave a smaller footprint in the school day, such as the use computer-adaptive tests. One could, however, imagine shifting to a GBA which is enormous and ongoing—a game environment that students might interact with for years, across grade levels. Many entertainment video games already use this strategy, creating large game universes that draw players in for a long time. Entertainment games also capitalize on the engagement of their players, releasing new expansions or modules that build off the existing game and fuel interest. One can imagine the same thing occurring with GBA in the education space. Students might interact with a system of games throughout their education, and the games might be skinned or tailored to students’ individual preferences without losing the key mechanics, rules, and connections that relate to the construct of interest. As unexpected clusters of gameplay are discovered by researchers, these might inform the development of new game expansions that either encourage and develop that behavior or limit and correct it, as deemed appropriate. A multiyear, adaptable, and expandable GBA—is such a thing possible? The concept exists in science fiction novels, but the practicality and utility of such an instrument may be outside the scope of what can or should be expected from an education system; however, if the GBA systems were engaging and adaptable enough, they could complement education efforts, creating a symbiotic relationship and data exchange between the initiatives within the classrooms and the play that occurs outside. For the time being, GBAs, like traditional assessments, will likely continue to be time-bounded, single-purpose instruments. GBA researchers should maintain a foundation of validity in their work, linking aspects of game design and gameplay through argument structures back to inferences about the constructs of interest. At the 184 same time, GBA researchers should be encouraged to experiment with broader ranges of task models and measurement models through GBAs, which in turn will necessitate broader collaboration with experts from a variety of fields and the use of principled development approaches like ECD to facilitate a common understanding of the instrument. Games will continue to be a presence in formal and informal education, and to realize the full potential of the data created by educational gameplay, assessment researchers need to be prepared to be as creative as the games’ designers. 185 Appendix A: R Scripts for Full Study #TITLE: An Evaluation of Clustering Algorithms for Modeling Game-Based Assessment Work #Processes #AUTHOR: W. Austin Fossey, EDMS PhD Candidate, University of Maryland #DATE: July 19, 2017 #DESCRIPTION: R script for simulating game-based assessment data and evaluating true #cluster recovery of k-means, SOM, #and ROCK clustering algorithms. #load libraries library("poLCA") library("kohonen") library("psych") library("clusterCrit") library("cba") #set working directory setwd("C:/Users/afossey/Documents/disco") #define cody function, which recodes cluster assignments to allow for Kappa calculations #between two lists of clusters #of the same data. cody=function(x,y){ tab=as.data.frame.matrix(table(x,y)) tab=tab[do.call(order, c(tab,decreasing=T)),] for(i in 1:nrow(tab)){ subtab=tab[c(i:nrow(tab)),] subtab=subtab[order(subtab[,i], decreasing=T),] tab[c(i:nrow(tab)),]=subtab } while(nrow(tab)=0.05){ rocktest=invisible(rockCluster(as.matrix(simdat$dat), n=k, theta=theta)) #authors recommend binary distance if(length(rocktest$size)==k){ break } theta=theta-.05 } rclust=as.integer(rocktest$cl) output[count,"dbir"]=intCriteria(as.matrix(simdat$dat), as.integer(rocktest$cl),"Davies_Bouldin")$davies_bouldin output[count,"theta"]=theta #####Check True Cluster Recovery##### #kappa values output[count,"kappak"]=try(cohen.kappa(cody(kclust,tclust)) $kappa) output[count,"kappas"]=try(cohen.kappa(cody(sclust,tclust)) $kappa) output[count,"kappar"]=try(cohen.kappa(cody(rclust,tclust)) $kappa) #percentage match values output[count,"perk"]=try(sum(diag(cody(kclust,tclust)))/sum (cody(kclust,tclust))) output[count,"pers"]=try(sum(diag(cody(sclust,tclust)))/sum (cody(sclust,tclust))) output[count,"perr"]=try(sum(diag(cody(rclust,tclust)))/sum (cody(rclust,tclust))) #save dataset and confusion matrices, if desired #write.csv(simdat$dat, paste("dat",count,"csv",sep=".") #write.csv(probs, paste("probs",count,"csv",sep=".") #write.csv(cody(kclust,tclust), #paste("ktab",count,"csv",sep=".") #write.csv(cody(sclust,tclust), #paste("stab",count,"csv",sep=".") #write.csv(cody(rclust,tclust), #paste("rtab",count,"csv",sep=".") count=count+1 } #write cell output to CSV file and progress to next cell write.csv(output[which(output$cell==cell),],paste("output",cell, "csv",sep=".")) cell=cell+1 } } 188 } #write full output to CSV file write.csv(output,"output.full.csv") 189 #TITLE: An Evaluation of Clustering Algorithms for Modeling Game-Based Assessment Work #Processes #AUTHOR: W. Austin Fossey, EDMS PhD Candidate, University of Maryland #DATE: July 19, 2017 #DESCRIPTION: R script for simulating game-based assessment data and evaluating true #cluster recovery of k-means, SOM, #and ROCK clustering algorithms. #load libraries library("poLCA") library("kohonen") library("psych") library("clusterCrit") library("cba") #set working directory setwd("C:/Users/afossey/Documents/disco2") set.seed(52309) #set seed for random numbers (used to select starting centers in k- means and SOM) nrel=20 #number of relevant variables defining each cluster ndat=400 #number of datasets per cell iter=1000 #number of iterations for k-means and SOM #simulation condtions simk=c(4,10,20,50) #number of clusters simn=matrix(c( #matrix of clusters' sizes and variance of cluster sizes 50,9, 100,9, 50,100, 100,100), ncol=2,byrow=T) simp=c(.3,.6,.9) #probability of observing a response in the 20 relevant variables #create output data frame to record results output=data.frame(cell=integer(),k=integer(),prob=double(),size=character(),dataset= integer(),nvars=integer(),n=integer(),rmsd=numeric(),mink=integer(),dbik=numeric(),mins=i nteger(),dbis=numeric(),minr=integer(),dbir=numeric(),stringsAsFactors=F) #set count variables for tracking count=1 cell=1 for(a in 1:length(simk)){ k=simk[a] #set number of clusters k nvars=nrel*k #number of variables (columns) in data set equals number of #relevant variables times k for(b in 1:nrow(simn)){ npar=simn[b,] #set parameters for generating cluster sizes for(c in 1:length(simp)){ resp.prob=simp[c] #set response probability for(d in 1:ndat){ #simulate and analyze 400 datasets #for the cell #add identification data to output output[count,1:6]=c(cell,k,resp.prob,paste(npar[1],npar[2], sep="-"),d,nvars) #create list of response probabilities for each variable #and each cluster probs=list() for(i in 1:nvars){ #default to pr(0)=.95, pr(1)=.05 probs[[i]]=matrix(rep(c(.95,.05),k), ncol=2, byrow=T) } #randomly select variables that will have higher response 190 #probability for each cluster for(i in 1:k){ relevant=sample(1:length(probs),nrel,replace=F) for(j in 1:length(relevant)){ probs[relevant[j]][[1]][i,]=c(1- resp.prob,resp.prob) } } #create cluster sizes using npar values clust.size=round(rnorm(k,npar[1],sqrt(npar[2])),0) #simulate data using LCA package, recode to 0/1 binary for #ROCK, record true cluster assignments simdat=poLCA.simdata(N=sum(clust.size),probs,P= clust.size/sum(clust.size)) simdat$dat=simdat$dat-1 tclust=simdat$trueclass output[count,"n"]=nrow(simdat$dat) #create matrix of probabilities for each item on each #cluster. these are the expected cluster centers (columns #are clusters, rows are variables) probs=do.call("rbind",lapply(probs, function(x) x[,2])) #calculate RMSD for the expected cluster centers to #represent their degree of overlap; i.e., higher RMSD #represents less overlap in the cluster centers--more #distinct response types output[count,"rmsd"]=sqrt(sum((as.matrix(dist(t(cbind( probs,rowMeans(probs))))) [k+1,1:(k)])^2)/k) #create comp dataframe to compare DBI values for clustering #solutions at different values of k comp=data.frame(dbik=numeric(),dbis=numeric(),dbir=numeric (),stringsAsFactors=F) for(i in -3:3){ #run through seven different #possible k values #####Run K-Means##### kmeantest=NULL attempt=1 while(is.null(kmeantest) && attempt<=10){ try({kmeantest=invisible(kmeans(simdat$dat, centers=(k+i), nstart=iter))}) attempt=attempt+1 } kclust=kmeantest$cluster comp[i+4,"dbik"]=intCriteria(as.matrix(simdat$dat), kmeantest$cluster,"Davies_Bouldin")$davies_bouldin #####Run SOM##### somtest=invisible(som(as.matrix(simdat$dat), grid = somgrid((k+i), 1,"hexagonal"),rlen=iter, alpha=c(.05,.01))) sclust=somtest$unit.classif comp[i+4,"dbis"]=intCriteria(as.matrix(simdat$dat), as.integer(somtest$unit.classif),"Davies_Bouldin") $davies_bouldin 191 #####Run ROCK##### theta=.95 while(theta>=0.05){ rocktest=invisible(rockCluster(as.matrix( simdat$dat), n=(k+i),theta=theta)) #authors recommend binary distance if(length(rocktest$size)==(k+i)){ break } theta=theta-.05 } rclust=as.integer(rocktest$cl) comp[i+4,"dbir"]=intCriteria(as.matrix(simdat$dat), as.integer(rocktest$cl),"Davies_Bouldin") $davies_bouldin } #find value of k that yielded lowest DBI for each #algorithm, record in output comp[comp==0]=NA output[count,c("dbik","dbis","dbir")]=apply(comp,2, function(x) min(x, na.rm=T)) output[count,c("mink","mins","minr")]=apply(comp,2, which.min)-4+k count=count+1 } #write cell output to CSV file and progress to next cell write.csv(output[which(output$cell==cell),],paste("output2",cell, "csv",sep=".")) cell=cell+1 } } } #write full output to CSV file write.csv(output,"output2a.full.csv") 192 #TITLE: An Evaluation of Clustering Algorithms for Modeling Game-Based Assessment Work #Processes #AUTHOR: W. Austin Fossey, EDMS PhD Candidate, University of Maryland #DATE: August 16, 2017 #DESCRIPTION: R script for analyzing work process response data in the Beanstalk math #game using k-means, SOM, and #ROCK clustering algorithms. #set working directory setwd("E:/Dissertation/Dissertation/Tutorial1") #load libraries library("poLCA") library("kohonen") library("psych") library("clusterCrit") library("cba") library("flexclust") #define cody function, which recodes cluster assignments to allow for Kappa calculations #between two lists of clusters #of the same data. cody=function(x,y){ tab=as.data.frame.matrix(table(x,y)) tab=tab[do.call(order, c(tab,decreasing=T)),] for(i in 1:nrow(tab)){ subtab=tab[c(i:nrow(tab)),] subtab=subtab[order(subtab[,i], decreasing=T),] tab[c(i:nrow(tab)),]=subtab } while(nrow(tab)=5&colSums(resp,na.rm=T)<(nrow(resp)-5))] #if only one or two variables are left for analysis, move to the next game level if(class(resp)=="integer") next if(ncol(resp)<3) next #create list to store models klist=list() #run clustering algorithms for k=2-10 clusters for(k in 2:10){ #record level, number of clusters, number of variables analyzed, and #number of cases output[count,1:4]=c(levels(dat$item)[i],k,ncol(resp),nrow(resp)) #set list for storing outputs from individual models model=list() model[[1]]=k #if number of clusters is greater than half the sample size, move to the #next game level if(k>nrow(unique(resp))/2) break #####Run K-Means##### kmeantest=NULL attempt=1 while(is.null(kmeantest) && attempt<=10){ try({kmeantest=invisible(kmeans(resp,centers=(k), nstart=iter))}) attempt=attempt+1 } kclust=kmeantest$cluster model[[2]]=c(kclust) model[[5]]=kmeantest #####Run SOM##### somcheck=0 while(somcheck!=k){ somtest=invisible(som(as.matrix(resp), grid = somgrid(k, 1, "hexagonal"),rlen=iter,alpha=c(.05,.01))) sclust=somtest$unit.classif somcheck=length(unique(sclust)) } model[[3]]=sclust model[[6]]=somtest #####Run ROCK##### theta=.95 while(theta>=0.05){ rocktest=invisible(rockCluster(as.matrix(resp), n=k, theta=theta)) #authors recommend binary distance if(length(rocktest$size)==(k+i)){ 194 break } theta=theta-.05 } rclust=as.integer(rocktest$cl) model[[4]]=rclust model[[7]]=rocktest #transform response data to nummeric class resp[]=lapply(resp,as.numeric) #record DBI output[count,"dbik"]=try(intCriteria(as.matrix(resp),kmeantest$cluster, "Davies_Bouldin")$davies_bouldin) output[count,"dbis"]=try(intCriteria(as.matrix(resp),as.integer( somtest$unit.classif),"Davies_Bouldin")$davies_bouldin) output[count,"dbir"]=try(intCriteria(as.matrix(resp),as.integer( rocktest$cl),"Davies_Bouldin")$davies_bouldin) #record Kappa between algorithms' solutions output[count,"kappa.ks"]=try(cohen.kappa(cody(kclust,sclust))$kappa) output[count,"kappa.kr"]=try(cohen.kappa(cody(kclust,rclust))$kappa) output[count,"kappa.sr"]=try(cohen.kappa(cody(sclust,rclust))$kappa) #move to next output row and save the model lists count=count+1 klist[[k]]=model } #plot and save charts of DBI values by number of clusters (k) png(filename=paste("DBI",levels(dat$item)[i],"png",sep=".")) par(family="serif",font=1,col="black") matplot(output[which(output$item==levels(dat$item)[i]),c("dbik","dbis","dbir")], type = c("b"),pch=1,col ="black",xlim=c(1,9),ylim=c(0,2),xlab="#Clusters", ylab="DBI",xaxt="n") legend("topright", legend=c("k-means","SOM","ROCK"), lty=1:3, cex=0.8) axis(1, at=1:9, labels=c(2:10)) dev.off() #save list of models for all levels of k for the game level and label by level #name itemout[[i]]=klist names(itemout)[i]=levels(dat$item)[i] } #write output to CSV file write.csv(output, "model comparison.csv") # At this point, the researcher analyzed the output in MS Excel to identify clustering # solutions with # the lowest DBI and where the solution had a Kappa value of at least 0.80 with another # algorithm's # solution. From this analysis, a Model Selection file was created, identifying the # number of clusters # to be used at each game level and the algorithm solution that should be used for that # level. This # file is read in as "orders" below. #read in model selection orders orders=read.csv("model selection.csv",header=T, stringsAsFactors=F) #create column in the response data to record the cluster ID associated with the work #process dat$clust=NA #create blank vector to record descriptive statistics for mean endorsement rate of the #gameplay variables, by cluster desc=c() 195 #use the orders to assign clusters to all responses at each game level for(i in 1:nrow(orders)){ #identify the algorithm to use, the game level to which it should be applied, and #the number of clusters alg=orders[i,"alg"] item.index=orders[i,"item"] k=orders[i,"k"] #assign cluster IDs to the training sample for the game level resp.m=names(itemout[[item.index]][[k]][[2]]) dat[resp.m,"clust"]=itemout[[item.index]][[k]][[alg+1]] #assign cluster IDs to the holdout sample, using the selected algorithm and the #training data to predict the #holdout's cluster IDs switch(alg, "1"={dat[which(dat$item==orders[i,"item"]&dat$holdout==1),"clust"]= predict(as.kcca(itemout[[item.index]][[k]][[alg+4]], data=dat[which(dat$item==orders[i,"item"]&dat$holdout==0),colnames( itemout[[item.index]] [[k]][[alg+4]]$centers)]), newdata=dat[which(dat$item==orders[i,"item"]&dat$holdout==1),colnam es(itemout[[item.index]] [[k]][[alg+4]]$centers)])}, "2"={dat[which(dat$item==orders[i,"item"]&dat$holdout==1),"clust"]= map(itemout[[item.index]][[k]][[alg+4]], newdata=as.matrix(dat[which(dat$item==orders[i,"item"]&dat$holdout= =1), colnames(itemout[[item.index]][[k]][[alg+4]]$data)]))$unit.classif} , "3"={dat[which(dat$item==orders[i,"item"]&dat$holdout==1),"clust"]= as.integer(predict(itemout[[item.index]][[k]][[alg+4]], x=as.matrix(dat[which(dat$item==orders[i,"item"]&dat$holdout==1), colnames(itemout[[item.index]][[k]][[alg+4]]$x)]))$cl)} ) #calculate DBI for holdout cluster assignments, add to "orders" dataframe subs=which(dat$item==orders[i,"item"]&dat$holdout==0) resp=dat[subs,gameplay] resp=resp[,(colSums(resp, na.rm=T)>=5&colSums(resp,na.rm=T)<(nrow(resp)-5))] subplay=names(resp) orders[i,"dbih"]=intCriteria(data.matrix(sapply(dat[which(dat$item==orders[i, "item"]&dat$holdout==1),subplay], FUN=as.numeric)), as.integer(dat[which(dat$item==orders[i,"item"]&dat$holdout==1),"clust"]), "Davies_Bouldin")$davies_bouldin #calculate descriptives for gameplay variables by item and cluster for(j in 1:k){ desc=rbind(desc,c(item.index,j,nrow(dat[which(dat$item== item.index&dat$clust==j),]), round(colMeans(dat[which(dat$item==item.index&dat$clust==j), c("CORRECT",gameplay)]),2))) } } #write CSV files for the output data and the descriptive statistics for each game level colnames(desc)[1:3]=c("item","k","n") desc[,"item"]=paste("'",desc[,"item"],"'",sep="") write.csv(desc, "cluster descriptives.csv") write.csv(orders, "orders.csv") 196 References Almond, R. G., Kim, Y. J., Velasquez, G., & Shute, V. J. (2014). How task features impact evidence from assessments embedded in simulations and games. Measurement, 12, 1-33. Amershi, S., & Conati, C. (2009). Combining unsupervised and supervised classification to build user models for exploratory learning environments. Journal of Educational Data Mining, 1(1), 1-54. Baker, R., Wang, E., & Andres, M. (2017a). Knowledge inference: Bayesian knowledge tracing [PDF Document]. Available from: https://courses.edx.org/courses/course- v1:PennX+BDE1x+1T2017. Baker, R., Wang, E., & Andres, M. (2017b). Clustering: Validation and selection of k [PDF Document]. Available from: https://courses.edx.org/courses/course- v1:PennX+BDE1x+1T2017. Baker, R., Wang, E., & Andres, M. (2017c). Cross-validation and over-fitting [PDF Document]. Available from: https://courses.edx.org/courses/course- v1:PennX+BDE1x+1T2017. Bejar, I. I., & Braun, H. I. (1994). On the synergy between assessment and instruction: Early lessons from computer-based simulations. Machine-Mediated Learning, 4(1), 5-25. Bennett, R. E. (2010). Cognitively based assessment of, for, and as learning (CBAL): A preliminary theory of action for summative and formative assessment. Measurement: Interdisciplinary Research and Perspectives, 8(2), 70-91. 197 Berkhin, P. (2006). A survey of clustering data mining techniques. In J. Kogan, C. Nicholas, & M. Teboulle (Eds.), Grouping multidimensional data (pp. 25–72). New York, NY: Springer. BrainPOP. (2015). Number Jumble [software]. Available from https://www.brainpop.com/games/numberjumble/. Braaten, K. N. (2009). An empirical investigation of unscalable components in scaling models (Doctoral dissertation). Available from: http://pqdtopen.proquest.com/doc/304920693.html?FMT=AI. (UMI Number: 3359337) Braun, H., Bejar, I. I., & Williamson, D. M. (2006). Rule-based methods for automated scoring: application in a licensing context. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 83-122). Mahwah, NJ: Erlbaum. Bruer, J. T. (1993). The mind's journey from novice to expert: If we know the route, we can help students negotiate their way. American Educator: The Professional Journal of the American Federation of Teachers, 17(2), 38-46. Brüngger, A., Marzetta, A., Fukuda, K., Nievergelt, J. (1999). The parallel search bench ZRAM and its applications. Annals of Operations Research, 90, 45-63. Buchta, C., & Hahsler, M. (2014). cba: Clustering for business analytics [software]. R package version 0.2-14. Available from: http://CRAN.R-project.org/package=cba. Bullinaria, J. A. (2004). Self-organizing maps: Fundamentals (PDF Document). Available from: http://www.cs.bham.ac.uk/~jxb/NN/l16.pdf. 198 Bycer, J. (2012, January 4). Subjective difficulty: how plumbers can fight demons. Gamasutra. Available from http://gamasutra.com/view/feature/134950/examining_subjective_difficulty_.php. Carnegie Mellon, Educational Technology Center’s ENGAGE Program (2013). Beanstalk: A see-saw game of balance [software]. Available from http://beanstalk.etc.cmu.edu/. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37-46. Corrigan, S., Stephenson, S., Makany, T., Frenz, M., Hoffman, E., Bauer, M., … Menon, U. (2014, April). Combining fun, learning, and measurement: Game development in the classroom and beyond. Workshop presented at the Annual Meeting for the National Council on Measurement in Education, Philadelphia, PA. Crocker, L, & Algina, J. (2008). Introduction to Classical and Modern Test Theory. Mason, OH: Cengage Learning D’Angelo, C., Rutstein, D., Harris, C., Bernard, R., Borokhovski, E., & Haertel, G. (2014). Simulations for STEM learning: Systematic review and meta-analysis (executive summary). Menlo Park, CA: SRI International. Davies, D. L. & Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1(2), 224–227. de Ayala, R. J. (2009). The theory and practice of Item Response Theory. New York, NY: The Guildford Press. Desgraupes, B. (2014). clusterCrit: Clustering indices (version 1.2.4) [computer software]. http://CRAN.R-project.org/package=clusterCrit. 199 Dimitriadou, E., Dolnicar, S., & Weingessel, A. (2002). An examination of indexes for determining the number of clusters in binary data sets. Psychometrika, 67(1), 137- 160. Dunietz, J. (2016). The fundamental limits of machine learning. Nautilus. Available from http://nautil.us/blog/the-fundamental-limits-of-machine- learning?utm_source=RSS_Feed&utm_medium=RSS&utm_campaign=RSS_Syn dication Eseryel, D., Ge, X., Ifenthaler, D., & Law, V. (2011). Dynamic modeling as a cognitive regulation scaffold for developing complex problem-solving skills in an educational massively multiplayer online game environment. Journal of Educational Computing Research, 45(3), 265-286. Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37-54. Fishman, B., Riconscente, M., Snider, R., Tsai, T., & Plass, J. (2014). Empowering educators: Supporting student progress in the classroom with digital games. Ann Arbor: University of Michigan. Fraley, C., & Raftery, A. E. (2007). Model-based methods of classification: Using the mclust software in chemometrics. Journal of Statistical Software, 18(6), 1-13. Gee, J. P. (2007). What video games have to teach us about learning and literacy (2 nd ed.). New York: Palgrave. Gentile, D. A., Anderson, C. A., Yukawa, S., Ihori, N., Saleem, M., Ming, L. K., Shibuya, A., Liau, A. K., Khoo, A., Bushman, B. J., Huesmann, L. R., & Sakamoto, A. (2009). The effects of prosocial video games on prosocial 200 behaviors: International evidence from correlational, longitudinal, and experimental studies. Personality and Social Psychology Bulletin, 35, 752–763. Goodman, L. A. (1975). A new model for scaling response patterns: An application of the quasi-independence concept. Journal of the American Statistical Association, 70, 755-768. Granic, I., Lobel, A., & Engels, C. M. E. (2014). The benefits of playing video games. American Psychologist, 69(1), 66-78. Guha, S., Rastogi, R., and Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th ICDE, 512-521, Sydney, Australia. Haladyna, T. M., & Downing, S. M. (1989). A taxonomy of multiple-choice item-writing rules. Applied Measurement in Education, 2, 51-78. Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17, 107-145. Hall, L. O, Bensaid, A. M., Clarke, L. P., Velthuizen, R. P., Silbiger, M. S., & Bezdek, J. C. (1992). A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain. IEEE Transactions on Neural Networks, 3(5), 672-682. Hamerly, G., & Elkan, C. (2004). Learning the k in k-means. Advances in Neural Information Processing Systems, 16. Hand, D., Mannila, H., & Smyth, P. (2001). Principles of data mining. Cambridge, MA: MIT Press. 201 Hardin, L. E. (2003). Problem solving concepts and theories. Journal of Veterinary Medical Evaluation, 3(30), 226-229. Hartigan, J. A. & Wong, M. A. (1979). A K-means clustering algorithm. Applied Statistics, 28, 100-108. Heitin, L. (2014, April 24). NAEP uses video-game design for new technology and engineering test. Available from http://blogs.edweek.org/edweek/curriculum/2014/04/naep_uses_video- game_design_fo.html. Hennig, C. (2015). fpc: Flexible Procedures for Clustering (version 2.1-10) [computer software]. https://CRAN.R-project.org/package=fpc. Hennig, L, & Liao, T. F. (2011). How to find an appropriate clustering for mixed type variables with application to socioeconomic stratification. Journal of the Royal Statistical Society: Series C (Applied Statistics), 62(3), 309-514. Hoffman, E., John, M., & Makany, T. (2014, April). How do game design frameworks align with learning and assessment design frameworks? In A. Oranje (Chair), Integrating games, learning, and assessment. Symposium conducted at the Annual Meeting for the National Council on Measurement in Education, Philadelphia, PA. Hothorn, T., & Everitt, B. S. (2009). A handbook of statistical analysis using R. (2 nd ed.). Boca Raton, FL: Chapman & Hall. Houle, M. E., Kriegel, H., Kröger, P., Schubert, E., & Zimek, A. (2010). Can shared- neighbor distances defeat the curse of dimensionality? In M. Gertz, & B. Ludäscher (Eds.), Scientific and statistical database management: 22 nd 202 international conference, SSDBM 2010, Heidelberg, Germany, June 30-July 2, 2010 (pp. 482-500). Berlin: Springer. Hynar, M., Burda, M., & Šarmanová, J. (2005). Unsupervised clustering with growing self-organizing neural network: A comparison with non-neural approach. In K. Richta, V. Snášel, & J. Pokorný (Eds.), Dateso (pp. 55-68). Available from http://www.cs.vsb.cz/dateso/sbornik/dateso05.pdf#page=66. Ishioka, T. (2005). An expansion of x-means for automatically determining the optimal number of clusters. Proceedings from The Fourth IASTED International Conference: Computational Intelligence (CI 2005), Calgary Canada, July 4-July 6, 2005 (pp. 91-96). Available from http://www.rd.dnc.ac.jp/~tunenori/doc/487- 053.pdf. Johnson, S. (2001). Emergence: The connected lives of ants, brains, cities, and software. New York, NY: Scribner. K20 Center, University of Oklahoma. (2011, October 27). McLarin’s Adventures – mapping scenario [video]. McLarin’s Adventures. Available from http://dgbl.ou.edu/mclarin/. Kato, P.M., Cole, S.W., Bradlyn, A.S., & Pollock, B.H. (2008). A video game improves behavioral outcomes in adolescents and young adults with cancer: A randomized trial. Pediatrics, 122, 305-317. Kerr, D., & Chung, G. (2012). Identifying key features of student performance in educational video games and simulations through cluster analysis. Journal of Educational Data Mining, 4(1), 145-182. 203 Kerr, D., Chung, G. K. W. K, & Iseli, M. R. (2011). The feasibility of using cluster analysis to examine log data from educational video games. (CRESST Report 790). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Koedinger, K. R., Baker, R. S. J.d., Cunningham, K., Skogsholm, A., Leber, B., Stamper, J. (2010) A data repository for the EDM community: The PSLC DataShop. In Romero, C., Ventura, S., Pechenizkiy, M., Baker, R. S. J.d. (Eds.) Handbook of educational data mining (pp. 43-56). Boca Raton, FL: CRC Press. Kohonen, T. (1990). The Self-Organizing Map. Proceedings of the IEEE, 78(9), 1464- 1480. Kriegel, H. P., Kröger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data, 3(1), 1-59. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. Lattin, J. M., Carroll, J. D., & Green, P. E. (2003). Analyzing multivariate data. Pacific Grove, CA: Brooks/Cole—Thomson Learning. Levy, R., & Mislevy, R. J. (2004). Specifying and refining a measurement model for a computer-based interactive assessment. International Journal of Testing, 4(4), 333-369. Linzer, D. A., & Lewis, J. B. (2011). poLCA: An R package for polytomous variable latent class analysis. Journal of Statistical Software, 41(10), 1-29. 204 Maisuria, L. K., Ong, C. S., & Lai, W. K. (1999). A comparison of artificial neural networks and cluster analysis for typing biometrics authentication. International Joint Conference on Neural Networks, 5, 3295-3299. Mandler, G. & Sarason, S. B. (1952). A study of anxiety and learning. Journal of Abnormal and Social Psychology, 47, 166-173. Margolis, M. J., & Clauser, B. E. (2006). A regression-based procedure for automated scoring of a complex medical performance assessment. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer- based testing (pp. 123-167). Mahwah, NJ: Erlbaum. Milligan, G. W., & Cooper, M. C. (1985). An examination of procedures for determining the number of clusters in a data set. Psychometrika, 50(2), 159-179. Mislevy, R. J. (1996). Test theory reconceived. Journal of Educational Measurement, 33(4), 379-416. Mislevy, R. J., Behrens, J. T., Dicerbo, K. E., & Levy, R. (2012). Design and discovery in educational assessment: Evidence-Centered Design, psychometrics, and Educational Data Mining. Journal of Educational Data Mining, 4(1), 11-48. Mislevy, R. J., Oranje, A., Bauer, M. I., von Davier, A., Hao, J., Corrigan, S., Hoffman, E., DiCerbo, K., John, M. (2014). Psychometric considerations in game-based assessment. Redwood City, CA: GlassLab Games. Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3-67. 205 Mislevy, R.J. (2011). Evidence-centered design for simulation-based assessment. (CRESST Report 800). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Muncy, J. (2015, December 3). Open-world games are changing the way we play. Available from https://www.wired.com/2015/12/open-world-games-2015/. Nash, P., & Shaffer, D. W. (2011). Mentor modeling: the internalization of modeled professional thinking in an epistemic game. Journal of Computer Assisted Learning, 27, 173-189. National Council of Architectural Registration Boards. (2012). 4RE exam guide. Available from http://www.ncarb.org/~/media/Files/PDF/ARE-Exam- Guides/SPD_Exam_Guide.pdf. Newell, A., & Simon, H. A. (1972) Human problem solving. Englewood Cliffs, NJ: Prentice Hall. Newell, M. A., Cook, D., Hofmann, H., & Jannink, J. (2013). An algorithm for deciding the number of clusters and validation using simulated data with application to exploring crop population structure. The Annals of Applied Statistics, 7(4), 1898- 1916. Nitko, A. J., & Brookhart, S. (2007). Educational assessment of students (5 th ed.). Upper Saddle River, NJ: Pearson Education, Inc. Pelleg, D., & Moore, A. (2000). X-means: Extending k-means with efficient estimation of the number of clusters. Proceedings from the 17 th International Conference on Machine Learning, 2000 (pp. 727-734). Available from https://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf. 206 Perron, B., & Wolf, M. J. P. (Eds.). (2009). The video game theory reader 2. New York, NY: Routledge. Powers, D. E. (1999). Test anxiety and test performance: comparing paper-based and computer adaptive versions of the GRE general test. (ETS Report RR-99-15). Princeton, NJ: Educational Testing Service. Prensky, M. (2012). From digital natives to digital wisdom: Hopeful essays for 21st century learning. Thousand Oaks, CA: Corwin Press. Quellmalz, E. S., Timms, M. J., Silberglitt, M. D., & Buckley, B. C. (2012). Science assessments for all: integrating science simulations into balanced state science assessment systems. Journal of Research in Science Teaching, 49(3), 363-393. R Core Team (2014). R: A language and environment for statistical computing (version 3.1.1) [computer software]. Vienna, Austria: R Foundation for Statistical Computing. Available from http://www.R-project.org/. Revelle, W. (2015) psych: Procedures for personality and psychological research (version 1.5.4) [computer software]. Evanston, IL: Northwestern University. Available from http://CRAN.R-project.org/package=psych. Robinson, D. (2017). K-means clustering is not a free lunch. Available from http://varianceexplained.org/r/kmeans-free-lunch/. Romero, C., Gonzalez, P., Ventura, S., del Jesus, M. J., & Herrera, F. (2009). Evolutionary algorithms for subgroup discovery in e-learning: A practical application using Moodle data. Expert Systems With Applications, 39, 1632–1644. Salen, K., & Zimmerman, E. (2004). Rules of play: Game design fundamentals. Cambridge, MA: MIT. 207 Salminen, M., & Ravaja, N. (2008). Increased oscillatory theta activation evoked by violent digital game events. Neuroscience Letters, 435, 69 –72. Schleiner, A. (2001). Does Lara Croft wear fake polygons? Gender and gender-role subversion in computer adventure games. Leonardo, 34(3), 221–226. Schmeiser, C. B., & Welch, C. J. 2006). Test development. In R. L. Brennan (Ed.), Educational Measurement: Fourth Edition (pp. 307-353). Westport, CT: Praeger Publishers. Schwartz, K. (2014a) Profile: Illinois. In T. Barseghian & K. Sung (Eds.), Mind shift: Guide to digital games + learning (pp. 30-31). Available from http://www.kqed.org/assets/pdf/news/MindShift- GuidetoDigitalGamesandLearning.pdf. Schwartz, K. (2014b) Profile: New York. In T. Barseghian & K. Sung (Eds.), Mind shift: Guide to digital games + learning (pp. 32). Available from http://www.kqed.org/assets/pdf/news/MindShift- GuidetoDigitalGamesandLearning.pdf. Shapiro, J. (2014a, December 18). Study shows video games’ impact in face-to-face teaching. Available from http://www.forbes.com/sites/jordanshapiro/2014/12/18/study-shows-video-games- impact-on-face-to-face-teaching/2/. Shapiro, J. (2014b, May 30). Can games make high-stakes tests obsolete? Available from http://blogs.kqed.org/mindshift/2014/05/can-games-make-high-stakes-tests- obsolete/. 208 Shapiro, J., Teknibaş, K. S., Schwartz, K., & Darvasi, P. (2014) What the research says about gaming and screen time. In T. Barseghian & K. Sung (Eds.), Mind shift: Guide to digital games + learning (pp. 6-13). Available from http://www.kqed.org/assets/pdf/news/MindShift- GuidetoDigitalGamesandLearning.pdf. Shute, V. J. & Wang, L. (2016). Assessing and supporting hard-to-measure constructs. To appear in A. Rupp, & J. Leighton (Eds.), Handbook of cognition and assessment (pp. 535-562). New York, NY: Springer. Shute, V. J. (2011). Stealth assessment in computer-based games to support learning. In S. Tobias, & J. D. Fletcher (Eds.), Computer games and instruction (pp. 503-524). Charlotte, NC: Information Age Publishers. Shute, V. J. (2014). Stealth assessment in games. Proceedings from the Maryland Assessment Research Center Conference. Available from http://marces.org/conference/InnovativeAssessment/2Shute.pdf. Shute, V. J., & Ke, F. (2012). Games, learning, and assessment. In D. Ifenthaler, D. Eseryel, & Ge, X. (Eds.), Assessment in game-based learning: Foundations, innovations, and perspectives (pp. 43-58). New York, NY: Springer. Shute, V. J., Ventura, M., & Kim, Y. J. (2013). Assessment and learning of qualitative physics in Newton’s Playground. The Journal of Education Research, 106, 423- 430. Shute, V. J., Ventura, M., Kim, Y. J. & Wang, L. (2014). Video games and learning. In W. G. Tierney, Z. Corwin, T. Fullerton, and G. Ragusa (Eds.), Postsecondary 209 play: The role of games and social media in higher education (pp. 217-235). Baltimore, MD: John Hopkins University Press. Siegler, R. S. (1976). Three aspects of cognitive development. Cognitive Psychology, 8, 481-520. Siegler R. S., & Chen, Z. (2002). Development of rules and strategies: Balancing the old and the new. Journal of Experimental Child Psychology, 81, 446-457. Silverman, S. (2011, November 29). Target’s cashier game – is it really a game? Available from http://www.levelspro.com/targets-cashier-game-is-it-really-a- game/. Song, S. & Li, C. (2006). Improved ROCK for text clustering using asymmetric proximity. Proceedings from the 32nd Conference on Current Trends in Theory and Practice of Computer Science. Available from https://www.researchgate.net/publication/221513060_Improved_ROCK_for_Text _Clustering_Using_Asymmetric_Proximity. SRI International. (2015). Independent research and evaluation on GlassLab games and assessments. Available from http://www.sri.com/work/projects/glasslab-research. Stevens, R. H., & Casillas, A. (2006). Artificial neural networks. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer- based testing (pp. 259-312). Mahwah, NJ: Erlbaum. Stevens, R., Galloway, T., Wang, P., Berka, C., Tan. V., Wohlgemuth, T., Lamb, J., & Buckles, R. (2012). Modeling the neurodynamic complexity of submarine navigation teams. Computational and Mathematical Organization Theory, 19, 346-369. 210 Suits, B. (2005). The grasshopper: Games, life, and utopia (2 nd ed.). Toronto, ON: Broadview. Takeuchi, L. M., & Vaala, S. (2014). Level up learning: A national survey on teaching with digital games. New York: The Joan Ganz Cooney Center at Sesame Workshop. Teknibaş, K. S. (2014) Getting in the game. In T. Barseghian & K. Sung (Eds.), Mind shift: Guide to digital games + learning (pp. 4-5). Available from http://www.kqed.org/assets/pdf/news/MindShift- GuidetoDigitalGamesandLearning.pdf. Wang, Y., Yang, C., Mathee, K., & Narasimhan, G. (2005). Clustering using adaptive self-organizing maps (ASOM) and applications. In Computational Science–ICCS 2005 (pp. 944-951). Berlin: Springer. Wehrens, R., & Buydens, L.M.C. (2007). Self- and super-organising maps in R: the Kohonen package. Journal of Statistical Software, 21(5), 1-19. Xu, B., Recker, M., Qi, X., Flann, N., & Ye, L. (2013). Clustering educational digital library usage data: A comparison of latent class analysis and k-means algorithms. Journal of Educational Data Mining, 5(2), 38-68.