ABSTRACT

Title of dissertation: A VISUAL ANALYTICS APPROACH TO COMPARING COHORTS OF EVENT SEQUENCES

Sana Malik, Doctor of Philosophy, 2016

Dissertation directed by: Professor Ben Shneiderman, Department of Computer Science

Sequences of timestamped events are currently being generated across nearly every domain of data analytics, from e-commerce web logging to electronic health records used by doctors and medical researchers. Every day, this data type is reviewed by humans who apply statistical tests, hoping to learn everything they can about how these processes work, why they break, and how they can be improved upon.

To further uncover how these processes work the way they do, researchers often compare two groups, or cohorts, of event sequences to find the differences and similarities between outcomes and processes. With temporal event sequence data, this task is complex because of the variety of ways single events and sequences of events can differ between the two cohorts of records: the structure of the event sequences (e.g., event order, co-occurring events, or frequencies of events), the attributes about the events and records (e.g., gender of a patient), or metrics about the timestamps themselves (e.g., duration of an event). Running statistical tests to cover all these cases and determining which results are significant becomes cumbersome.

Current visual analytics tools for comparing groups of event sequences emphasize a purely statistical or purely visual approach for comparison. Visual analytics tools leverage humans' ability to easily see patterns and anomalies that they were not expecting, but are limited by uncertainty in findings. Statistical tools emphasize finding significant differences in the data, but often require researchers to have a concrete question and do not facilitate more general exploration of the data. Combining visual analytics tools with statistical methods leverages the benefits of both approaches for quicker and easier insight discovery. Integrating statistics into a visualization tool presents many challenges on the frontend (e.g., displaying the results of many different metrics concisely) and on the backend (e.g., scalability challenges with running various metrics on multi-dimensional data at once).

I begin by exploring the problem of comparing cohorts of event sequences and understanding the questions that analysts commonly ask in this task. From there, I demonstrate that combining automated statistics with an interactive user interface amplifies the benefits of both types of tools, thereby enabling analysts to conduct quicker and easier data exploration, hypothesis generation, and insight discovery. The direct contributions of this dissertation are: (1) a taxonomy of metrics for comparing cohorts of temporal event sequences, (2) a statistical framework for exploratory data analysis with a method I refer to as high-volume hypothesis testing (HVHT), (3) a family of visualizations and guidelines for interaction techniques that are useful for understanding and parsing the results, and (4) a user study, five long-term case studies, and five short-term case studies which demonstrate the utility and impact of these methods in various domains: four in the medical domain, one in web log analysis, two in education, and one each in social networks, sports analytics, and security.
My dissertation contributes an understanding of how cohorts of temporal event sequences are commonly compared and the difficulties associated with applying and parsing the results of these metrics. It also contributes a set of visualizations, algorithms, and design guidelines for balancing automated statistics with user-driven analysis to guide users to significant, distinguishing features between cohorts. This work opens avenues for future research in comparing two or more groups of temporal event sequences, opening traditional machine learning and data mining techniques to user interaction, and extending the principles found in this dissertation to data types beyond temporal event sequences.

A VISUAL ANALYTICS APPROACH TO COMPARING COHORTS OF EVENT SEQUENCES

by Sana Malik

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2016

Advisory Committee:
Professor Ben Shneiderman, Chair/Advisor
Dr. Catherine Plaisant, Co-Advisor
Professor Margret Bjarnadottir
Professor Hector Corrada-Bravo
Professor Niklas Elmqvist

© Copyright by Sana Malik, 2016

Dedication

To Zaan, Mahum, Ismael, and Humza

Acknowledgments

I love too many people.

I would first like to thank my advisors, Dr. Ben Shneiderman and Dr. Catherine Plaisant, for their continued support throughout the last three years. Thank you, Ben, for your endless patience and optimism – I could not have asked for a more positive and encouraging advisor. Catherine, thank you for always being practical, available, and willing to provide feedback on everything I've asked. I've learned so much from you both, not just about research, but about being part of a team and always remaining positive.

I'd also like to thank the members of my proposal and dissertation committees who made my work considerably stronger: Hector Corrada Bravo, Margret Bjarnadottir, Niklas Elmqvist, and Alan Sussman. Thank you for providing feedback throughout the entirety of my research. Thank you also to all my case study partners for putting up with countless bugs, usability issues, and confusing errors and still providing valuable feedback: Rachel Webman, Randall Burd, Leah MacFadyen, Eberechukwi Onukwugha, Jim Gardner, Eunyee Koh, and Sean Barnes.

I would like to thank Fan Du, for being a wonderfully reliable collaborator and friend; I am so glad I've had you by my side for the past two years. Megan Monroe, Cody Dunne, and John Alexis Guerra-Gomez: thank you for your guidance and mentorship. I'll always look up to you! I am so grateful for the entire HCIL and each of its members. Thank you for always being a bright, friendly place where ideas come to grow and practice talk standards are unreal.

I don't even know how to begin this next group. The past five years are when I've found "my people." I'm so grateful for all the friends I've made, for countless game nights where we spent more time learning the rules than playing the game, for trivia Thursdays, and for always teaching me new things. Philip and Robin (and Ada) Dasler, Cody Buntain, Leigh Cook, Steve Bach, Matt Mauriello, Jay Pujara, Alex Malozemoff. You guys are the best. Brenna McNally. I didn't include you in the previous list, because you are too special (no offense, everyone else). Thank you for doing too much for me. For baby-sitting me and forcing me to rehearse talks and write when I did. not. want. to.
For cleaning the apartment without me noticing and for having enough energy to spare some of yours for me. And lastly, thank you, Steven Lee, for always seeing the silver lining and for always encouraging me. To the CCL: Anam A., Amina, Annya, Asema, and Anam R.: who would I even be without you? Thanks for accepting me even though my name doesn’t begin with an A and for countless birthday dinners, text messages, and road trips over the past 10 years. Lastly, I’d like to thank my family. My parents, for being the hardest working people I know and my siblings for never letting me feel alone. iv Table of Contents List of Tables viii List of Figures ix 1 Introduction 1 1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2 Dissertation Organization . . . . . . . . . . . . . . . . . . . . . . . . 10 2 Background and Related Work 11 2.1 Event Sequence Visualization and Comparison . . . . . . . . . . . . . 11 2.1.1 Single Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.1.2 Visual Comparison . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1.3 Event Sequence Comparison . . . . . . . . . . . . . . . . . . . 13 2.2 Statistics for Comparing Cohorts . . . . . . . . . . . . . . . . . . . . 19 2.3 Exploratory Hypothesis Testing . . . . . . . . . . . . . . . . . . . . . 22 2.4 Temporal Data Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.5 Scalability in Visual Analytics . . . . . . . . . . . . . . . . . . . . . . 24 2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3 A Taxonomy of Metrics for Comparing Cohorts 28 3.1 Summary Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.2 Record Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 Sequence Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3.1 Occurrence Metrics . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.2 Time Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3.3 Event Attribute Metrics . . . . . . . . . . . . . . . . . . . . . 40 3.4 Combining Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4 Statistical Framework for High-Volume Hypothesis Testing 43 4.1 System Overview: Backend . . . . . . . . . . . . . . . . . . . . . . . . 47 4.1.1 Code Structure and Organization . . . . . . . . . . . . . . . . 47 4.1.2 Data Processing Pipeline . . . . . . . . . . . . . . . . . . . . . 52 v 4.2 Guidelines for Scaling HVHT to Large Event Sequence Datasets . . . 54 4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5 Design of CoCo: Frontend 60 5.1 Description of the User Interface . . . . . . . . . . . . . . . . . . . . . 61 5.1.1 Sequence Scattergram and Sequence Filters . . . . . . . . . . 62 5.1.2 Cohort Overviews . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.1.3 Result Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.1.4 Results Panel . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.1.5 Sequence Details . . . . . . . . . . . . . . . . . . . . . . . . . 67 5.2 Design Guidelines for HVHT Visual Analytics Tools . . . . . . . . . . 69 5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 6 Evaluation and Case Studies 80 6.1 Preliminary User Study . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6.1.1 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
80 6.1.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 6.2 Case Studies: Introduction . . . . . . . . . . . . . . . . . . . . . . . . 87 6.3 CS1: Exploring Adherence to Advanced Trauma Life Support Protocol 89 6.3.1 System Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.3.2 Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 6.4 CS2: Student Course Enrollments . . . . . . . . . . . . . . . . . . . . 95 6.4.1 System Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 6.4.2 Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6.5 CS3: Medication Adherence Patterns of Hypertension Patients . . . . 102 6.5.1 System Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 6.5.2 Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.6 CS4: Customer Web Logs . . . . . . . . . . . . . . . . . . . . . . . . 109 6.6.1 System Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 6.6.2 Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.7 CS5: In-Classroom Student Behaviors . . . . . . . . . . . . . . . . . . 113 6.7.1 System Use . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 6.7.2 Outcomes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 6.8 CS6: Distinguishing Types of Radiation to the Bone . . . . . . . . . . 117 6.9 CS7: Children’s AIM2 . . . . . . . . . . . . . . . . . . . . . . . . . . 118 6.10 CS8: Computer Activity Logs . . . . . . . . . . . . . . . . . . . . . . 119 6.11 CS9: Social Media Messages . . . . . . . . . . . . . . . . . . . . . . . 120 6.12 CS10: Baseball Career Trajectories . . . . . . . . . . . . . . . . . . . 121 6.13 8 Incomplete Case Studies . . . . . . . . . . . . . . . . . . . . . . . . 122 6.14 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 7 Discussion and Future Work 125 7.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 7.1.1 Difference Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.1.2 Statistical False Positives . . . . . . . . . . . . . . . . . . . . . 127 vi 7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 7.2.1 Supporting Comparison of Three or More Groups . . . . . . . 127 7.2.2 Integrated Cohort Selection . . . . . . . . . . . . . . . . . . . 128 7.2.3 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 7.2.4 Database Backend . . . . . . . . . . . . . . . . . . . . . . . . 133 7.2.5 Interval Events . . . . . . . . . . . . . . . . . . . . . . . . . . 135 7.2.6 Extending to Other Data Types . . . . . . . . . . . . . . . . . 136 7.2.7 Journaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 7.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 8 Evolution of CoCo 139 8.1 Version 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139 8.2 Version 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 8.3 Version 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 8.4 Version 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 8.5 Version 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 8.6 Version 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
145 9 Case Study Questionnaires 147 Bibliography 153 vii List of Tables 3.1 This table shows the applicable metrics for each sequence type (de- noted by an X). Metrics with shaded cells are those that were imple- mented in the final version of CoCo. . . . . . . . . . . . . . . . . . . . 34 4.1 The 5 Scalability Guidelines for extending high-volume hypothesis testing to large datasets. . . . . . . . . . . . . . . . . . . . . . . . . . 46 5.1 The 7 Design Guidelines for balancing automated high-volume hy- pothesis testing with integrated visualization and interaction (Sec- tion 5.2). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 6.1 Number of hypotheses generated by metric and sequence type. . . . . 105 viii List of Figures 1.1 Two datasets, each containing about a thousand patients as they are transferred throughout a hospital, are being compared using CoCo: patients who lived and patients who died (demo dataset; no real data). Along the top are high-level overviews of each dataset: a scat- terplot displaying the sequences in the dataset and how often they occur and each cohort is visualized as an EventFlow graph.The bot- tom panel displays a rich compact view of the results of high-volume hypothesis testing, ranked by significance with a legend pairing each event with a color. To the right of the list, details-on-demand for a selected hypothesis (comparing the average timing between the blue and red event) provides more details and context for the results. A set of control panels (top right panel) allows analysts to sort and filter the results by event sequence length, event types, sample size, significance, or metric. . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Current approaches to comparing event sequences involve putting two separate windows side-by-side for a visual comparison. . . . . . . . . 4 2.1 EventFlow visualizes an aggregated view of a single group of event sequences. CoCo borrows event icon representations from EventFlow. 12 2.2 Outflow visualizes groups of temporal event sequences for outcome analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.3 MizBee measures the similarity between genomes by visualizing re- gions of shared sequences. . . . . . . . . . . . . . . . . . . . . . . . . 14 2.4 Variant View is a genome browser tool that aligns sequences by sim- ilarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.5 FeatureLens visualizes frequent patterns in text collections. . . . . . . 16 2.6 History Flow visualizes changes between versions of the same document. 16 2.7 The TreeJuxtaposer system to help biologists explore structural de- tails of phylogenetics and focuses only on structural differences in the trees but not any attributes about the nodes (such as timestamps). . 18 2.8 TreeVersity visually compares trees with similar structures. . . . . . . 19 ix 2.9 The Kaplan-Meier Estimator is used to compare the survival rates of groups of patients receiving different treatments. The estimator shows the maximum possible likelihood of survival (as a percentage) for each group as a function of time. . . . . . . . . . . . . . . . . . . 20 2.10 CAVA combines visual analytics and statistics by allowing users to interactively refining cohorts and perform statistics on a single group. 21 2.11 imMens uses aggregation to scale to large datasets. . . . . . . . . . . 
25 2.12 Progressive Insights allows users to see in-progress visualizations in order to allow users to guide the algorithm and ignore subspaces of the data that may not be relevant. . . . . . . . . . . . . . . . . . . . 26 3.1 The dataset used as an example for the remainder of this chapter consists of records of patients who were admitted to the emergency room and follows their movement through their stay at the hospital: being administered aspirin, being admitted into the hospital room, transferring between a normal floor bed and the intensive care unit (ICU), and ultimately being discharged either dead or alive. . . . . . 30 3.2 Because cohorts do not necessarily need to be the same size, it is important to report on the number of records in each cohort. An un- derstanding of the number of records allows analysts to understand broad trends between the cohorts (e.g., is the selection criteria bal- anced?) In this example, there are only 4 patients who died versus 6 who lived. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3 The number of events is the raw number of events in each cohort. Combined with the number of records metric, this can reveal interest- ing information about the frequency of events and the average length of records. In this example, though there are 50% more patients who lived than those who died, the number of events is only 20% greater, indicating that patients who died have longer sequences, on average, than those who live. . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.4 Prevalence of record attributes reports on the percent of records who have a particular value. This example is comparing the proportion of male and female patients between the two groups. . . . . . . . . . . . 33 3.5 The prevalence of an event is calculated as a percentage of records that contain that particular event. . . . . . . . . . . . . . . . . . . . . 35 3.6 The prevalence of a subsequence is calculated as a percentage of records that contain that particular subsequence. . . . . . . . . . . . 35 3.7 Co-occurring events are a pair of events which occur within a single record, and may or may not have other events between them. . . . . . 36 3.8 Absolute time metrics look at the timestamp of a particular event. For example, the prevalence of the day of the week can differ between the two cohorts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.9 Relative time metrics involve comparing the average gap between two consecutive events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 x 3.10 Relative time metrics involve comparing the average gap between two events. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.11 Attribute metrics are similar to other event metrics, but the events are further broken down by the attribute’s value. In this example, the doctor that is on-call when the patient arrived at the emergency room is noted. Dr. Smith was on-call more often in patients who lived than those who didn’t. . . . . . . . . . . . . . . . . . . . . . . . 40 4.1 A chart of the average runtime to find all sequences (a) versus the number of unique sequences and (b) versus the number of records. Finding all subsequences within both datasets grows proportionally with the number of records in the dataset, whereas the number of unique sequences has no effect. . . . . . . . . . . . . . . . . . . . . . . 
44 4.2 A chart of the average runtime to calculate all hypotheses (a) versus the number of unique sequences and (b) versus the number of records. Calculating all hypotheses depends both on the number of records and the number of unique sequences in the dataset. . . . . . . . . . . . . . 45 4.3 Code structure and organization. . . . . . . . . . . . . . . . . . . . . 48 4.4 CoCo data processining pipeline. CoCo processes data in five major steps: (1) Analysts select two datasets from the interface. (2) The data files are sent to the server. (3) Sequences and counts are ex- tracted. (4) The results for the sequence counts are sent back to the client. (5) CoCo begins (a) calculating metric results and (b) sends them back as they are completed, until all metrics have been calculated. 53 5.1 CoCo is comprised of five main panels: sequence scattergram and filters, cohort overviews, result filters, results panel, and sequence details. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.2 Methods for sorting and filtering the result set. Results can be filtered by two ways using a table that shows the number of hypotheses that were tested according to metric and sequence type. The table further breaks down filtering the results based on p-value into three groups: ≤ 0.01, ≤ 0.05, and > 0.05. . . . . . . . . . . . . . . . . . . . . . . . 64 5.3 The main results panel (Figure 5.3) displays all the results of the hypothesis tests according to the sorting and filtering preferences set by the analysts. To the left is a legend which each event category that is found in the dataset, assigned a color. Each result is encoded as a row, where the center shows the hypothesis that was tested. Colored bars in the center indicate the sequence that the hypothesis refers to and the icons to the left indicate the corresponding metric. Depending on the value of the result, a bar grows out from the center in the direction where the value is larger, on a ratio scale. The bar is then colored by the p-value of the result. . . . . . . . . . . . . . . . . 66 xi 5.4 Analysts can view details about a result by clicking it. Results that correspond to comparing averages (such as average duration or aver- age frequency) will show the distributions of all the values and statis- tics about the average, minimum, maximum, and standard deviation in both cohorts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.5 Analysts’ responses. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.6 Current scheme. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.7 Mockups of expert analysts’ responses (left) and resulting glyphs (right) for visually differentiating four properties of event sequences: (a) whole record sequences, (b) concurrent events, (c) consecutive sequences, and (d) nonconsecutive sequences. . . . . . . . . . . . . . . 77 5.8 Designs considered for presenting difference results between cohorts and : (a) juxtaposition (directly comparing two bars), (b) superposi- tion (overlaying bars darkened area is the shared amount while the lightened area indicates the difference), and (c) explicit encoding only, which encodes only information about the direction and magnitude of the difference. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.9 Analysts can view details about a result by clicking it. 
Results that correspond to comparing averages (such as average duration or aver- age frequency) will show the distributions of all the values and statis- tics about the average, minimum, maximum, and standard deviation in both cohorts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79 6.1 Average number of insights per participant per category using Event- Flow versus CoCo. The only statistically significant difference (p < 0.05) is in insights about subsequences, where participants found more insights using CoCo. . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.2 Analysts at Children’s National Medical Center used CoCo to un- derstand potentially distinguishing attributes between patients who are treated according to the Advanced Trauma Life Support (ATLS) protocol versus those who are not. . . . . . . . . . . . . . . . . . . . . 89 6.3 An analyst at the University of British Columbia (UBC) was inter- ested in using CoCo to better understand the pathways UBC’s stu- dents typically pursue towards degree completion . . . . . . . . . . . 95 6.4 Researchers at the University of Maryland used CoCo to compare whether drug adherence affected the cost that patients incurred over a year. In other words: Could taking medication as prescribed result in lower overall medical costs? . . . . . . . . . . . . . . . . . . . . . . 102 6.5 Final results and usage of drug pattern case study. Analysts used the Sequence Occurrence panel (c) to control sample size, and the Filter panel (b) to control significance and sequence length. This resulted in only 10 hypotheses (a) for the researchers to manually review. . . . 107 6.6 Analysts at Adobe were interested in comparing user click logs using CoCo to understand which events lead to a product purchase versus don’t. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 xii 6.7 An analyst at the University of British Columbia (UBC) used CoCo to compare the in-classroom behaviors of students in the top quartile versus bottom quartile. . . . . . . . . . . . . . . . . . . . . . . . . . . 113 8.1 The first version of CoCo was largely textual, with results grouped by metric type. Analysis could select the results they wished to view using the metric list in the middle panel. . . . . . . . . . . . . . . . . 139 8.2 CoCo version two brought a variety of usability fixes. . . . . . . . . . 141 8.3 Version 3 added more utility to parsing the result set through meth- ods for filtering and sorting, layout changes, and explicit difference encodings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 8.4 CoCo v4 introduced important changes in the way sequences and hypothesis results were displayed. . . . . . . . . . . . . . . . . . . . . 143 8.5 The fifth version of Coco introduced the most major changes: remov- ing the metrics list, redesigning hypothesis results, sequence scatter- plot, and details on demand. . . . . . . . . . . . . . . . . . . . . . . . 144 8.6 The final version of CoCo (v6) streamlined the process model ob- served through the case studies. . . . . . . . . . . . . . . . . . . . . . 146 9.1 Entry questionnaire, page 1. . . . . . . . . . . . . . . . . . . . . . . . 148 9.2 Entry questionnaire, page 2. . . . . . . . . . . . . . . . . . . . . . . . 149 9.3 Exit questionnaire, page 1. . . . . . . . . . . . . . . . . . . . . . . . . 150 9.4 Exit questionnaire, page 2. . . . . . . . . . . . . . . . . . . . . . . . . 151 9.5 Exit questionnaire, page 3. . . . . . . . . . . . . . . . . . . . . . . . . 
152

Chapter 1: Introduction

Sequences of timestamped events are currently being generated across nearly every domain of data analytics. Consider a typical e-commerce site tracking each of its users through a series of search results and product pages until a purchase is made. Or consider a database of electronic health records containing the symptoms, medications, and outcomes of each patient who is treated. Every day, this data type is reviewed by humans who apply statistical tests, hoping to learn everything they can about how these processes work, why they break, and how they can be improved upon.

Human eyes and statistical tests, however, reveal very different things. Statistical tests show metrics, uncertainty, and statistical significance. Human eyes see context, confirm what they already know, and discover patterns that are unexpected. Visualization tools strive to capitalize on these latter, human strengths. For example, the EventFlow visualization tool [1] supports exploratory, visual analyses over large datasets of temporal event sequences. This support for open-ended exploration, however, comes at a cost. The more that a visual analytics tool is designed around open-ended questions and flexible data exploration, the less it is able to effectively integrate automated, statistical analysis. Automated statistics can provide answers, but only when the questions are known.

The opportunity to combine these two approaches lies in the middle ground. By all accounts, the goal of open-ended questions is to generate more concrete questions. As these questions come into focus, so too does the ability to automatically generate the answers. I introduce a visual analytics tool, CoCo (for "Cohort Comparison", Figure 1.1), that is designed to capitalize on one such scenario.

Consider again the information that is tracked on an e-commerce site. From a business perspective, the users of the site fall into one of two groups: people who bought something and people who did not. If the goal is to convert more of the latter into the former, it is critical to understand how these two groups, or cohorts, are different. Did one group look at more product pages? Or spend more time on the site? Or have some clear demographic identifier such as gender, race, or age? Similar questions arise in the medical domain as well. Which patients responded well to an experimental medication? How did their treatment patterns differ from the patients who received the standard treatment?

Although comparing two groups of data is a common task, with temporal event sequence data in particular, the task of running many statistical tests becomes complex because of the variety of ways the cohorts, sequences, and events can differ. In addition to the structure of the event sequences (e.g., order, co-occurrences, or frequencies of events), the attributes about the events and records (e.g., gender of a patient), and the timestamps themselves (e.g., an event's duration) can be distinguishing features between the cohorts. For this reason, running statistical tests to cover all these cases and determining which results are significant becomes cumbersome. Based on three years of case studies, I present a taxonomy of metrics for comparing cohorts of event sequences.

Figure 1.1: Two datasets, each containing about a thousand patients as they are transferred throughout a hospital, are being compared using CoCo: patients who lived and patients who died (demo dataset; no real data). Along the top are high-level overviews of each dataset: a scatterplot displaying the sequences in the dataset and how often they occur, and each cohort is visualized as an EventFlow graph. The bottom panel displays a rich compact view of the results of high-volume hypothesis testing, ranked by significance, with a legend pairing each event with a color. To the right of the list, details-on-demand for a selected hypothesis (comparing the average timing between the blue and red event) provides more details and context for the results. A set of control panels (top right panel) allows analysts to sort and filter the results by event sequence length, event types, sample size, significance, or metric.

Figure 1.2: Current approaches to comparing event sequences involve putting two separate windows side-by-side for a visual comparison.

Additionally, the factor on which the cohorts are formed may call for different types of questions to be asked about the data. For example, in a set of medical records split by date (e.g., last month's trials vs. this month's), a researcher may be interested in how outcomes for the patients differ between the cohorts, whereas a dataset split by the patient's outcome (e.g., patients who die vs. those who live) would ignore such a metric.

Current tools for cohort comparison of temporal event data (Section 2) emphasize one of two strategies: 1) purely visual comparisons between groups (Figure 1.2), with no integrated statistics, or 2) purely statistical comparisons over one or more features of the dataset. By contrast, CoCo is designed to provide a more balanced integration of both human-driven and automated strategies.

Purely statistical methods of comparison would benefit from user intervention. With the sheer number of metrics, it is time consuming to run every metric ahead of time, especially when not every metric may be required for analysis. Users with domain knowledge about the datasets would ideally be able to select from the metrics and easily eliminate unnecessary metrics. Further, questions asked during cohort comparison may vary based on how the cohorts were divided. If the cohorts were divided by outcome (e.g., patients who lived versus patients who died), the sequence of events leading up to them becomes more important. Analysis might revolve around determining what factors (time or attributes) or events lead to the outcome by determining how the metrics differ between the groups. Conversely, if the cohorts were split based on an event type, questions may revolve around finding distinguishing outcomes (e.g., patients who took Drug A may result in more strokes than patients who took Drug B). Exploration of cohorts that are split by time (e.g., the same patients over two different months) may be more open-ended and require all metrics. The cohorts can be distinguished by time factors, event attributes, or events themselves (sequences of events or outcomes).

Results from purely statistical methods can also be difficult to parse and understand. Analysts may have different priorities and questions, which require different methods for sorting the results. For example, analysts may be interested in any difference between the datasets, regardless of the direction of the difference, whereas other analysts may be interested only in results that occur more frequently in Cohort A. Integrated interaction techniques would allow analysts to specify their priorities when viewing results.

The contribution of this thesis is to enable researchers to be far more flexible in examining cohorts and to facilitate human intervention where it can save time and effort. Because of the pre-defined problem space of comparing temporal event sequences, analysts can save time by having answers to common questions readily available, giving them a starting point for their exploration. It is important to note that CoCo is intended for exploratory data analysis which will reveal areas of interest to analysts, not as a means of displaying final statistical results, so more complex controls are left for future work. Analysts are expected to conduct follow-up (and more controlled) tests after they have identified possible hypotheses, such as clinical trials in the medical domain or A/B testing in the e-commerce domain.

Purely visual tools for temporal event sequences are a good starting point for developing analysis tools for cohort studies, but can be improved by the inclusion of the statistical tests used in automated approaches. For example, EventFlow assumes that each patient record consists of time-stamped point events (e.g., heart attack, vaccination, first occurrence of a symptom), temporal interval events (e.g., medication episode, dietary regime, exercise plan), and patient attributes (e.g., gender, age, weight, ethnic background, etc.).

In multiple case studies with EventFlow, the researchers repeatedly observed users visually comparing event patterns in one group of records with those in another group. In simple terms the question was: what are the sequences of events that differentiate one group from the other? A common aspiration is to find clues that lead to new hypotheses about the series of events that lead to particular outcomes, but many other simple questions also involved comparisons. Epidemiologists analyzing the patterns of drug prescriptions [2] tried to compare the patterns of different classes of drugs. Hospital administrators looking at patient journeys through the hospital compared the data of one month with the previous month. Researchers analyzing task performance during trauma resuscitation [3] wanted to compare performance between cases where the response team was alerted of the upcoming arrival of the patient or not alerted. Transportation analysts looking at highway incident responses [4] wanted to compare how an agency handled its incidents differently from another. Their observations suggest that some broad insights can be gained by visually comparing pairs of EventFlow displays (e.g., analysts could see if the patterns were very similar overall between one month and the next) or very different (e.g., a lot more red, or the most common patterns were different), but analysts repeatedly expressed the desire for more systematic ways to compare cohorts of records.

My research aims to bridge the gap between statistical and visual analyses to enable more efficient insight discovery and hypothesis generation. With this come many practical challenges of implementing a high-volume hypothesis testing framework and presenting its result set in an understandable and useful way. On the backend, I consider the scalability of automatically running metrics on complex event sequences. The problem with scalability is two-fold: first, as the number of events grows, the number of possible event sequences grows exponentially. Second, with large numbers of metrics, developers must think about how to efficiently and simultaneously apply many metrics to the dataset at once. On the frontend are considerations with displaying various metrics in a unified way so analysts can understand and parse the results. Additionally, with the sheer number of results, analysts must be given intelligent interaction techniques for parsing, filtering, organizing, and sorting the results.

On a broader level, my dissertation contributes an understanding of how cohorts of temporal event sequences are commonly compared and the difficulties associated with applying and parsing the results of these metrics. It also contributes a set of visualizations, algorithms, and design guidelines for balancing automated statistics with user-driven analysis to guide analysts to significant, distinguishing features between cohorts. This work opens avenues for future research in comparing two or more groups of temporal event sequences, opening traditional machine learning and data mining techniques to user interaction, and extending the principles found in this dissertation to data types beyond temporal event sequences.

With the enormous amount of temporal data being collected in medical trials, consumer web logs, and sensor-based technologies, the opportunities for gaining insights are vast. With a tool like CoCo, analysts will be able to improve analysis that will lead to more efficient processes in health care, business, education, and many other areas.

I begin by showing that the task of cohort comparison is specific enough to support automatic computation against a bounded set of potential questions and objectives, a method I refer to as High-Volume Hypothesis Testing (HVHT). From this starting point, I demonstrate that the diversity of these objectives, both across and within different domains, as well as the inherent complexities of real-world datasets, still require human involvement to determine meaningful insights. I explore how visualization and interaction better support the task of exploratory data analysis and understanding HVHT results (how significant they are, why they are meaningful, and whether the entire dataset has been exhaustively explored). Through interviews and case studies with domain experts, I iteratively design and implement visualization and interaction techniques in a visual analytics tool, CoCo, which is used by real-world analysts performing cohort comparison on their own datasets.

1.1 Contributions

The contributions of this dissertation are:

A taxonomy of metrics for comparing cohorts of temporal event sequences. Through a systematic literature review of EventFlow and other case studies, I identified common questions that analysts ask when comparing two or more groups of event sequences and organized these questions in a taxonomy of metrics.

A statistical framework for exploratory data analysis. I implement a subset of the metrics introduced in the taxonomy and identify and solve the major practical challenges of applying thousands of statistical tests, a method I refer to as high-volume hypothesis testing (HVHT).

A family of visualizations and guidelines for interaction techniques. Through an iterative design process with case study partners, I develop and implement visualizations and interaction techniques that are useful for understanding and parsing large sets of hypothesis results.

Evaluations to demonstrate the utility and impact of these methods.
I perform three types of evaluation through the development of CoCo:

• a preliminary user study comparing CoCo to EventFlow for the task of cohort comparison,
• six long-term case studies, and
• five short-term case studies.

1.2 Dissertation Organization

This dissertation is organized in the following parts: Chapter 2 discusses related work in event sequence visualization, statistics for comparing cohorts, machine learning techniques for identifying meaningful event sequences, and methods for efficient computation on multi-dimensional data. Chapter 3 discusses the taxonomy of cohort comparison metrics. Chapter 4 discusses the challenges and solutions on the backend for implementing a high-volume hypothesis testing framework, and Chapter 5 discusses the challenges and solutions on the frontend, including a set of visualization design guidelines. Chapter 6 details the evaluation. Chapter 7 concludes the dissertation and discusses avenues for future work.

Chapter 2: Background and Related Work

2.1 Event Sequence Visualization and Comparison

Work on visualization of sequential data is described here in two parts: visualizations of a single group of event sequences and visualizations comparing two or more sequences.

2.1.1 Single Groups

EventFlow [1] (Figure 2.1) and Outflow [5] (Figure 2.2) visualize a simplified view of collections of event and interval sequences. Both tools aggregate a single cohort and the whole sequences of records. EventFlow allows users to explore the underlying dataset through this visualization. However, the tools only support visualizing a single group of records, though comparison can be facilitated by using multiple instances of the visualization. In that case, however, the tools do not provide statistical information about the differences. CoCo borrows some event icon motifs from EventFlow (such as using colored markers to represent events).

Figure 2.1: EventFlow visualizes an aggregated view of a single group of event sequences. CoCo borrows event icon representations from EventFlow.

Figure 2.2: Outflow visualizes groups of temporal event sequences for outcome analysis.

2.1.2 Visual Comparison

Gleicher et al. [6] provide an extensive survey of visual comparison techniques classified into three categories and combinations thereof: juxtaposition, superposition, and explicit encoding. This characterization was used as a framework for exploring designs for visualizing comparison results. Though many visualization tools have been designed for event sequence visualization [1, 7], there has been little research on visualizing event sequence comparison until recently. Zhao et al. [8] design MatrixWave, a visualization designed to compare the flow of users in clickstream datasets. MatrixWave focuses on differences in the occurrence of immediate, pairwise steps in the event stream, whereas CoCo generalizes to differences in single events and sequences of any length, as well as differences dealing with time.

Besides finding differences in datasets, event sequence comparison has been explored in the context of finding similarities. Vrotsou et al. [9] introduce a set of event sequence similarity measures. They explore using visualization and interactive data mining to cluster similar groups of event sequences. While CoCo focuses on difference metrics, this work can be extended to applicable similarity measures.
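To make the notion of an event sequence similarity measure concrete, the sketch below scores two event sequences with a normalized edit distance. This is a deliberately generic, hypothetical example in Python with invented event names; it is not one of Vrotsou et al.'s measures and is not part of CoCo, which focuses on difference metrics.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance between two event sequences (lists of event names)."""
    prev = list(range(len(b) + 1))
    for i, ev_a in enumerate(a, start=1):
        cur = [i]
        for j, ev_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                      # delete ev_a
                           cur[j - 1] + 1,                   # insert ev_b
                           prev[j - 1] + (ev_a != ev_b)))    # substitute
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalize to [0, 1]: identical sequences score 1.0."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

# Hypothetical patient histories differing by one transfer event.
print(similarity(["ER", "Floor", "ICU", "Die"], ["ER", "ICU", "Die"]))  # 0.75
```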
2.1.3 Event Sequence Comparison

Solutions for comparing sequential data have been explored in many different fields, including comparative genomics, text mining, and tree comparison. They are discussed here in the context of event history data and discrete-time models [10].

I draw first on methods to compare collections of general sequences without the notion of time, most notably from the fields of comparative genomics and text mining, where the data is ordered with respect to some index [11]. Genome browsers [12–17] have been developed to visualize genome sequences. They compare genomes by visualizing the position of each nucleotide, and consider a genome as a long and linear sequence of nucleotides. Scientists also compare genomes at the gene level. However, most of the existing tools are able to compare either only the similarities or only the differences of collections of gene sequences. For example, MizBee [18] (Figure 2.3) measures the similarity between genomes by visualizing the regions of shared sequences. Variant View [19] (Figure 2.4), cBio [20], and MuSiC [21] only support displaying sequence variants. Further, genome sequences are often compared as a sequence of linear positions, which does not lend itself to distinctions between point events versus interval durations.

Figure 2.3: MizBee measures the similarity between genomes by visualizing regions of shared sequences.

Figure 2.4: Variant View is a genome browser tool that aligns sequences by similarity.

Texts are often compared by extraction of frequent n-grams [22]. FeatureLens (Figure 2.5) by Don et al. [23] defined an n-gram as a contiguous sequence of words and used a visualization approach to compare the co-occurrences of frequent n-grams of text. However, it only supports comparison among sections of a single document. Jankowska et al. [24] proposed to convert documents into vectors of frequent character n-grams and designed a relative n-gram signature to encode the distance between n-gram vectors. Viégas et al. presented History Flow [25] (Figure 2.6) to visually compare versions of a document. Their approach assumes that the later version of a document is developed based on the earlier one, which is not applicable to event history data.

Figure 2.5: FeatureLens visualizes frequent patterns in text collections.

Figure 2.6: History Flow visualizes changes between versions of the same document.

Most of the techniques mentioned above (in both genomics and text mining) only provide a visual comparison between single long sequences, whereas event history data consists of many short, transactional sequences.

Temporal event sequences are often represented as trees. While many comparison techniques exist for trees, many do not take into account values or attributes of nodes, and none are specifically designed for temporal data. Munzner presented the TreeJuxtaposer [26] (Figure 2.7) system to help biologists explore structural details of phylogenetic trees, but it focuses only on structural differences in the trees and not on any attributes about the nodes (such as timestamps). Bremm [27] studied the comparison of phylogenetic trees in a more statistical way by extending the algorithms of TreeJuxtaposer to compare more than two trees, and considers "edge length", which could be generalized to durations of gaps between sequential events. Holten [28] presented an interactive visualization method to compare different versions of hierarchically organized data.
He proposed two methods of tree comparison, icicle plots and hierarchical sorting, but does not propose any statistical comparison technique, and focuses more on "leaf-to-leaf" matching, which considers whole paths (or sequences) only. TreeVersity2 [29] (Figure 2.8) compares trees by structure and by node values. Though TreeVersity2 is general to all trees, it leaves out temporal-specific analysis such as the duration of or between interval events. TreeVersity2 compares two datasets over time, but assumes these time periods are disjoint. CoCo does not assume that the datasets are split by a time attribute and treats the time of the nodes as another comparable attribute in the dataset. TreeVersity2 also includes a textual reporting tool that highlights outliers in the data.

Figure 2.7: The TreeJuxtaposer system helps biologists explore structural details of phylogenetic trees; it focuses only on structural differences in the trees and not on any attributes about the nodes (such as timestamps).

Figure 2.8: TreeVersity visually compares trees with similar structures.

Many of these comparison techniques also lack a statistical significance test for the comparisons. In this work, the comparison supports both visual and statistical approaches.

2.2 Statistics for Comparing Cohorts

In medical cohort studies, the most prevalent approach for comparison is survival analysis. In survival analysis, survival time is defined as the time from a defined point to the occurrence of a given event [30], and the Kaplan-Meier method is often used to analyze the survival time of patients on different treatments and to compare their risks of death [30–33]. Based on the Kaplan-Meier estimate, the survival time of two groups of patients can be visualized (Figure 2.9) and compared with survival curves, which plot the cumulative proportion surviving against the survival times [30]. Also, the log-rank test is often used to statistically compare two survival curves by testing the null hypothesis [30]. Dupont et al. applied survival analysis in their clinical study [32]. Compared with survival analysis, the event sequence data used in this work are much more complicated and require a more advanced analysis model.

Figure 2.9: The Kaplan-Meier estimator is used to compare the survival rates of groups of patients receiving different treatments. The estimator shows the maximum possible likelihood of survival (as a percentage) for each group as a function of time.

Currently, tools that combine visualization and statistics for medical cohort analysis focus on single cohorts. CAVA [34] (Figure 2.10) is a visualization tool for interactively refining cohorts and performing statistics on a single group. Recently, Oracle published a visualization tool for cohort studies [35]. Based on patients' clinical data, it supports interactive data exploration and provides statistics as well as visualization functionalities. These tools similarly focus on combining visualization with automated statistics and providing an interactive interface for selecting cohorts; however, both tools aim at grouping and identifying patient cohorts for further characterization, while my work focuses on comparing two existing cohorts based on their event histories.

Figure 2.10: CAVA combines visual analytics and statistics by allowing users to interactively refine cohorts and perform statistics on a single group.

2.3 Exploratory Hypothesis Testing

John Tukey describes statistical methods for summarizing data set characteristics in Exploratory Data Analysis [36], some of which are employed in CoCo. As event sequence datasets grow larger and larger, researchers are moving towards more exploratory methods for hypothesis generation and testing. The statistical implications of high-volume hypothesis testing (e.g., inevitable false positives) have been extensively researched [37, 38]. CoCo treats each result independently, leaving the application of statistical corrections for future research.

Liu et al. [39] explore the statistical and technical implications of automatically generating and testing many hypotheses. Similar to this work, they find that interactive techniques such as sorting and filtering are necessary for parsing these result sets, but their display is largely textual. This work explores more visual methods for displaying both the hypotheses and the results.

2.4 Temporal Data Mining

Automated hypothesis testing is closely related to big data mining. Previous work studying temporal data mining has mostly focused on discovering frequent temporal patterns and computing temporal abstractions of time-oriented data. Gupta et al. [40] provide a survey on outlier detection for temporal data sets.

There are many established algorithms for frequent sequence mining [41, 42] and association rule (itemset) mining [43]. The majority of data mining techniques focus on mining sequences in a single dataset and not on comparing across two datasets. While two data mining techniques can be used in tandem to facilitate similar comparisons (e.g., comparing frequent sequence results across two datasets), more specialized methods are needed to answer "which sequences occur significantly differently between these datasets?" Bay and Pazzani introduce contrast set mining [44], an algorithm for detecting differences between groups based on record attributes, such as age, gender, or occupation. In addition to record attributes, CoCo also looks at differences in event sequences, based on both occurrence and timestamps.

Pattern discovery is an open-ended problem which aims to unearth all patterns of interest [11]. Much of the literature is concerned with developing efficient algorithms to automatically discover frequent temporal patterns and extract temporal association rules [45–51]. To constrain the search procedure, some algorithms [45, 47] allow users to provide initial knowledge and rules. Many of the algorithms are generalized to any sequence of tokens; however, some tools [52] modify existing sequence mining algorithms to incorporate temporal attributes as well. To show the results, Norén et al. [53] used a graphical approach to visualize temporal associations.

Temporal abstraction focuses on obtaining a succinct and meaningful description of a time series [54]. Klimov et al. [55] developed VISITORS to visualize patient records by grouping the event attribute values at different temporal granularities. Moskovitch et al. [54] aggregated values of point data by state and trend to obtain an interval representation. Batal et al. [56] converted time series data into vectors of frequent patterns, which can be used with standard vector-based algorithms. However, most of the work in this topic focuses only on the time and value dimensions of an event category (a concept), which are considered event attributes in this work.

Typical data mining algorithms are a black box, allowing little user involvement during the process. Recent work has been done on interactive sequence mining [52, 57–59], though these systems focus primarily on mining frequent patterns in a single dataset. Little work has been done on involving the user in mining differences between datasets.

2.5 Scalability in Visual Analytics

Scalability in visual analytics has two main components: scaling of the visualization itself when displaying large amounts of data, and optimization of algorithms for processing and analyzing this data.

Approaches for visualizing a large volume of data include displaying only a sample of the data, providing interaction techniques to "drill down", or aggregating the display. Fisher et al. [60] simplify large data visualization by using random sampling to incrementally display results to users. EventFlow [1] and imMens [61] (Figure 2.11) use aggregation techniques to display large volumes of data.

Figure 2.11: imMens uses aggregation to scale to large datasets.

There has been less work on optimizing the computation portion of visual analytics. Stolper et al. [7] introduce Progressive Insights (Figure 2.12), which shows users in-progress visualizations so that they can guide the algorithm and ignore subspaces of the data that may not be relevant. In databases, multiple query optimization [62] is a technique that uses the results of previous queries to reduce execution time on future, related queries. However, a large part of this research falls under range queries, where the results of one query might be a subset of another.

Figure 2.12: Progressive Insights allows users to see in-progress visualizations in order to guide the algorithm and ignore subspaces of the data that may not be relevant.

2.6 Summary

This chapter covers the related work for cohort comparison. Event sequence cohort comparison lies at the intersection of event sequence visualization, statistical methods, exploratory hypothesis testing, temporal data mining, and scalability. Though much work has been done toward supporting cohort comparison with regard to visualizing single groups of event sequences and the visual comparison of complex objects, the areas of event sequence comparison and balancing automated hypothesis testing with an interactive user interface are largely unexplored.

Chapter 3: A Taxonomy of Metrics for Comparing Cohorts

The first phase of my research was to explore the space of temporal event sequence comparison and to identify what questions analysts were asking when performing cohort comparison. I conducted a literature review of seven case studies with EventFlow [63] and of current methods for cohort comparison. Overwhelmingly, there was a disconnect between the questions that were being asked and the answers existing tools provided.

The results suggested that some broad insights can be gained by visually comparing pairs of EventFlow displays (e.g., analysts could see if the patterns were very similar overall between two groups) or very different (e.g., a lot more red, or the most common patterns were different), but analysts repeatedly expressed the desire for more systematic ways to compare cohorts of records. However, existing tools were not designed to support exploration, but instead focused on answering a concrete hypothesis.
For instance, analysts were asking simply “What patterns lead to two different outcomes?” where as the tool supported simple yes or no queries such as, “Does XYZ lead to a specific outcome?” The who, what, when, and why of the inquiry was difficult for the analyst to explore. Following this observation, I looked at what insights analysts had discovered 28 through their use of the tools and discovered that a number of common patters of inquiry existed. The most commonly explored aspect of sequence comparison focuses on the structure of the sequences (e.g., order of consecutive events, co-occurrences of non-consecutive events) and the frequency of sequences. However, event and record attributes (e.g., gender of a patient) and the timestamps themselves (e.g., duration of an event) can also be distinguishing features between the cohorts. I constructed a taxonomy based on the observations made through the litera- ture review and by observing analysts with three overarching goals: (1) support more open-ended questions that answer the who, what, when, and why when comparing event sequences, (2) organize these questions in a way that promotes systematic exploration, and (3) provide a more holistic comparison, beyond looking at only the structure of the sequences. The taxonomy is organized in three parts: (1) summary metrics, (2) record metrics, and (3) event sequence metrics. Though this taxonomy can be applied to a variety of fields, the dataset used as an example for the remainder of this chapter consists of records of patients who were admitted to the emergency room and follows their movement through their stay at the hospital (Figure 3.1): being administered aspirin, being admitted into the hospital room, transferring between a normal floor bed and the intensive care unit (ICU), and ultimately being discharged either dead or alive. The dataset is split into two cohorts: patients who died and patients who lived. While this taxonomy is derived from numerous case studies in seven domains and aims to show the complexity and variety of questions asked during cohort com- 29 Figure 3.1: The dataset used as an example for the remainder of this chapter consists of records of patients who were admitted to the emergency room and follows their movement through their stay at the hospital: being administered aspirin, being admitted into the hospital room, transferring between a normal floor bed and the intensive care unit (ICU), and ultimately being discharged either dead or alive. parison, it can be expanded upon with the inclusion of more metrics that may be required for alternate domains and situations (e.g., similarity metrics or metrics dealing with the absence of events). 3.1 Summary Metrics Summary metrics deal with the cohorts as a whole and provide a high-level overview of the datasets. Number of records. Raw number of records in each cohort (Figure 3.2). Number of events. Raw number of events in each cohort (Figure 3.3). Number of unique records. Total number of unique records in each cohort based on the sequence of events (timestamps are not considered). 30 Figure 3.2: Because cohorts do not necessarily need to be the same size, it is im- portant to report on the number of records in each cohort. An understanding of the number of records allows analysts to understand broad trends between the cohorts (e.g., is the selection criteria balanced?) In this example, there are only 4 patients who died versus 6 who lived. Figure 3.3: The number of events is the raw number of events in each cohort. 
Combined with the number of records metric, this can reveal interesting information about the frequency of events and the average length of records. In this example, though there are 50% more patients who lived than those who died, the number of events is only 20% greater, indicating that patients who died have longer sequences, on average, than those who lived.

Number of each event. Total number of occurrences for each event category per cohort.

Minimum, Maximum, and Average length of records. The length of a record is considered to be the number of events in that record.

3.2 Record Metrics

Record-level attributes (such as patient gender or age) compare the cohorts as population statistics. Computing general statistics across an entire dataset is a problem already tackled by analytics tools such as Spotfire [64] or Tableau [65]; however, these tools look at a single attribute. For example, they might compare the number of males versus females, or patients on Wednesday versus Thursday. There may be implications about the combinations of record attributes (e.g., the women on Wednesday versus the women on Thursday versus the men on Wednesday versus the men on Thursday). In clinical trials, it is important that all patient attributes are balanced, and currently no tools exist for visually confirming that all attribute combinations are balanced (Figure 3.4).

Figure 3.4: Prevalence of record attributes reports on the percent of records that have a particular value. This example is comparing the proportion of male and female patients between the two groups.

3.3 Sequence Metrics

Sequence metrics deal with hypotheses at a sequence level and can refer to (1) the occurrence of sequences, (2) the timing of sequences, or (3) event-level attributes. Sequences are differentiated by type and can refer to any number of types:

Sequence. A record's entire history.
Subsequence. A consecutive part of a record, consisting of two or more events.
Event. A subsequence of length one, or a single event category.
Co-occurring pair. Two events that may occur non-consecutively within a single record.
Outcome. The last event in a record.

Many of the metrics can be applied to multiple sequence types, but not to all. For example, metrics dealing with event gaps can only be applied to sequences of length 2 (consecutive or non-consecutive). Table 3.1 shows which metrics are applicable to which sequence types, and the following sections describe each, organized by occurrence, time, and attribute metrics.

Table 3.1: This table shows the applicable metrics for each sequence type (denoted by an X). Metrics with shaded cells are those that were implemented in the final version of CoCo.

3.3.1 Occurrence Metrics

Prevalence of an event. The percent of records (or of total events) in which a particular event occurs (Figure 3.5). *Implemented in CoCo.

Prevalence of a subsequence. The percent of records in which the subsequence appears. For example, patients who lived are given aspirin before going to the emergency room more often than the patients who died (Figure 3.6). *Implemented in CoCo.

Prevalence of a whole sequence. Percent of records with a given sequence. *Implemented in CoCo. (A minimal computational sketch of these prevalence metrics appears after the figure captions below.)

Figure 3.5: The prevalence of an event is calculated as a percentage of records that contain that particular event.

Figure 3.6: The prevalence of a subsequence is calculated as a percentage of records that contain that particular subsequence.
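Since several of the occurrence metrics above reduce to "what percentage of records contain this pattern," the following minimal sketch shows how the prevalence of a consecutive subsequence might be computed for two cohorts. It is illustrative only; the toy records, event names, and helper functions are mine, not CoCo's.

```python
def contains_consecutive(sequence, pattern):
    """True if `pattern` occurs as a run of consecutive events in `sequence`."""
    n, m = len(sequence), len(pattern)
    return any(list(sequence[i:i + m]) == list(pattern) for i in range(n - m + 1))

def prevalence(cohort, pattern):
    """Percent of records in `cohort` (record ID -> list of event categories)
    whose event sequence contains `pattern` as consecutive events."""
    hits = sum(1 for events in cohort.values() if contains_consecutive(events, pattern))
    return 100.0 * hits / len(cohort)

# Hypothetical toy cohorts: patients who died vs. patients who lived.
died  = {1: ["ER", "ICU", "Discharge-Dead"],
         2: ["ER", "ICU", "Floor", "Discharge-Dead"]}
lived = {3: ["Aspirin", "ER", "Floor", "Discharge-Alive"],
         4: ["ER", "Floor", "Discharge-Alive"],
         5: ["Aspirin", "ER", "ICU", "Floor", "Discharge-Alive"]}

print(prevalence(died,  ["ER", "ICU"]))   # 100.0
print(prevalence(lived, ["ER", "ICU"]))   # 33.33...
```

The same record-counting step feeds the statistical comparison between cohorts, as sketched later in the backend description.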
35 Figure 3.7: Co-occurring events are a pair of events which occur within a single record, and may or may not have other events between them. Prevalence of Co-occurring Events. The percent of records containing both events A and B (with any number of events between them, Figure 3.7). *Imple- mented in CoCo. Prevalence of Outcomes. If a single event is prevalent as an “outcome” (i.e., the last event in the sequence). This metric in particular applies only to cohorts that are not already split on an outcome event. Frequency of an event. The number of times per record an event occurs. Be- cause this is a distributed numerical metric, the system can report on minimum, maximum, mean, median, mode, and the distribution as a histogram. *Implemented in CoCo. Order of consecutive events in a subsequence. The percent of records con- taining event A directly preceding event B versus B preceding A. For example, perhaps patients who go to the ICU before the floor are more likely to live than 36 Figure 3.8: Absolute time metrics look at the timestamp of a particular event. For example, the prevalence of the day of the week can differ between the two cohorts. patients who have these events in the reverse order. *Implemented in CoCo. 3.3.2 Time Metrics Time metrics deal with the timestamps at both the event and sequence levels – relative and absolute. All of these metrics result in distributed numerical values, so the system can report on minimum, maximum, mean, median, mode, and the distribution as a histogram for each. Cyclicity The time between repeat occurrences of a sequence. Gap The gap between two events. Duration The duration a sequence takes to complete. Absolute time of an event. Prevalence of a particular timestamp of an event or multiple events (e.g., if all events in one cohort occurred on the same day, Figure 3.8). 37 Duration from a fixed point in time. The length of time from a user-specified, fixed point – aligned by either a selected event or absolute date-time. Duration of interval events. The duration of a particular interval event. For example, this can be the length of exposure to a treatment or the duration of a prescription. Duration of a subsequence. The length of time from the beginning of the first event in a subsequence to the end of the last event in the subsequence. Duration of overlap in interval events. The overlap (or lack thereof) of inter- val events. For example, the overlap of Drug A and Drug B could be more common in the cohort of patients who lived versus those who died. Event Gap between consecutive events. The time between the end of one event and the beginning of the next. For example, the average length of time between hospital patients entering the emergency room and being transferred to the ICU is under two hours in patients who lived and over two hours in those who died. *Implemented in CoCo. Event Gap between co-occurring (non-consecutive) events. The length of time between non-consecutive events (two events with some number of other events occurring between them, Figure 3.10). *Implemented in CoCo. Cyclic events. The duration between cyclic events and sequences. 38 Figure 3.9: Relative time metrics involve comparing the average gap between two consecutive events. Figure 3.10: Relative time metrics involve comparing the average gap between two events. 39 Figure 3.11: Attribute metrics are similar to other event metrics, but the events are further broken down by the attribute’s value. 
In this example, the doctor who is on call when the patient arrives at the emergency room is noted. Dr. Smith was on call more often for patients who lived than for those who didn't.

3.3.3 Event Attribute Metrics

Any of the above metrics can be applied over the values of an event attribute instead of the event category itself. This can be done by replacing an event category with the values of a particular attribute. For example, in a medical dataset, analysts might be interested in seeing how a particular emergency room doctor might be related to the outcome of a patient. Analysts would then replace all "Emergency" events with the value of their "doctor" attribute. If there are three doctors, this creates three new pseudo-event categories. Analysts can use the metrics from above to see the difference in event sequences, times, or prevalence of each doctor in either cohort (Figure 3.11).

3.4 Combining Metrics

The number of metrics is further multiplied because any combination of the above metrics is a new metric.

Survivor analysis. Survivor analysis is a common metric in cohort comparison studies in the medical field, for understanding how an event or sequence occurs or diminishes over time. This is equivalent to combining prevalence with time: how the prevalence of an event changes over time.

3.5 Summary

This chapter presents Contribution 1: a taxonomy of metrics for comparing cohorts of event sequences. Although comparing two groups of data is a common task, with temporal event sequence data in particular the task becomes complex because of the variety of ways the cohorts, sequences (entire records), subsequences (a subset of events in a record), and events can differ. Through a literature review of seven case studies of EventFlow and an evaluation of cohort comparison, I work to understand how analysts perform cohort comparison and categorize their common questions. Though much work has been done in differentiating between the structure of the event sequences (e.g., order, co-occurrences, or frequencies of events), many analysts and tools miss the opportunity to explore the attributes of the events and records (e.g., gender of a patient), and the timestamps themselves (e.g., an event's duration), as distinguishing features between the cohorts. In this chapter, I present a taxonomy of 23 metrics describing how cohorts can differ, organized by cohort summary metrics, event sequences, and record attributes. While this taxonomy aims to be a holistic view of the cohort comparison space, there is the potential for expansion to many more metrics not mentioned here (Section 7.2). This taxonomy serves to illustrate the complexity of questions that are possible when comparing cohorts of event sequences and to demonstrate that these questions can be asked systematically.

Chapter 4: Statistical Framework for High-Volume Hypothesis Testing

In any form of high-volume data analysis, wait times are a given, but this problem is especially prevalent when dealing with groups of event sequences because of the exponential number of unique sequences that exist in a single dataset. Consider the simple case of a dataset with only two events: A and B. Without considering repetitions, there are 5 unique event sequences that can occur: A; B; A → B; B → A; and AB, where AB represents two events occurring concurrently (at the same timestamp). When allowing repetition, the number of event sequences becomes infinite: A → A; B → B; A → A → B; AB → B; and so on.
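To make the size of this space concrete, the brief sketch below (purely illustrative and not part of CoCo; all names are my own) enumerates every non-repeating sequence over a small alphabet, treating events that share a timestamp as one concurrent group. It reproduces the count of 5 for two events and shows how quickly the count grows.

```python
from itertools import combinations, permutations

def grouped_orderings(events):
    """Yield every way to arrange the given distinct events into an ordered
    sequence of non-empty concurrent groups (ordered set partitions)."""
    seen = set()
    for perm in permutations(list(events)):
        n = len(perm)
        for mask in range(2 ** (n - 1)):       # each bit = a break between positions
            groups, current = [], [perm[0]]
            for i in range(1, n):
                if mask & (1 << (i - 1)):
                    groups.append(frozenset(current))
                    current = [perm[i]]
                else:
                    current.append(perm[i])    # same group = concurrent events
            groups.append(frozenset(current))
            key = tuple(groups)
            if key not in seen:
                seen.add(key)
                yield groups

def count_unique_sequences(alphabet):
    """Count non-repeating event sequences built from any non-empty subset."""
    total = 0
    for r in range(1, len(alphabet) + 1):
        for subset in combinations(alphabet, r):
            total += sum(1 for _ in grouped_orderings(subset))
    return total

print(count_unique_sequences(["A", "B"]))        # 5: A, B, A->B, B->A, concurrent AB
print(count_unique_sequences(["A", "B", "C"]))   # 25, and growth accelerates from here
```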
Further, each event sequence can have multiple metrics applied to it. For ex- ample, with the sequence A → B, we can consider the prevalence among records (i.e., percent of records containing this sequence), frequency (i.e., average number of occurrences per record, duration (i.e., average time from A to B). When comparing cohorts, the application of each of these metrics to each cohort is equivalent to a 43 Figure 4.1: A chart of the average runtime to find all sequences (a) versus the number of unique sequences and (b) versus the number of records. Finding all subsequences within both datasets grows proportionally with the number of records in the dataset, whereas the number of unique sequences has no effect. hypothesis. Does A → B occur similarly in both cohorts, or does it occur signif- icantly more in one than the other? Is the duration of A → B the same in both cohorts, or is it longer in one than the other? Thus, a simple dataset with only five event categories can have hundreds of hypotheses applied to it and larger datasets quickly become challenging to process. To provide a sense of timing, I conducted timing tests with the final version of CoCo with datasets of varying numbers of records (250, 500, 1000, 1500, 3000, 6000, 9000, 18,000, 36,000, and 72,000) and numbers of unique sequences (250, 500, 1000, 1500, 3000) in each cohort. I collected the runtimes for 100 runs each of finding all sequences within the datasets and calculated all the metrics. All tests were performed on a machine with an 2.2 GHz Intel Core i7 processor with 8 GB of memory. No multithreading was used. 44 Figure 4.2: A chart of the average runtime to calculate all hypotheses (a) versus the number of unique sequences and (b) versus the number of records. Calculating all hypotheses depends both on the number of records and the number of unique sequences in the dataset. Figure 4.1 provides a chart of the average runtime to find all sequences (a) versus the number of unique sequences and (b) versus the number of records. Find- ing all subsequences within both datasets grew proportionally with the number of records in the dataset, with the largest dataset (9000 records in each cohort) taking about 6.4 seconds to complete, regardless of the number of unique sequences. The number of unique sequences (and therefore, the number of events) did not have an effect on the time to find all sequences, because all sequences are mined based on the existing patterns in the dataset. That is, all sequences that are mined are a subsequence of each record as a whole, so every record must be checked. Figure 4.2 provides the results for the average runtimes to calculate all hy- potheses (a) versus the number of unique sequences and (b) versus the number of records. Calculating the metrics scaled proportionally with the number of unique sequences found, more rapidly than linearly. This is due to the fact that as for each 45 new unique sequence, the number of new subsequences is potentially more than one, so every new sequence introduces at least two new subsequences. The effect is fur- ther multiplied by applying numerous metrics to a single sequence (e.g., prevalence AND time). Overall, the time to calculate hypotheses took much longer than the time to find all subsequences, with the largest dataset (9000 records and 3000 unique sequences in each cohort) taking over 27 seconds to calculate, due to the fact that approximating p-values requires the most time. 
Because calculating the p-values is the most time-consuming aspect of hypothesis calculation, it becomes important to present analysts with methods for reducing wait times, on both the front- and back-ends. Through case studies with users, I identified five scalability guidelines for extending high-volume hypothesis testing to large datasets (Table 4.1).

Table 4.1: The 5 Scalability Guidelines for extending high-volume hypothesis testing to large datasets.

In this chapter I describe the implementation of CoCo's statistical framework and how the components in its web-based client-server system are designed and organized. Lastly, I present five guidelines for performing high-volume hypothesis testing on event sequences that address these issues.

4.1 System Overview: Backend

CoCo is a web application that uses the client-server model to divide the frontend and the backend. This section provides an overview of the architecture used in the backend, which was written in Python 2.7. Python was chosen for its extensive availability of packages, support for fast development, and flexible, multi-paradigm nature. Flask [66] was chosen as the webserver framework for being easy to install and lightweight. For computing statistics, the well-known package SciPy [67] was used.

4.1.1 Code Structure and Organization

CoCo is organized into four main packages: (1) the main controller, (2) data processing, (3) metric computation, and (4) utilities. Figure 4.3 lays out each of the packages and the classes they contain.

Figure 4.3: Code structure and organization.

1. Main controller. The main controller contains the necessary elements for setting up the main server, providing web endpoints for the frontend, and housing all variables needed on the server side. All data is kept in memory with simple objects and native Python data structures, though it is organized such that future work may adapt it to a relational database backend (Section 7.2.4). A database backend was not implemented in the prototype in order to minimize the number of package dependencies when case study partners were installing CoCo on their machines. Additionally, the use of native Python lists and dictionaries allowed for fluid data transfer between the JavaScript-based frontend and the Python-based backend.

Variables. Each cohort is represented by a global variable, named alpha and beta respectively, holding a dictionary that maps record IDs to Record objects. The dictionary allows for rapid lookup of particular records, as well as quick collection of the records themselves. An attribute dictionary maps event attribute keys to a set of event attribute values. The record attribute list maps record attribute keys to another dictionary, one per cohort; that second dictionary maps the record ID to the record's attribute value. This allows for easy look-up by attribute, attribute value, cohort, and specific record. The event legend is a set of all event categories found in the datasets. The sequence counts for every sequence are stored in a list because they are the most accessed metric category. This data structure is used by the sequence scatterplot, as well as to calculate all prevalence metrics, so it was important to store this data after it is calculated once. Lastly, the metrics object is a dictionary which defines metrics by category and granularity. Each metric stores its title (e.g., "most differentiating events"), a description which is used in the tooltip, and its results, which are empty to start.
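A minimal sketch of how these in-memory structures might be laid out in plain Python containers follows; the variable names follow the description above, but the exact nesting and example values are assumptions.

```python
# Sketch only: shapes follow the textual description, not CoCo's actual source.

alpha = {}   # cohort A: record ID -> Record object
beta  = {}   # cohort B: record ID -> Record object

event_attributes = {"doctor": {"Smith", "Jones"}}          # attribute key -> set of values

record_attributes = {                                       # attribute key -> per-cohort map
    "gender": {"alpha": {101: "F", 102: "M"},
               "beta":  {201: "F"}},
}

event_legend = {"ER", "ICU", "Floor", "Discharge"}          # all event categories seen

sequence_counts = [                                         # reused by scatterplot and prevalence
    (("ER", "ICU"), 40, 25),                                # (sequence, count in alpha, count in beta)
]

metrics = {
    "prevalence": {
        "events": {
            "title": "most differentiating events",
            "description": "Percent of records containing each event category.",
            "results": [],                                  # filled with (value A, value B, p-value)
        },
    },
    # ... "time", "frequency", and attribute categories follow the same shape
}
```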
Each result object contains a list of hypothesis results, each represented as a 3-tuple containing the value in cohort A, the value in cohort B, and the p-value.

Web endpoints. The CoCo backend has four main endpoints:

• / – The index, which renders the CoCo HTML template.
• /upload_files – This endpoint is called when the analysts select and upload their data files. It is responsible for parsing all the input files into their respective data structures.
• /get_sequence_counts – After the data files are loaded and parsed, the main JavaScript controller accesses this endpoint to retrieve the data for the sequence scatterplot.
• /calculate_metrics – This endpoint calls all necessary methods to automatically compute all hypotheses.

Other endpoints are used for shutting down the system cleanly (/shutdown) and for serving and streaming server-sent events (/stream).

2. Data processing. This package contains classes dealing with processing the data in CoCo, including loading files and preparing data for visualizations.

File processing. Within this package are methods for processing the files inputted by the user. Processing includes reading the data file, creating the cohort objects, assigning which cohort is which (left vs. right cohort), parsing the file name and giving it a human-readable name, and processing any attribute and configuration files, if applicable.

Sequence extraction. This package computes the counts of all the sequences and subsequences of length 1 to 10, and records how many times each sequence appears in each cohort record. The result is provided to the sequence scatterplot visualization.

EventFlow tree builder. This package processes data for the EventFlow visualization and returns an object that can be displayed.

3. Metric Computation. All methods dealing with computing metrics are grouped into this package, and organized by metric type: prevalence, time, frequency, record attributes, and event attributes.

Prevalence. There are three main methods for computing prevalence: consecutive sequences, single events, and non-consecutive pairs. This class also contains many helper functions which are used for counting each of these types of sequences in each dataset. All prevalence metrics are compared using a chi-squared test.

Time. This package calculates metrics dealing with time, using a Wilcoxon rank-sum test to compare means.

Frequency. This package calculates metrics dealing with the frequency of events, using a Wilcoxon rank-sum test to compare means.

Attributes. This package uses chi-squared tests to run metrics dealing with record and event attributes. (An illustrative sketch of these tests appears at the end of this subsection.)

4. Utilities.

Models. CoCo defines two simple classes, as explained in the previous section: Event and Record. The Event class contains:

• event – the event category,
• time – the timestamp of the event,
• recordID – the ID of the record which this event is a part of, and
• attributes – a dictionary mapping event attributes to values.

Record contains:

• recordID – the record's unique identifier,
• eventList – a nested list of Event objects, sorted by time. Each new event is inserted into its sorted position. Each inner list is a group of events with the same timestamp, which represents concurrent events, and
• attributes – a dictionary mapping record attributes to values.

Helpers. This package contains various helper functions that are used throughout the system.

Messaging. This package contains the class and methods for creating and sending server-sent events (SSEs), which allow the server to communicate with the frontend.
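The sketch below illustrates how one prevalence hypothesis and one time hypothesis could be issued with SciPy. It is a minimal example consistent with the tests named above, not CoCo's actual code; the function names, the 2x2 contingency-table layout, and the example numbers are my assumptions.

```python
from scipy.stats import chi2_contingency, ranksums

def prevalence_test(hits_a, n_a, hits_b, n_b):
    """Chi-squared test on a 2x2 table: records containing the sequence vs. not,
    for cohorts A and B. Returns a (value A, value B, p-value) result tuple."""
    table = [[hits_a, n_a - hits_a],
             [hits_b, n_b - hits_b]]
    _, p_value, _, _ = chi2_contingency(table)
    return 100.0 * hits_a / n_a, 100.0 * hits_b / n_b, p_value

def time_test(gaps_a, gaps_b):
    """Wilcoxon rank-sum test comparing two distributions of gaps or durations."""
    _, p_value = ranksums(gaps_a, gaps_b)
    mean = lambda xs: sum(xs) / float(len(xs))
    return mean(gaps_a), mean(gaps_b), p_value

# e.g., "ER -> ICU" appears in 40 of 80 records in cohort A and 25 of 120 in cohort B
print(prevalence_test(40, 80, 25, 120))
# e.g., gaps (in minutes) between ER arrival and ICU transfer in each cohort
print(time_test([95, 110, 150, 80], [130, 160, 200, 145, 180]))
```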
4.1.2 Data Processing Pipeline This section presents an overview of the data processing pipeline for CoCo (Figure 4.4). CoCo reads in two files (one for each cohort) in the same format as EventFlow: 5-column, tab-delimited text file where each column is as follows: 1. Record ID. The ID of the record to which the event belongs. Each record 52 Figure 4.4: CoCo data processining pipeline. CoCo processes data in five major steps: (1) Analysts select two datasets from the interface. (2) The data files are sent to the server. (3) Sequences and counts are extracted. (4) The results for the sequence counts are sent back to the client. (5) CoCo begins (a) calculating metric results and (b) sends them back as they are completed, until all metrics have been calculated. 53 should have a unique ID. Record IDs do not have to be unique between the two cohorts. 2. Event Category. The type of event. 3. Start Time. The start timestamp of the event. 4. End Time (optional). The end time of an interval event, left blank if a point event. 5. Attribute List (optional). Attributes for this event. Each attribute is semi- colon separated and defined as “attribute=value” After the analysts load both datasets, CoCo identifies all sequences of lengths 1 to 10. The sequences are mined using n-grams. Next, all pairs of non-consecutive events are found in each sequence. The sequences are stored in a dictionary that also counts the number of times each sequence occurs in the datasets. Next, the metrics are applied to each of the sequences. As each metric is com- pleted, the server returns the set of results to the frontend using Server-Sent Events (SSEs). Though the result list is not shown until all metrics have finished calcu- lating, a table shows progress of which metrics have been calculated, and overview visualizations are shown. 4.2 Guidelines for Scaling HVHT to Large Event Sequence Datasets Due to the complexity of mining sequences in multiple event sequence cohorts and running hypothesis tests on all of them, many challenges arise dealing with wait 54 times, result set size, and statistical errors. In this section, I describe the five major guidelines for scaling a statistical framework to larger datasets. Guideline 1: Reduce wait times during computation. I applied two different methods for computing the hypothesis tests: (1) per- forming all calculations ahead of time and providing results only when all results are complete, and (2) calculating hypothesis tests by category (e.g., single event frequency, sequence frequency, time gaps, etc) and allowing analysts to see results as they are available. The first method resulted in long wait times, but allowed the results to be ranked in a more meaningful way. That is, by waiting for all results to be completed, the most “differentiating” or significant results can be displayed first, thus offering more guidance to the analysts about which results are important. The second method allows analysts to see partial results as soon as they are ready. When a metric is fully calculated, the analysts can select that metric to see all results in that category. In early case studies, this enabled analysts to narrow their focus, although they found that they weren’t necessarily interested in specific metrics, just the most major differences – regardless of metric type. 
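As a sketch of how this second method might be wired up, the fragment below streams each metric category back to the browser over the /stream server-sent-events endpoint as soon as it finishes. The category names and the run_metric placeholder are hypothetical; only the overall pattern follows the description above.

```python
import json
from flask import Flask, Response

app = Flask(__name__)

METRIC_CATEGORIES = ["prevalence-events", "frequency-events",
                     "time-consecutive-pairs", "prevalence-subsequences"]

def run_metric(category):
    """Placeholder for the real per-category hypothesis computation."""
    return []

@app.route("/stream")
def stream():
    def generate():
        for category in METRIC_CATEGORIES:
            results = run_metric(category)
            # One SSE message per completed category lets the frontend update its
            # progress table and show partial results without waiting for the rest.
            yield "data: %s\n\n" % json.dumps({"metric": category, "results": results})
    return Response(generate(), mimetype="text/event-stream")
```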
The metrics were structured to calculate the simplest metrics first, e.g., single-event metrics, which enabled analysts to understand their datasets at a broader level before going into detailed sequence metrics. Additionally, the metrics should be calculated in an order conducive to both the analysis process and overall computation time. Through case studies, users' common process models were observed. Combined with timing tests to determine which results are quickest to compute, a recommended computation order is suggested:

• Prevalence – Events
• Prevalence – Non-consecutive pairs
• Time – Consecutive pairs
• Time – Non-consecutive pairs
• Frequency – Events
• Prevalence – Consecutive subsequences
• Record attributes – Prevalence

Guideline 2: Reduce time testing all hypotheses.

Long wait times can cause an analyst to lose concentration and incur more time recalling their task. In an effort to minimize long waits, I implemented a sequence length limit on the sequence mining step in the CoCo pipeline. The original version of CoCo counted every sequence that appeared in the loaded datasets. However, in the weblog clickstream data, there were some records that had as many as 320 events. Based on observations from the previous three case studies, the analysts often did not look at results for sequences of length greater than four. Typically, longer sequences were more obscure and analysts were not able to derive meaningful insights from them. Sequences are limited to length 10, which is long enough for the longer sequences found in clickstreams. In using this new limited version, analysts still looked mostly at sequences of length 4 or 5 at most, so there was no need to extend the range beyond length 10. The limit was not further reduced because performance at this stage was reasonable. Limiting the sequence length offered a speed-up of about 15x. If future datasets would benefit from a shorter limit, I leave determining the ideal limit to future work.

Guideline 3: Minimize data sizes when transferring to the browser.

Larger datasets require more hypotheses to be tested, and thus larger result sets to return to the browser. Aside from the computation time, this results in a much larger space requirement. Due to some browser limitations, it is not possible to send data over a certain size. Thus, to reduce the volume of the result set, those sequences that occur in less than 1% of the records are automatically filtered. Additional hypothesis results can be loaded on demand.

Guideline 4: Highlight chance of false positives.

The potential for false positives is highlighted by providing the distribution of p-values to the analysts in a filterable table. Two statistical experts whom I consulted suggested this, because with any statistical test that is applied many times to a single dataset, there is some likelihood of false positives. By providing the analysts with the distribution of the resulting p-values, the analysts can see whether the actual distribution of p-values is what would be expected by random chance or whether it is in fact affected by the content of the dataset. The chance of statistical uncertainty is further highlighted by providing context to the analysts about each result, for example by providing related statistical results for the same sequence and showing the prevalence of its subsequences. Lastly, we place an emphasis on effect size, rather than the p-value, by primarily showing and sorting by the difference and grouping the p-values into broad ranges.
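As a minimal illustration of this guideline (the helper name and the toy p-values below are mine, not CoCo's), the following sketch groups a list of p-values into the three ranges used in CoCo and reports how many nominally significant results would be expected by chance alone.

```python
def p_value_summary(p_values, alpha=0.05):
    """Bucket p-values into CoCo's three ranges and estimate chance findings."""
    buckets = {"<=0.01": 0, "<=0.05": 0, ">0.05": 0}
    for p in p_values:
        if p <= 0.01:
            buckets["<=0.01"] += 1
        elif p <= 0.05:
            buckets["<=0.05"] += 1
        else:
            buckets[">0.05"] += 1
    # Under the null hypothesis p-values are roughly uniform, so about alpha * n
    # of the tests are expected to look significant purely by chance.
    expected_by_chance = alpha * len(p_values)
    return buckets, expected_by_chance

observed, expected = p_value_summary([0.004, 0.03, 0.2, 0.6, 0.8, 0.047, 0.5, 0.9])
print(observed)   # {'<=0.01': 1, '<=0.05': 2, '>0.05': 5}
print(expected)   # 0.4 significant-looking results expected by chance among 8 tests
```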
Guideline 5: Enable filtering of event categories unimportant to anal- ysis. By default, CoCo starts by showing only the results for single event categories (sequences length 1) so analysts can make informed decisions about which events occur frequently and which might be important to the analysis. After determining if any events can be dismissed, analysts can filter out those events that they deem unrelated or unimportant to their questions. In our example case study, the analysts were able to reduce their number of events from over 100 events to under 20. 4.3 Summary This chapter presents Contribution 2: a statistical framework for performing high-volume hypothesis testing when comparing cohorts of event sequences. The nature of event sequences present unique challenges, including an exponential vol- ume of sequences, increased chance of false positives and statistical errors, and long 58 wait times for analysts to begin analysis. Towards addressing these challenges, I designed a system to better support statistical event sequence comparison tasks and presented an overview of its implementation (Section 4.1). Through case studies, I confirmed the usefulness of these techniques and present the lessons learned as five guidelines for scaling HVHT to large datasets (Section 4.2). 59 Chapter 5: Design of CoCo: Frontend After running thousands of hypothesis tests, analysts must then be able to parse through the large result set. In doing so, there are two main challenges: 1) the sheer volume of the result set makes it difficult to identify meaningful and significant results, and 2) when running a large number of statistical tests on a single dataset, the chance of false positives increases. Through a user study and ten case studies, I aimed to solve these challenges and understand how to leverage the benefits of user-guided exploration to parse high-volume hypothesis results. To develop the initial design and icons, I conducted interviews with three analysts experienced with event sequence visualization: a medical researcher from a local hospital, a graduate student at the University of Maryland, and a business school professor. All had used EventFlow [1] extensively and had active research projects comparing cohorts of patients. An initial version was implemented. After the three analysts had used the initial version with their own data and analytic goals, I interviewed them during a period of a month to collect feedback on the benefits and pitfalls of the initial version, and analysts’ needs when reviewing hypothesis results. Feedback was also collected from a eighteen other short-term detailed demonstrations, some of which lead to long-term case studies. 60 I distill lessons learned into seven design guidelines for balancing automated high-volume hypothesis testing with integrated visualization and interaction (Ta- ble 5.1). Table 5.1: The 7 Design Guidelines for balancing automated high-volume hypothesis testing with integrated visualization and interaction (Section 5.2). In this chapter, I provide an overview of the final interface of CoCo. From there, I describe seven design guidelines learned from the design process and the rationale behind the design decisions that led to CoCo’s final design. Appendix 8 provides details on previous versions of CoCo and the changes between each iterta- tion. 5.1 Description of the User Interface CoCo (Figure 5.1) is comprised of five main panels: sequence scattergram, cohort overviews, result filters, results panel, and sequence details. 
61 Figure 5.1: CoCo is comprised of five main panels: sequence scattergram and filters, cohort overviews, result filters, results panel, and sequence details. 5.1.1 Sequence Scattergram and Sequence Filters The first panel provides an overview of all the sequences in the dataset, as well as a method for filtering the sequences by length and by type (consecutive, non-consecutive). Each dot represents a sequence that is found in the datasets and is placed on the axes according to the number of records that contain that sequence in each cohort. Sequences of length 1 are single event categories. A consecutive sequence is one that occurs in the dataset with no events between them, and a non- consecutive sequence may contain extra events within it. Consecutive sequences are indicated with a solid black circle and non-consecutive sequences are represented by a circle divided by a white line. Analysts can filter the results based on the sample 62 sizes to exclude rows with very low or very high sample sizes. This can be used as a method for quality control (e.g., removing results with an insufficient sample size) or as a method for segmenting the results into more manageable pieces. For example, analysts may want to evaluate more frequent sequences first (e.g., sequences with 50% or more record coverage), before moving viewing less frequent sequences. 5.1.2 Cohort Overviews The second panel provides a high-level overview of the sequences contained in each cohort using an EventFlow [1] display. The heights of the cohorts are adjusted in proportion with the number of records in each dataset (e.g., the dataset with more records will take up more vertical space). 5.1.3 Result Filters The third panel (Figure 5.2) provides methods for filtering, sorting, and cor- recting the hypothesis test results. Results can be filtered by two ways using a table that shows the number of hypotheses that were tested according to metric and sequence type. Analysts can choose to see only time, frequency, or prevalence metrics. Similarly, analysts might be interested in only single events, whole record histories, or partial subsequences. The table further breaks down the results based on p-value into three groups: ≤ 0.01, ≤ 0.05, and > 0.05. Each cell contains the total number of hypotheses currently shown out of the total number of hypotheses testing for that metric, sequence type, 63 Figure 5.2: Methods for sorting and filtering the result set. Results can be filtered by two ways using a table that shows the number of hypotheses that were tested according to metric and sequence type. The table further breaks down filtering the results based on p-value into three groups: ≤ 0.01, ≤ 0.05, and > 0.05. 64 and p-value group. Analysts can sort the results based on what they find most important: • Ratio and significance. Sort first by the significance level (p-value in three groups: ≤ 0.01, ≤ 0.05, and > 0.05) then within each group, by magnitude of the difference (descending). This is the default sorting option. • Significance only. Sorted by the raw p-value (descending). • Ratio only. Sorted by the absolute ratio (ascending or descending) • Alpha value. Sorted by the absolute value in alpha (descending). • Beta value. Sort by the absolute value in beta (descending). Lastly, analysts can apply a Bonferroni correction [68] to the results. 5.1.4 Results Panel The main results panel (Figure 5.3) displays all the results of the hypothesis tests according to the sorting and filtering preferences set by the analyst. 
To the left is a legend which each event category that is found in the dataset, assigned a color. Each result is encoded as a row, where the center shows the hypothesis that was tested. Colored bars in the center indicate the sequence that the hypothesis refers to and the icons to the left indicate the corresponding metric. Depending on the value of the result, a bar grows out from the center in the direction where the value is larger, on a ratio scale. The bar is then colored by the p-value of the result: 65 Figure 5.3: The main results panel (Figure 5.3) displays all the results of the hypoth- esis tests according to the sorting and filtering preferences set by the analysts. To the left is a legend which each event category that is found in the dataset, assigned a color. Each result is encoded as a row, where the center shows the hypothesis that was tested. Colored bars in the center indicate the sequence that the hypothesis refers to and the icons to the left indicate the corresponding metric. Depending on the value of the result, a bar grows out from the center in the direction where the value is larger, on a ratio scale. The bar is then colored by the p-value of the result. • Black indicates a p-value ≤ 0.01. • Grey indicates a p-value ≤ 0.05. • White indicates p-value > 0.05. When reviewing a large list of results, it is unclear to analysts when everything has been reviewed, especially when they use filtering methods to view smaller pieces of the results at a time. A simple progress bar at the right of the results shows the analysts progress through the result set. It is a heatmap where each result is a 66 single line and color indicates: • Grey: result has been reviewed. • Red: result has been calculated and is not reviewed. To make it more obvious that the analysts has not missed potentially signif- icant results, CoCo also encodes the p-value using a colored border, matching the above p-value colors. The progress bar serves as the scrollbar and minimap for the result set. An- alysts can page through the data by scrolling along the progress bar. A thickened border indicates the portion of the data that is currently being viewed. The order in the progress bar matches the order of the detailed results and is determined by the analyst, based on the sort options provided. 5.1.5 Sequence Details Context is given using details on demand. Analysts are able to see the under- lying data for a selected result. Depending on the type of metric, analysts will see different information. Because metrics dealing with prevalence are only a matter of percentage, all this data is shown in the result snapshot and the details on demand don’t show any additional information. For metrics that show an average (e.g., all time metrics and frequency metrics), the details on demand show the exact distribu- tion for all values (Figure 5.9). Additionally, the details on demand show high-level statistics about the distribution: sample size (n), average, minimum, maximum, and standard deviation. 67 Figure 5.4: Analysts can view details about a result by clicking it. Results that cor- respond to comparing averages (such as average duration or average frequency) will show the distributions of all the values and statistics about the average, minimum, maximum, and standard deviation in both cohorts. 68 5.2 Design Guidelines for HVHT Visual Analytics Tools Guideline 1: Convey hypotheses succinctly. 
In an initial implementation, the LifeLines2 [69] triangle scheme was used to display event sequences and organized results by metric (e.g., all results dealing with the occurrence of sequences were grouped together; all results dealing with co- occurrences of events were grouped together, etc.). With this organization scheme, the metric selected by the analysts implied a lot about the sequences in its result set and all sequences looked identical. In feedback on this design, many analysts felt that only visualizing the sequence (with no indication of what the hypothesis was), was confusing and they would often have to remember which metric was selected. I conducted interviews with three domain experts to determine how to distin- guish between various event sequence features. In the interviews, each expert was asked how they would visually differentiate the following types of sequences and their properties: • Whole record sequence • Concurrent events • Consecutive vs. non-consecutive sequence Mockups of responses are shown in Figure 5.5. Analysts suggested differenti- ating whole record sequences (a) by adding markers indicating the beginning and end of the sequence, to signify no events occur before or after the sequence. Square 69 markers were chosen over the angled brackets to avoid ambiguity with the notion of a “set.” Analysts showed concurrent events (b) by either overlapping them or grouping them with a circle. The overlapping method was preferred because it was more compact. Analysts had more variation in how they chose to differentiate con- secutive (c) vs. non-consecutive (d) events. Two analysts chose to keep consecutive sequences the same, while differentiating non-consecutive sequences by placing a marker between events. The third analyst suggested the opposite: show no dif- ferentiation between non-consecutive events, but place a bar to join consecutively occurring events. In a later version, the event icons were changed from triangles to slim rectangles to conserve screen real estate (and give analysts the option to toggle between the two versions). The current scheme is shown in Figure 5.6. Guideline 2: Visualize statistical results and differences. In designing the result displays, the design needed to convey information about the difference in value (both magnitude and direction) and the statistical significance of the result. Additionally, color was already used to encode the event categories and needed to be avoided. The three methods of visual comparison outlined by Gleicher et al. [6] were tried to encode this data: juxtaposition, superposition, and explicit encoding. Figure 5.8 shows the designs that were considered. Juxtaposition (a) showed the absolute values in each cohort, and worked well for values that had a fixed range (e.g., percentages for 0% to 100%). However, it was not adaptable 70 for variable range values (e.g., time, where a difference can be as small as 1 minute or as large as 3 months) or for displaying time and prevalence metrics in the same view. It is also not ideal for scanning for differences easily because the difference is not explicitly encoded. Superposition (b) has the advantage of displaying the raw values and direction of difference more clearly, but had similar problems to juxtaposition in displaying time and prevalence results on the same axes; because it is axis dependent, it is not possible to display time and percentage in the same view. 
I found that an explicit encoding only (c) offered the best option by allowing analysts to easily see and interpret differences between the datasets, despite the absolute values in each cohort are obscured. The absolute value information was available using interactions such as hover or details-on-demand to display them. With the explicit encoding method, it is also able to explore different meth- ods for encoding the differences: absolute difference, relative difference, and ratio. The values in datasets can be categorized in three groups. Take for example, the occurrence of a sequence: 1. Occurs in both datasets the same way (no difference) 2. Occurs in both datasets, but more in one 3. Occurs in only one dataset Again, providing the absolute difference is sufficient when presenting prevalence results, because percentages are bounded to 100%. However, with results dealing with time, a single scale does not accurately convey differences because 1) time is unbounded, and 2) simply scaling the axis does not always work because even 71 within a single dataset, different time granularity may exist (e.g., a hospital stay is on the order of days whereas a prescription is on the order of months). Relative differences and ratios eliminate the problems of multiple units and granularities, however I found analysts understand ratios more clearly than relative differences. For example, it is easier to interpret “hospitals stays are two times longer in cohort A than cohort B” rather than “hospitals stays are 100% longer.” Because ratios can be anywhere from 1 (in case 1) to infinity (case 3), I bound the axis to an analyst-defined maximum (default: 4x). If the ratio is above the maximum, the bar grows off the side, and if it is infinite, an infinity symbol is displayed next to the ratio bar. Guideline 3: Allow flexible methods for organizing results. I explored four methods for organizing the results, each with its own benefits: 1. Metric hierarchy. This was the approach taken in the initial version and it worked well in guiding the analysts. Analysts typically started based on se- quence length looking at single events before looking at longer sequences, then progressing based on their specific questions. This method worked best when analysts had specific questions about the datasets (e.g., if they were only concerned with whole record sequences). 2. Flattened. In open-ended and unstructured exploration, the analysts do not seem to care about what the metric is, just how important or distinguishing the result is. A flat design displays all hypothesis results in a single list view, 72 regardless of metric or sequence type, and orders them by the significance and magnitude of difference. 3. Sequence. Some researchers may have questions about a specific sequence of events. For these questions, it is best to group results by event sequence. 4. Metric/flat list hybrid. In this view, the top 10 results for each metric are displayed. A hybrid view will give a good overview of the most important features of the dataset. In the initial implementation, results were organized based on their category (Method #1). However, interviews with domain experts and analysts indicated that in their analyses, they didn’t always care which metric was significant they wanted all the significant results in one place, regardless of what type of metrics they corresponded to. 
Method #2 is the most flexible for most uses and that analysts can use result filters if they have specific questions dealing with a particular sequence or metric. Guideline 4: Provide flexible interactions for parsing results. Displaying large result sets presents challenges in parsing them. Three in- teraction techniques are provided in Coco for parsing the results. First, with so many hypotheses, not every hypothesis will apply to the dataset or domain. Re- searchers might have different priorities based on questions they already have. For example, some analysts might only be concerned with whole record sequences, while others want to see patterns across shorter subsequences. Some analysts might be 73 concerned with only metrics dealing with prevalence, whereas others are interested in both time and prevalence metrics. Filtering and sorting provides flexibility by allowing analysts to manage their data based on what is relevant to their questions. Second, as analysts sort through the results, they might easily disregard some hypothesis given their domain knowledge (e.g., results that are spurious correla- tions) and would need some way to keep track of everything they care about or have hidden. For this, I suggest simple journaling options: starring, hiding, and annotating (Section 7.2.7). Lastly, when there are thousands of tested hypotheses, it is difficult for analysts to keep track of how many hypotheses they have viewed, how many are left to view, and of those results that are unviewed, which are significant. A progress bar that indicates analysts’ progress through the result set, so analysts feel comfortable that all possibly meaningful results have been reviewed. Guideline 5: Provide context. As analysts progress through the result set, it is difficult to understand if a result is meaningful based on a single result, especially when dealing with event sequences. For example, if patients visit the ICU after the emergency room more often in a cohort of patients who died versus lived, it may only be significant because the “ICU” event occurs more often in the cohort of patients who died. CoCo provides details on demand and provides the analysts with an overview of their progression through the result set. 74 Analysts are able to see the underlying data for a selected result. Depending on the type of metric, analysts will see different information. Because metrics deal- ing with prevalence are only a matter of percentage, all this data is shown in the result snapshot and the details on demand don’t show any additional information. For metrics that show an average (e.g., all time metrics and frequency metrics), the details on demand show the exact distribution for all values (Figure 5.9). Addition- ally, the details on demand show high-level statistics about the distribution: sample size (n), average, minimum, maximum, and standard deviation. Guideline 6: Provide an overviews of both cohorts With a large dataset, an analyst may not know what his or her data looks like. EventFlow displays are embedded for each cohort to provide this overview. EventFlow was chosen because many of the analysts are familiar with it, and its aggregate display provides an overview of the most frequent patterns across the cohorts in a compact view that will scale to large datasets without using more space. Guideline 7: Provide guidance on beginning analysis With hundreds of thousands of hypothesis results, it might be daunting for analysts to know where to start with their analysis. 
To simplify this process, I suggest two methods.

First, follow a recommended process model and arrange the layout to match this process. The panels of CoCo are arranged to suggest the order in which analysts should explore their dataset. CoCo first provides an overview of all the data (scattergram and cohort overviews) on the top left, followed by more detailed views of the result set. Controls for filtering and sorting this list are prominently displayed on the top right.

Second, CoCo provides default values for all result filters and sorting methods. While these result filters are customizable, the default values provide the simplest starting point for the analysts. It is important that the default values are carefully chosen. For example, for sequence length, I decided to start with length 1, since analysts are often overwhelmed when looking at the long results list. Starting with length 1 allows analysts to get a bearing on the events in their cohorts and to choose when they are ready to move on to the next result set.

5.3 Summary

In this chapter, I present Contribution 3: a family of visualizations and guidelines for interaction techniques. High-volume hypothesis testing results in large result sets which are difficult for analysts to parse. Through an interactive user interface, analysts are able to more easily identify important results. I provide an overview of the visual analytics tool, CoCo, and discuss the design decisions that led to its development. Through case studies with CoCo, I explore the utility of these designs and interactions and distill the lessons learned into seven design guidelines.

Figure 5.5: Analysts' responses. Figure 5.6: Current scheme. Figure 5.7: Mockups of expert analysts' responses (left) and resulting glyphs (right) for visually differentiating four properties of event sequences: (a) whole record sequences, (b) concurrent events, (c) consecutive sequences, and (d) non-consecutive sequences.

Figure 5.8: Designs considered for presenting difference results between cohorts: (a) juxtaposition (directly comparing two bars), (b) superposition (overlaying bars, where the darkened area is the shared amount and the lightened area indicates the difference), and (c) explicit encoding only, which encodes only information about the direction and magnitude of the difference.

Figure 5.9: Analysts can view details about a result by clicking it. Results that correspond to comparing averages (such as average duration or average frequency) will show the distributions of all the values and statistics about the average, minimum, maximum, and standard deviation in both cohorts.

Chapter 6: Evaluation and Case Studies

6.1 Preliminary User Study

To refine CoCo's design and to observe actual practice in analyzing a real-world dataset using a combined CoCo and EventFlow tool, we conducted an early, preliminary user study with volunteers who expressed interest in learning about data visualization and about a new form of statistical analysis.

6.1.1 Method

Our evaluation design was based on the VDAR scenario [70]. More specifically, the goals of our user study were as follows:

1. To learn about the insights users would find.
2. To gain insights into the strategies users would follow.

Before this study, we tested our materials with four participants, who used either CoCo or EventFlow, and we counted the number of insights.
We observed that they tended to report everything from CoCo as insights, without considering their actual meanings and importance. In this study, we asked participants to provide further suggestions based on the their insights to guide research at the 80 hospital. As a result, participants were more engaged in the analysis and on average provided 3.5 (SD = 1.07) suggestions. Participants and Settings. We recruited 10 computer science graduate students (7 male, 3 female) through our university’s mailing list. The participants’ ages ranged from 23 to 29(M = 26, SD = 2.06). All participants had normal color vision. We ran CoCo and EventFlow on the same computer. CoCo was displayed on a 1440× 900 screen while two side-by-side EventFlow windows were displayed on a 1920× 1200 screen (Figure 1.2). For simplicity, we began by implementing only a subset of all possible metrics that users may want: • Prevalence of events, subsequences, and record sequences. • Duration between sequentially occurring event pairs. • Prevalence of events and subsequence by attribute. Procedure. Each 45-minute session included training, data analysis and post- study interview. Training started with a 2-minute introduction on each interface’s features. For each interface, participants performed 5 simple tasks and were en- couraged to ask questions. We used a pair of synthetic datasets for the training. Questions included clarifying the difference between a “sequence” and a “subse- quence” (CoCo), the difference between an event category and attribute (CoCo) 81 and the meaning of gaps between bars (EventFlow) were frequently asked. After the training, all participants said they understood everything. In the 30-minute data analysis session, a different pair of datasets (representa- tive of hospital room transfer data) were used: patients discharged alive or patients who died. Participants were asked to play the role of a data scientist and analyze the datasets using both CoCo and EventFlow. Their job was to provide insights into the similarities and differences between the paths of the two groups in the hospital. We encouraged thinking aloud and an experimenter took notes of their findings. In particular, we asked them to provide a reason when they switched between CoCo and EventFlow. During the post-study interview, participants provided comments and reflec- tions about their experience. 6.1.2 Results During the analysis, no participant asked any interface-related questions and instead concentrated on finding insights. Every participant used both interfaces. In particular, three said they prefer CoCo, while two said they prefer EventFlow. Five expressed no preference. On average, eleven (SD = 3.67) insights were reported and four (SD = 1.12) interface switches were made per participant. During the interview, all participants stated they wanted to use both interfaces. Below we summarize the results in the context of our user study goals. 82 Types of Insights. All insights can be categorized into four categories: events, whole record sequences, subsequences and time (Figure 6.1). Seven out of the ten participants mentioned that it is easier to find subsequence patterns with CoCo while it is easier to find whole record sequence patterns with the side-by-side EventFlow display. They thought EventFlow didn’t actually support detecting subsequence patterns because it only showed records as sequences, and they had to visually scan each record to compare subsequences. 
On the other hand, CoCo specifically provided a metric for subsequences. As a result, significantly more subsequence insights were found using CoCo than EventFlow (p < 0.05) (Figure 6.1). As for whole record sequences, participants preferred EventFlow and stated that it visually encoded the number of records into the heights of the color bars, which made it more comprehensive, e.g., "I just have no feeling about the numbers in CoCo. EventFlow is more interesting." Meanwhile, six out of the ten participants mentioned that it was hard to compare a specific sequence using the side-by-side windows of EventFlow because they had to search for that sequence on each side separately and had to visually compare them by height, which was not very accurate. In contrast, a participant noted that "CoCo does the comparison for me." There were no significant differences between the two interfaces in the number of insights under the whole record sequence category (p = 0.39) (Figure 6.1).

Participants found most of the insights in the event category by using CoCo. This might be related to the fact that CoCo specifically provides a metric for comparing the distribution of events. "It shows numbers and details," one participant said when he was surprised to find in CoCo that the Intermediate Care event only occurred in the group of patients who died. "I didn't notice that in EventFlow!" he added. Also, as the event metric was listed at the top of the metric list in CoCo, all participants started using CoCo by looking at that metric.

As for the time metric, six out of the ten participants mentioned that they liked the way EventFlow combined gaps and sequences. They also mentioned that in CoCo, they had to switch back and forth between the time metric and the sequence metric to look for insights. "It provides a better big picture and the time is more visible," one participant commented when he was looking at EventFlow to help himself understand the time gaps in the dataset. More insights in the time category were found using CoCo, but the difference was not significant (p = 0.25). One potential reason might be that CoCo was able to average the gap between event pairs, while those pairs were more difficult to find in EventFlow since they are not aggregated and could appear in multiple places in the visualization.

Description of Strategies. During the study, participants were allowed to switch freely between the two interfaces, but we asked them to describe why they switched. Nine out of ten participants chose to start with EventFlow. Of these nine participants, three said EventFlow provided a better overview, four said EventFlow seemed simpler and more comprehensive than CoCo (e.g., "EventFlow is more friendly to my eyes!"), and the other two said they wanted to look at the actual data with EventFlow before doing the analysis. The only participant who chose to start with CoCo said he liked the statistical summary provided by CoCo. "It looks like a dashboard," he added.

Figure 6.1: Average number of insights per participant per category using EventFlow versus CoCo. The only statistically significant difference (p < 0.05) is in insights about subsequences, where participants found more insights using CoCo.

After exploring for five to fifteen minutes, the nine participants who had started with EventFlow switched to CoCo. Three spent about fifteen minutes before making the switch. They explained that they preferred to find everything with one interface first.
Four said they got stuck with EventFlow and wanted to get some inspiration from CoCo. Two said EventFlow did not support comparing subsequences very well, so they temporarily switched to CoCo to do the comparison. The only participant who started with CoCo also switched to EventFlow. He mentioned that he had found an interesting pattern, so he switched to EventFlow to have a look at the actual data, which might help to confirm the finding.

After the first switch, all participants were familiar with both interfaces and on average made three more switches. The reasons they provided mainly fell into three categories: (1) switching from CoCo to EventFlow for the overview of event sequences and gaps, (2) switching from CoCo to EventFlow to "get a sense" of the patterns they found in CoCo, and (3) switching from EventFlow to CoCo to see the statistical information and to confirm their findings. "I like the fact that you give me two different tools. I can look at the data in different ways," one participant commented at the end of his analysis session.

Usability of CoCo. During the post-study interview, all ten participants stated that training was important for using CoCo. In particular, eight said they could use CoCo skillfully after the training and two said they needed more practice. Nine participants liked the visualization of CoCo; the remaining participant said he thought a traditional scatter plot was more effective. Background knowledge in statistics, especially of p-values, was critical to understanding CoCo. Eight participants said the statistics were easy to understand, while the other two said they did not get the idea of the p-value. However, these two participants added that they remembered the rule that black dots were more important than gray or white ones, and that it helped a lot in the analysis. Eight participants thought the sequence aggregation feature was useful, because it provided an overview and could show details on demand. One participant commented that he seldom expanded it but added, "It helped me to focus though." Another said he had difficulty locating the main sequences after doing expansions. Nine participants liked the layout and navigation of CoCo, though two of them commented that the interface was not visually impressive (e.g., "The gray background is boring.", "It didn't catch my eyes."). The remaining participant disliked the layout because several panels were seldom used in his analysis.

6.2 Case Studies: Introduction

To investigate the strengths and limitations of CoCo as an automated cohort comparison tool, we conducted case studies following the procedure of a Multi-Dimensional, Long-term In-depth Case Study (MILCS) [71]. This methodology was chosen for its emphasis on evaluating the use of the entire system with partners who are real-world analysts using their own data, outside a traditional laboratory study. The MILCS process begins with understanding the partners' data, needs, and analysis objectives through an introductory questionnaire and interview. From there, we set a schedule for regular meetings and observation, where the analyst is able to explore their data and provide feedback, while I iterate on the design of the system, fix bugs, and address the analysts' feedback. At the end of the specified period (which may range from a few weeks to several months), I reflect on the outcomes of the study and the lessons learned for both the analyst and my own work.

Fifteen groups expressed interest in using CoCo for their analysis purposes, for a total of eighteen case studies.
Eight were aborted for various reasons: an incompatible problem, lack of time, or data that was not prepared. The remaining ten were successful case studies performed with CoCo at different stages of its development, over the course of two years. Five of these successful case studies were long-term (CS1–CS5) and five were short-term (CS6–CS10). Table 6.2 summarizes all eighteen case studies. Sections 6.3–6.7 describe the successful long-term case studies in detail, Sections 6.8–6.12 describe the short-term, one-time-use case studies, and Section 6.13 discusses the aborted case studies.

6.3 CS1: Exploring Adherence to Advanced Trauma Life Support Protocol

Figure 6.2: Analysts at Children's National Medical Center used CoCo to understand potentially distinguishing attributes between patients who are treated according to the Advanced Trauma Life Support (ATLS) protocol versus those who are not.

Participants. I worked with Dr. Rachel Webman at Children's National Medical Center, a pediatric care provider in the Washington, D.C. area.

Procedure. After an initial phone conversation to discuss potential analyses, I provided Dr. Webman with a demonstration and tutorial of CoCo. After converting the data to the CoCo format and installing CoCo on her machine, we met weekly thereafter to discuss the analysis process, which included data cleaning, visualization (in both EventFlow and CoCo), and data analysis (using Stata). I met with Dr. Webman for two sessions to observe her use of CoCo. The case study ran from August 2014 to December 2014.

Analysis goals. In a previous study [3], the researchers found that about 50% of resuscitations did not follow the ATLS protocol. As a follow-up, they were interested in:

1. What percent of patients are treated in adherence to the protocol?
2. Are there distinguishing attributes (e.g., time of day, patient gender, team lead) between protocol adherence and non-adherence?
3. What are the most common deviations from the protocol?

Dataset. We began by cleaning the data. Based on the question at hand, the original dataset included many extra event categories and attributes that were not relevant to the analysis. This initial filtering was done in Excel and reduced the number of event categories from 22 to 6. All twenty patient attributes were retained, because they did not add any complexity to the visual display and were important in determining attributes associated with protocol deviation. Patient attributes included injury severity score (ISS), the day of week, length of hospital stay, time between notification and arrival at the hospital, and whether the patient was admitted to the hospital, among others.

Next, the data was displayed in EventFlow, where further data manipulation was performed. Because CoCo only accepts point events, we converted the secondary scan interval event into a single point event representing the start of the interval. Additionally, there were two types of pulse events, which were merged into a single event. Lastly, seven records that had inconsistencies in the dataset, such as the secondary scan ending before it began or the patient arriving after other events had happened, were removed after verifying the data in the original data sheets. The resulting dataset consisted of 171 patient records, with event categories for the five steps in the ATLS protocol: airway evaluation, listening for breath sounds, assessment of circulation, evaluation of neurological status (disability), and temperature control.
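These cleaning steps are straightforward to script. Below is a minimal sketch of the kind of preprocessing involved, assuming a pandas DataFrame with hypothetical file and column names (record_id, event, start, end); it is illustrative only and not the actual pipeline used in the case study.

```python
import pandas as pd

# Hypothetical export after the initial Excel filtering: one row per event,
# with "end" left blank for point events.
df = pd.read_csv("atls_events.csv", parse_dates=["start", "end"])

# CoCo accepts only point events: represent each interval by its start time.
df = df.rename(columns={"start": "time"}).drop(columns=["end"])

# Merge the two pulse event types into a single category (illustrative names).
df["event"] = df["event"].replace({"pulse_palpated": "pulse", "pulse_monitor": "pulse"})

# Remove records with internal inconsistencies, e.g., any event recorded
# before the patient's arrival event.
arrival = (df[df["event"] == "arrival"].groupby("record_id")["time"].min()
           .rename("arrival_time").reset_index())
merged = df.merge(arrival, on="record_id", how="left")
bad_ids = merged.loc[merged["time"] < merged["arrival_time"], "record_id"].unique()
df = df[~df["record_id"].isin(bad_ids)]
```

In practice, each removal was verified against the original data sheets rather than applied automatically.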
6.3.1 System Use

The first observational session lasted three hours. Over the course of those three hours, we split the dataset in six ways to load six different pairs of cohorts in CoCo as the analysts explored different hypotheses:

1. Patients treated in adherence to the ATLS protocol versus those that showed any deviation.
2. Patients admitted to the floor versus the ICU (with discharged patients removed).
3. Resuscitations where the trauma team received at least five minutes of advance notice versus those where there was less than five minutes of notice ("now" resuscitations).
4. Patients with a high (> 15) versus low ISS.
5. Patients treated on the weekend versus on a weekday.
6. Patients treated during the day versus at night.

In every comparison, the analysts began by looking at the prevalence of single events, to determine how often they occurred. The analysts then looked at the most differentiating entire record sequences, because the subsequences were less informative about how the protocol was followed. They would then make their way down the provided metrics list, in the order the metrics appeared: most differentiating time gaps and then prevalence of record attributes. They did not look at the prevalence of record attribute combinations for any of the datasets.

6.3.2 Outcomes

For analyst. For this dataset, the analysts expected to see that each patient record contained every event category. However, in two of the comparisons, (1) correctly treated patients versus those with deviations and (2) day versus night patients, the latter group received the airway check significantly less often than the former. In the day versus night comparison, the analyst also found that the "most differentiating sequence" was the correct order, meaning that the nighttime patients were treated in the correct order significantly less often than daytime patients. Additionally, patients treated at night had more variance in the procedure, with 26 unique sequences among the 83 nighttime patients versus 20 unique sequences among the 101 daytime patients. A possible reason for this finding is that during the day, nurse practitioners perform these procedures, but at night, junior residents, who may have less experience with this type of task at this particular institution, are on call instead. The analysts presented these findings at an internal symposium on pediatric care and a city-wide conference on trauma care.

For CoCo. In the closing interview, one analyst said, "We don't need to solve everything with EventFlow and CoCo. These tools let us explore the data and narrow our hypothesis." This early case study was an indication that CoCo can be effective for exploratory analysis and hypothesis generation. Additionally, through observation of the analyst, we were able to develop the basis for a process model that was implemented and tested in later versions of CoCo. Overall, the analysts noted that CoCo was useful for their needs, and that without it the analysis would have been possible but much more difficult to perform. One area in which CoCo could be improved is the inclusion of multivariate analysis, as this is a central need for their analyses.

6.4 CS2: Student Course Enrollments

Figure 6.3: An analyst at the University of British Columbia (UBC) was interested in using CoCo to better understand the pathways UBC's students typically pursue towards degree completion.

Participants. I worked with Dr.
Leah Macfadyen, Program Director of Evaluation and Learning Analytics in the Faculty of Arts at the University of British Columbia (UBC).

Procedure. All sessions, except one, were conducted remotely through Skype and screen sharing. These sessions were held at irregular intervals and scheduled as needed. The sessions consisted of troubleshooting, guidance on how to use CoCo, feedback from Dr. Macfadyen on her experience, and observation of how CoCo was used. In between these sessions, Dr. Macfadyen used CoCo independently and would regularly email her thoughts, experiences, results, questions, and requests related to CoCo.

Analysis goals. The University of British Columbia (UBC) permits students to register in and complete courses towards their degree without a required or pre-specified order. Some temporal ordering in course enrollments is imposed by requiring "core" courses to be completed first or by requiring prerequisites. Beyond these constraints, however, flexibility of enrollment results in a complex and highly heterogeneous record of student enrollment patterns.

Dr. Macfadyen was interested in using both EventFlow and CoCo to better understand the pathways UBC's students typically pursue towards degree completion. Major research questions included:

• Are some enrollment pathways more common than others?
• Are some course sequences more frequently associated with success in a given degree program or specialization?
• Do certain course combinations or sequences seem to channel students towards or away from Majors or Honours programs?
• Which course sequences have the highest rates of attrition?

A better understanding of student enrollment (and dropout) patterns over time could inform curriculum, course planning, and student advising.

Dataset. The dataset consisted of the course enrollment records and selected demographic and graduation data of 796 students enrolled in three degree programs in the School of Library and Information Science (iSchool):

• Masters of Library and Information Studies (MLIS),
• Masters in Archival Studies (MAS), and
• a joint Library and Information Studies/Archival Studies degree program (MASLIS).

The course enrollments were all in the period 2004–2013. Each event category was a course the student had enrolled in, aggregated by department (e.g., Archival Studies, Information & Society, etc.). Record attributes imported from the student records included:

• Degree program: MLIS, MAS, MASLIS
• "Grad Group": 1 = bottom 50% by graduation average grade, 2 = top 50% by graduation average grade
• International/domestic status
• Student citizenship
• Student gender

The iSchool was interested in enrollment pathway differences between male and female students and between domestic and international students, and in any relationship between course enrollment patterns and student performance, as represented by the weighted average on graduation.

6.4.1 System Use

Three paired sets of data were exported from these datasets for exploration:

1. MLIS students distinguished by gender
2. MLIS students differentiated by achievement (graduation group 1 or 2)
3. iSchool students differentiated by degree program (MLIS vs. MASLIS)

6.4.2 Outcomes

For analyst. Dr. Macfadyen was able to make numerous insights about the students' enrollment behavior.

1. Gender differences in MLIS student course enrollment choices. Female students are over-represented in the MLIS cohort by a ratio of 2:1 (as in the Faculty of Arts as a whole).
CoCo analysis suggested that female students are significantly more likely to complete courses in Library Services for Children (p = .002) and Professional courses (p = .036). Analysis of the most differentiating subsequences suggested that male students are more likely to complete multiple IT & Systems courses alongside their LIBR core courses, while female students are more likely to combine LIBR core courses with Library Services for Children courses. In line with these observations, analysis of the "most differentiating co-occurrences" showed that female students are significantly more likely to combine Library Services for Children courses (yellow) with selected other courses.

2. Top 50% versus bottom 50% of MLIS students by graduation GPA. The only moderately significant (p < 0.05) differentiating event (course enrollment) between these two groups is that the lower-achieving group is more likely to have enrolled in Information & Society courses. No significantly different sequences or subsequences of courses were observed between the two groups, and the co-occurrence of two Professional courses is significantly more likely (p = .012) in the higher-achieving group.

3. Comparing course enrollment choices of MLIS and MASLIS students. For these comparisons, Archival Studies (ARST) courses were excluded, since MLIS students do not complete ARST courses. Analysis of the most differentiating events indicates that MASLIS students are significantly more likely to complete Library Services and Texts & Collections courses than MLIS students (p < .01). Meanwhile, a range of co-occurring event combinations involving Texts & Collections and Professional courses are observed significantly more frequently for MASLIS students.

For CoCo. Because students take multiple courses each semester, the dataset naturally contained many concurrent events. This was CoCo's first major case study involving concurrent events, and because of this, it was the single most helpful case study in determining the requirements for dealing with concurrent events. On the backend, this introduced the need for an altered data structure and new metrics. The data structure required only a minor change: instead of including all events in a flat list, the list was converted to nested lists, where events are bucketed by timestamp (a small sketch follows below). The addition of new metrics posed more challenges in deciding how concurrent events should be treated in the context of subsequences. First, when two events occur at the same timestamp, it is unclear whether to count them as a sequence of length 1 or length 2 because, as there is only one timestamp, this may not necessarily be considered a "sequence." We chose to count this as a sequence of length 1, because it provides additional insight into how often two events occur concurrently versus each occurring on its own. Second, it is not immediately clear how to handle concurrent events in the context of subsequences. For example, suppose we have the sequence (AB)D, where A and B occur concurrently and are followed by D. Depending on the analysis, it may or may not be necessary to count the sequence as an instance of "AD" or "BD" on its own. We added the new metrics to the taxonomy, but leave their implementation for future work. On the frontend, CoCo needed to be adjusted to show sequences that have overlap. Section 5.2 shows our method for determining how to represent concurrent events.
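To make the backend change concrete, here is a minimal sketch of the nested-list idea (hypothetical types and names; not CoCo's actual implementation): a record's flat event list is grouped by timestamp, so concurrent events land in the same bucket.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# A record is a flat list of (timestamp, event) pairs; timestamps may repeat.
Record = List[Tuple[int, str]]

def bucket_by_timestamp(record: Record) -> List[List[str]]:
    """Convert a flat event list into nested lists, one bucket per timestamp."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, event in record:
        buckets[timestamp].append(event)
    # Buckets are returned in temporal order; each inner list holds concurrent events.
    return [sorted(buckets[t]) for t in sorted(buckets)]

# Example: A and B are concurrent and followed by D -> [['A', 'B'], ['D']]
print(bucket_by_timestamp([(1, "A"), (1, "B"), (5, "D")]))
```

Under this representation, a concurrent pair such as (AB) is naturally a single element of the sequence, which matches the decision to count it as a sequence of length 1.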
Along the way, Dr. Macfadyen also provided valuable feedback on usability and bug fixes, such as allowing customizable colors and adjusting the display for multiple screen sizes.

6.5 CS3: Medication Adherence Patterns of Hypertension Patients

Figure 6.4: Researchers at the University of Maryland used CoCo to compare whether drug adherence affected the cost that patients incurred over a year. In other words: could taking medication as prescribed result in lower overall medical costs?

Participants. I worked with Dr. Margrét Bjarnadóttir and Dr. Eberechukwu Onukwugha for this case study. Dr. Bjarnadóttir is a Professor at the Smith School of Business specializing in operations research methods using large-scale data. Dr. Onukwugha is a Professor in the Department of Pharmaceutical Health Services Research and specializes in cost-effectiveness analysis, health disparities, and medical decision-making by individuals.

Procedure. The analysis was done in chauffeur mode, with me using CoCo while being advised by Dr. Bjarnadóttir and Dr. Onukwugha based on the results.

Analysis goals. The researchers were analyzing the medication adherence patterns of patients on diuretics (i.e., are patients taking their drugs as prescribed, in which combinations, what characterizes the gaps between prescriptions, etc.). In particular, they were interested in the differences between high-cost and low-cost diuretics patients and wanted to know what patterns are representative of each group. The researchers wanted to compare whether drug adherence affected the cost that patients incurred over a year. In other words: could taking medication as prescribed result in lower overall medical costs?

Current methods for adherence analysis consist merely of calculating a Medication Possession Ratio (MPR) [72] or similar aggregated measures that do not represent the diversity of patterns found in the data. The MPR for a pre-defined period is calculated as:

MPR = (number of days prescribed) / (days elapsed over the period).

For example, if a patient only refilled one 30-day prescription over a period of 90 days, their MPR is 30/90, or 1/3. This method oversimplifies a patient's prescription history into a single number, which may not provide an accurate representation. A patient might refill prescriptions early when planning to leave on vacation, thus leaving a larger-than-usual gap in their refill history, when in fact they were taking their medication regularly. Conversely, a patient who switches to another medication after a recent prescription refill may have a history that incorrectly indicates that the patient regularly took their prescription.

Dataset. The data these researchers gathered consisted of prescription refill histories for five drugs commonly used to treat hypertension. The data spanned one year and contained over 1 million patients. The data also included each patient's total prescription costs over the year. We report here only on the analysis of the adherence patterns of patients who took medications from only one drug class, diuretics, which consisted of a total of 113,401 patients. The dataset consisted of two event categories: diuretic and gap, where diuretic indicated the start time of a prescription and gap indicated the start time of a period of no medication usage. The patients were categorized into "HIGH" versus "LOW" cost patients based on the distribution of prescription costs for the patients. Patient costs ranged from $0 to $9,528 (USD).

Table 6.1: Number of hypotheses generated by metric and sequence type.
Most patients (55%) had no prescription costs, and the average cost was $25.39. We excluded patients with $0 costs and patients with costs above $380 (the top 1%), to remove outliers due to multiple prescriptions or medical costs that were not associated with hypertension (e.g., an automobile accident). The final dataset consisted of 3,958 patients categorized as HIGH cost and 38,175 patients categorized as LOW cost.

6.5.1 System Use

The third version of CoCo was used to compare prescription patterns of high- versus low-cost patients. In total, CoCo generated results for 94 hypotheses. The hypotheses are broken down by metric and sequence type in Table 6.1.

The analysts first used the Sequence Occurrence panel to review only results with a sufficient sample size. The threshold was set at 10% of each cohort, or 395 in the HIGH cost group and 3,817 in the LOW cost group, which reduced the number of hypotheses to review to 24. Next, the remaining insignificant results (p > 0.05) were removed using the Filter by Significance feature, leaving a more simplified display of 21 results. Finally, the results were filtered by Sequence Length, to view only sequences of length 1 (single events) or 2 (event pairs). Because there are only two event categories in the dataset, longer sequences were just repetitions of length 2 or less, so this was all that was necessary to view all unique patterns. Thus, there were 10 remaining hypotheses to review in detail. The final result display and settings are shown in Figure 6.5. The analysts then evaluated the remaining hypotheses one by one, using context information provided in the details-on-demand panel.

6.5.2 Outcomes

For analyst. This made it easy to conclude that high-cost patients tended to have longer sequences, with more gaps and prescription refills, whereas low-cost patients had shorter sequences, most commonly filling only a single prescription. Low-cost patients also took significantly longer gaps between prescription refills. As a follow-up, the analysts plan to incorporate medical claims data to understand the more serious medical implications of medication adherence, such as heart attacks or strokes.

For CoCo. This case study provided a better understanding of the value of having all metrics in a single view and was the first test of the prescribed analysis process. It provides an illustrative example of the challenges that researchers and analysts encounter, and describes how the implementation of new visualization interaction techniques for event sequence hypotheses in CoCo enables the automatic analysis of two groups of records.

Figure 6.5: Final results and usage of the drug pattern case study. Analysts used the Sequence Occurrence panel (c) to control sample size, and the Filter panel (b) to control significance and sequence length. This resulted in only 10 hypotheses (a) for the researchers to manually review.

6.6 CS4: Customer Web Logs

Figure 6.6: Analysts at Adobe were interested in comparing user click logs using CoCo to understand which events lead to a product purchase and which do not.

Participants. I worked with Dr. Eunyee Koh, a Senior Research Scientist at Adobe Research. Her research focuses on semantic analysis and metadata extraction from media, and how to visualize the extracted metadata interactively for people.

Procedure. I worked with Dr. Koh to train her on the use of CoCo and to formulate the objective of the analysis. We met biweekly over Skype, where she provided feedback dealing with the scalability of CoCo. After becoming an advanced user of CoCo, Dr.
Koh then performed independent evaluations of CoCo with two analysts at Adobe.

Analysis goals. The analysts were interested in understanding user behaviors and exploring the data in a free-form way.

Dataset. The dataset contained users' events on a product website, such as viewing display ads, signing up for promotions or free trials, and purchasing products.

6.6.1 System Use

All three analysts used the same dataset to compare the group of users who purchased the products without using trials versus those who purchased after using product trials. In particular, the analysts explored the occurrence of display ads and retargeting events (e.g., an ad for a product the user has already viewed) between the two cohorts. By exploring events that were statistically significant in the result panel, the analysts found that one group viewed display ads more than the other group, and that group also had more retargeting events. By further investigating other events, such as product trial and adoption, using CoCo, the analysts hypothesized that the first group, who viewed the display ads more, seemed fairly new to the website's product offerings ("explorers"), while the other group, who were exposed to fewer display ads and retargeting, seemed to have good knowledge about the website's products and offerings ("experienced users").

Since the datasets contained a large number of event categories (over 120), the analysts found the event filtering panel most helpful, and they were able to focus the analysis on specific events. In addition, the reduced metric calculation time provided a much better user experience for data analysis, as the analysts did not need to wait for CoCo to load data and finish hypothesis testing before they could begin their explorations. The analysts all mentioned that they would like to explore the individual event sequences in the dataset more freely. They said that the results were a bit linear, and they would prefer to have free-form exploration and interactions.

6.6.2 Outcomes

For analyst. The work was useful for the analysts to discover attributes of user behaviors. In the exit interview, the analysts stated that the use of CoCo made finding these insights much easier than with other tools, and that they would use CoCo again in the future.

For CoCo. Previous versions of CoCo had only been used on relatively small datasets of up to 2,000 records per cohort and up to 50 event categories. Web log datasets, on the other hand, record millions of users who access the website per day and hundreds of clickstream events per user. The increased volume of records and variety of event categories presented new challenges for CoCo on both the front- and back-ends. Through an iterative process over six weeks, we proposed solutions, implemented them in CoCo, and received feedback from the analysts. The scalability techniques used were formalized into guidelines that were presented in a joint paper [73].

6.7 CS5: In-Classroom Student Behaviors

Figure 6.7: An analyst at the University of British Columbia (UBC) used CoCo to compare the in-classroom behaviors of students in the top quartile versus the bottom quartile.

Participants. I worked with Dr. Macfadyen, Program Director of Evaluation and Learning Analytics in the Faculty of Arts at the University of British Columbia.

Procedure. All sessions were conducted remotely through Skype and screen sharing. These sessions were held at irregular intervals and scheduled as needed.
The sessions consisted of troubleshooting, guidance on how to use CoCo, feedback from Dr. Macfadyen on her experience, and observation of how CoCo was used. In between these sessions, Dr. Macfadyen used CoCo independently and would regularly email her thoughts, experiences, results, questions, and requests related to CoCo.

Analysis Goals. In 2013, Smith et al. [74] outlined their development and use of a new tool called COPUS, the Classroom Observation Protocol for Undergraduate STEM. As part of a focus on improving student learning, they developed COPUS to facilitate the collection of information on the range and frequency of in-class teaching practices at department-wide and institution-wide scales. They and others have subsequently reported results generated through use of the tool, but almost exclusively present this data in pie-chart form indicating student and instructor activity as a percentage of total time or activity intervals. To date, analyses appear to have ignored the sequential element of the data.

Dr. Macfadyen wanted to explore and compare the actual sequencing of in-class activity in relation to student learning. Specifically, in comparing two classes, her goal was to uncover correlations between in-classroom behaviors and student performance across different classes.

Dataset. The COPUS data comprised manually collected observational data on student and instructor in-class activity in 14 different biology courses. Each class had an outcome variable, "performance," computed as an average "% learning gain" based on pre- and post-tests; performance is the normalized change in test performance per class per student. The class histories were then grouped by performance quartile, and Dr. Macfadyen compared the top quartile against the bottom quartile. The dataset contained 8 event types, grouped into "passive" and "active" actions for both students and instructors.

6.7.1 System Use

After initially exploring the dataset in EventFlow, Dr. Macfadyen used CoCo to conduct two comparisons of top- and bottom-quartile students. The first comparison included all courses, whereas the second was filtered to first-year courses only.

6.7.2 Outcomes

For analyst. Overall, Dr. Macfadyen found CoCo to be useful, though early iterations of CoCo struggled to analyze the dataset. Dr. Macfadyen found that the frequency of use of clicker questions (CQ) and moments of independent student work (SIW) is significantly higher in top-quartile courses. As a result of this exploration, Dr. Macfadyen presented her work with EventFlow and CoCo at a workshop on Learning Analytics and Knowledge (LAK) [75].

For CoCo. This case study deepened CoCo's ability to handle concurrent events. Because the COPUS data is bucketed into 2-minute timeslots, all events are concurrent. Significant changes to CoCo as a result of this case study include extending "sequences of length 1" to include concurrent events. That is, the occurrence of overlapping events is shown as a sequence of length 1 because they occur at a single timestamp. In doing so, analysts can more easily see which events commonly occur with other events and which do not. This case study also served as an example of CoCo's usefulness for datasets with a low volume of records but a high volume of events.
Though there were only 43 classroom histories, the volume of events allowed for sufficient sample sizes and showed that CoCo is still suitable for relatively low-volume datasets.

6.8 CS6: Distinguishing Types of Radiation to the Bone

We worked with partners at the Department of Pharmaceutical Health Services Research at the University of Maryland School of Pharmacy in Baltimore. In previous work, the researchers were interested in developing an algorithm using claims data to differentiate between radiation delivered to the bone and radiation delivered to the prostate gland, because the billing codes available in claims data do not distinguish the site of radiation. Reliable measures for identifying the receipt of radiation to the bone are important in order to avoid bias in estimating the prevalence and/or mortality impact of skeletal-related events, including radiation to the bone. Studies using healthcare claims employ various claims-based algorithms to identify radiation to the bone and mostly condition on prior claims with a bone metastasis diagnosis (billing) code [76–78]. The researchers developed three classification algorithms that were compared using CoCo and EventFlow to investigate the timing of possible radiation to the bone among patients diagnosed with incident metastatic and nonmetastatic prostate cancer. One algorithm was based on prior literature, while the other two were based on insights gained from data visualization software. Based on clinical input regarding the duration of palliative [79, 80] versus curative radiation, the researchers investigated the length of radiation episodes and found differences between cohorts in terms of the length of radiation. As expected, patients diagnosed with metastatic disease received shorter-course radiation than patients diagnosed with nonmetastatic disease.

The feedback on CoCo was positive, and the team valued the opportunity to visually compare cohorts of patients using summary statistics that pertained to the timing and frequency of events. The graphical results were shared with clinicians on the research team in order to determine whether the patterns were consistent with their expectations. The researchers felt the meaning of the metrics could be explained more clearly; it was sometimes unclear what the x-axis represented and what statistical tests were used. They also suggested always showing the event labels, particularly for single-event metrics, to make understanding the icons a bit easier. The researchers expressed a need to be able to sort the rows of results by different factors, including by the raw percentage of values in each cohort. We implemented this feature before the formal case study.

6.9 CS7: Children's AIM2

After the initial case study with Children's National Medical Center (Section 6.3), we began working on another dataset. Similar to how the ATLS protocol is standardized for the trauma bay, researchers were interested in seeing whether similar patterns emerge during resuscitations dealing with head injuries, which might guide the development of guidelines or a protocol. The case study lasted about two months, while the analysts and I worked together to clean the data. Because this was a relatively new dataset, however, there were not enough records to form statistically significant conclusions (n < 20). However, CoCo was helpful in finding errors in the datasets and determining what remained to be cleaned.
Although ultimately the case study did not provide significant insights, it was helpful for understanding CoCo's limitations in terms of the number of records and the number of event categories that can be supported, and it helped the Children's Hospital team understand how CoCo may be helpful in the future when the number of records increases.

6.10 CS8: Computer Activity Logs

Fan Du, a PhD student in Computer Science at the University of Maryland, used CoCo to identify patterns for detecting insider threats using computer activity logs. The dataset contained approximately 180 million events from monitoring the computer usage of employees, consisting of 6 event categories (e.g., login, email, web browsing, etc.). The users were divided into "suspicious" versus "normal" users. After much data cleaning, the data was reduced to only several thousand events.

The analyst was interested in identifying event categories which indicate suspicious user activity, in order to further simplify the datasets. For each subset, CoCo identified event categories that occurred significantly more or less prevalently on high-scored days than on low-scored days. Thus, the analysts inspected a display that used only these differentiating event categories. For a medium-sized subset, this strategy further reduced the number of events by 92% (from 462 to 24), and the number of unique complete sequences by 74% (from 27 to 7). Comparisons of temporal patterns between days with high and low scores were made based on the simplified visualization.

While differences were found, we believed that the data itself was not complete or detailed enough to make inferences about what might constitute suspicious event sequences. This case study resulted in the analyst using this dataset as an example for methods for cleaning and simplifying temporal event sequence data [81].

6.11 CS9: Social Media Messages

Cody Buntain, a PhD candidate in Computer Science at the University of Maryland, used CoCo to identify differences in structure between credible and non-credible Twitter messages. The dataset used was CREDBANK, a large-scale corpus of social media messages collected between mid-October 2014 and the end of February 2015. It is a collection of streaming tweets tracked over this period, topics in this tweet stream, topics classified as events or non-events, and events annotated with credibility ratings [82]. Each record is an "event" that happened (e.g., the Boston Marathon bombing), and events (in the context of CoCo) are individual tweets or messages. Using the credibility ratings, the data was divided into credible (i.e., true) versus non-credible tweets, and CoCo was used to determine whether there were any structural differences between these two datasets, to help identify features that may be used in developing automated credibility detection for Twitter messages.

CoCo revealed several differences in the structure of credible versus non-credible events [83]:

• First, credible events had a statistically significant (p < 0.01) higher frequency
Smith School of Business at the University of Maryland, College Park. Dr. Barnes was interested in understanding how to determine characteristics that indi- cate promising players. Using a baseball-reference.com [84] dataset, which calculates a yearly Wins Above Replacement (WAR) average per player per year. The WARs are categorized into five groups, which show the player’s demonstrated ability. The player’s WAR is calculated once per year. The initial analysis compared pitchers versus batters. Because of the very long histories of the players (in some instances, over 15 years), Dr. Barnes found the most useful metric to be the non-consecutive and consecutive subsequence results (e.g., long-term or multiyear patterns). The explicit way that CoCo breaks down each unique sequence was also helpful in quantifying the variety of player’s career trajectories. Though variety is expected across all players, one key insight was that 121 pitchers had more unique patterns than batters, possibly due to a higher potential for injuries. 6.13 8 Incomplete Case Studies include figure/list again here? There were eight other groups that expressed interest in using CoCo for analy- sis and received a demo of CoCo. Five in the healthcare domain, two in business, and one in transportation. However, these were not completed for a variety of reasons (Table 6.2c): • Data quality deemed unsatisfactory - In three cases, the data required too much cleaning to continue with the case study. In the transportation case study, there were hundreds of event categories because they had been typed by operators instead of selected among a list of possible event names (e.g., they included the names of individuals being contacted instead of the job title). After the categories had been aggregated, the analyst realized that the procedure and timing of the recording of events events was different for different agencies so no valid comparisons could be made. The analyst effort was then redirected to attempting to change the way data is recorded. • No suitable comparison - Three case studies were not completed because though data existed and was cleaned for event sequence analysis, there was no suitable comparison. In all three cases, the analysts had used EventFlow or CoCo for a previous trial and were successful in finding results, and were 122 invited to try CoCo. However, we found that although cohort comparison seemed like the next logical step, there was no driving hypothesis which sup- ported pre-defined subpopulations in the dataset to be compared. For ex- ample, in the case of the head trauma dataset, analysts were interested in understanding if there was a pattern that emerges consistently when treating head trauma patients and thus wanted to compare “similarly treated” pa- tients versus “deviations.” However, a central issue was that defining these subpopulations was part of the task and CoCo is not suited for clustering tasks. Although cohort comparison is relate to clustering and classification problems, CoCo is designed for exploratory analysis and open-ended ques- tions, and there still must be a driving hypothesis that allows analysts to split the dataset into groups. Thus, CoCo is best suited for retrospective cohort analysis where the splitting method involves comparing outcomes, treatments (existence of an event), or record attributes. • Aborted - In the remaining three cases, the case study partners became busy or unavailable after expressing interest and receiving a demo. 
6.14 Summary

This chapter covers Contribution 4: evaluations to demonstrate the utility and impact of these methods, through a user study and a series of five long-term and five short-term case studies. The early, preliminary user study refined CoCo's design and allowed me to observe analysts' actual practice of analyzing a real-world dataset using a combined CoCo and EventFlow tool. The case studies followed the procedure of a Multi-Dimensional, Long-term In-depth Case Study (MILCS) [71]. All case studies with CoCo illustrated the strengths of the system and highlighted limitations, which allowed me to iterate on its design. Though many improvements were necessary at each step, each case study partner was able to understand their data and answer questions about cohort comparison better using CoCo than they had previously been able to.

Chapter 7: Discussion and Future Work

Event sequence data is being collected more and more, in a wide range of domains. With this increased volume of data, developing new, efficient methods for analyzing it is paramount. Despite the commonality of the data type, existing analysis tools for cohort comparison fail to address the unique challenges that come with comparing event sequences. My work aims to bridge this gap by providing an understanding of the complex task of event sequence comparison and a visual analytics tool that combines statistics with an interactive visualization to enable more rapid data exploration, hypothesis generation, and insight discovery. The direct contributions of this dissertation are:

A taxonomy of metrics for comparing cohorts of temporal event sequences. Through a systematic literature review and EventFlow case studies, I identified common questions that users ask when comparing two or more groups of event sequences and organized these questions into a taxonomy of metrics.

A statistical framework for exploratory data analysis. I implemented a subset of the metrics introduced in the taxonomy and identified and solved the major practical challenges of applying thousands of statistical tests, a method I refer to as high-volume hypothesis testing (HVHT).

A family of visualizations and guidelines for interaction techniques. Through an iterative design process with case study partners, I developed and implemented visualizations and interaction techniques that are useful for understanding and parsing large sets of hypothesis results.

Evaluations to demonstrate the utility and impact of these methods. I performed three types of evaluation through the development of CoCo:

• a preliminary user study comparing CoCo to EventFlow for the task of cohort comparison,
• five long-term case studies with case study partners: two in the medical domain, two in education, and one in web log analysis, and
• five short-term case studies: two in the medical domain and one each in sports analytics, social networks, and security.

7.1 Limitations

Though CoCo has been shown to be a powerful analysis tool for analysts in a wide array of domains, there are some limitations to its application. Section 7.2 discusses avenues for future research.

7.1.1 Difference Metrics

CoCo focuses on differences between the cohorts, but metrics that show similarity between cohorts can also be useful. Additionally, CoCo focuses exclusively on events and sequences which do occur in a dataset.

7.1.2 Statistical False Positives

When running thousands of statistical tests on a single dataset, the chance of false positives and erroneous correlations increases.
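A standard safeguard is to adjust the batch of p-values for multiple comparisons. The sketch below (assuming statsmodels and hypothetical p-values; CoCo's own correction options may differ) shows a Benjamini-Hochberg false discovery rate adjustment:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from a batch of high-volume hypothesis tests.
p_values = [0.001, 0.004, 0.02, 0.04, 0.21, 0.47, 0.68, 0.93]

# Benjamini-Hochberg false discovery rate control at alpha = 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for raw, adjusted, significant in zip(p_values, p_adjusted, reject):
    print(f"raw p = {raw:.3f}  adjusted p = {adjusted:.3f}  significant: {significant}")
```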
We attempt to mitigate these risks by providing options for statistical corrections, making the distribution of p-values more transparent, and allowing users to see hypothesis results in the context of other related sequences. Despite these considerations, however, advanced expertise in statistics is required to truly appreciate the associated pitfalls of this type of analysis. It is important to note that CoCo is intended for exploratory data analysis and that any significant results should be followed by a formal, controlled study to confirm or reject any hypotheses.

7.2 Future Work

7.2.1 Supporting Comparison of Three or More Groups

Extending CoCo to support comparison of three or more groups would require changes to the statistical methods and to its visualizations. On the statistics side, using a method such as one-way Analysis of Variance (ANOVA) [85] or linear regression would allow comparing three or more groups. Currently CoCo uses t-tests, which are adequate for, but limited to, comparing only two groups.

Extending CoCo to three or more groups would also require changes to its display. The current method of using left versus right columns for each of the cohorts works well for two groups, because each hypothesis result can be ranked and listed. However, extending to three or more groups would require losing the ability to rank and list the results, or would require new displays entirely. One option for the display would be to use a lower-triangle matrix to show multiple pairwise comparisons for each group pair, similar to the Simplified Overviews method [86].

7.2.2 Integrated Cohort Selection

The first step in every case study partner's analysis was to determine the two cohorts being compared. Providing methods for cohort selection integrated directly into CoCo would not only be more convenient, but would also allow for more complex analysis. Allowing users to switch between split features would also enable them to investigate potential causal relationships. For example, in the Children's Hospital case study, the analysts first looked at patients who were treated correctly versus those who were not. They found that the "now" attribute was a discriminative feature. To confirm this difference, they then split by "now" versus "not now" patients, and found that there was indeed a significant difference in protocol adherence.

Analysis can be further aided by integrating tools for cohort selection within CoCo, by providing simple interaction techniques to select the split feature. This problem is interesting because of the number of ways a cohort can be selected, depending on the analysis to be performed:

• Record attribute (a small sketch of attribute-based splitting follows this list)
  – Binary values: each value corresponds directly to a cohort
  – Categorical values: the user must select which values go into which cohort (e.g., if the attribute is which browser a user was using, the records can be divided into mobile versus desktop)
  – Continuous values: choosing binary ranges or ranges that may not be contiguous (e.g., normal blood sugar level between 70 and 99, abnormal otherwise)
• Absolute date (e.g., this year versus last year, beginning by..., ending by..., occurring during...)
• Relative time (e.g., lasting longer or shorter than...)
• Outcome
• Event/sequence ("contains")

Further, by integrating the split type into the tool itself, the algorithm can make use of this information for optimizing and reducing the subspace of metrics to calculate, explained further in Section 7.2.3.
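As an illustration of attribute-based splitting, the following is a minimal sketch (with hypothetical column names; not CoCo's interface) showing how a continuous attribute and a categorical attribute can each be turned into two cohorts:

```python
import pandas as pd

# Hypothetical record-attribute table: one row per record.
records = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "blood_sugar": [85, 140, 92, 70, 210],
    "browser": ["mobile", "desktop", "desktop", "mobile", "desktop"],
})

# Continuous attribute: normal blood sugar between 70 and 99, abnormal otherwise.
is_normal = records["blood_sugar"].between(70, 99)
cohort_a = records[is_normal]    # "normal" cohort
cohort_b = records[~is_normal]   # "abnormal" cohort

# Categorical attribute: the user assigns each value to a cohort explicitly.
cohort_mobile = records[records["browser"] == "mobile"]
cohort_desktop = records[records["browser"] == "desktop"]
```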
Record Attribute Visualization. The next step, after choosing cohorts, is to visualize the record attributes. There are two primary use cases in which visualizing cohort attributes might be useful: (1) In the case where cohorts are already selected (for example, in A/B testing of web sites, where users are placed in a group at random rather than based on any user attributes), a visualization might be used to easily determine whether the groups are balanced across all possible attributes. The interest here is not necessarily in any actionable outcome, but simply in being aware of any biases or imbalances between the cohorts. (2) In the case of a retrospective study where cohorts must be chosen carefully for analysis, the visualization can help a user select the patients for each cohort by providing a real-time, responsive visualization that shows the attribute balances as the user moves records between the two groups. Visualizing these attributes becomes more complex when combinations of attributes are considered. Second, the same problem arises of visualizing the different types of values an attribute can have. For example, a simple pie chart might be sufficient for binary values, whereas continuous or categorical values might require some sense of the minimum, maximum, and average values, as well as an overall distribution.

Automatically Generating Balanced Cohorts Based on Record Attributes. A natural extension might be automatically suggesting which records to move in order to best balance the cohorts (where "best" could mean the fewest number of moves, the most balanced number of total records, etc.).

7.2.3 Optimization

Currently, CoCo runs a metric on the datasets only when the user selects that metric. Because the set of metrics is bounded, users could benefit from automated computation of every metric in advance. Automatic computation would provide users guidance in exploring the problem space and save time during exploratory data analysis.

When comparing patterns across two or more cohorts, statistical tests are important for comparing means and proportions. However, calculating the significance (e.g., p-value) for a statistical test is a computationally time-consuming approximation problem. Because the number of unique subsequences in a cohort grows exponentially with the number of event types and records, even small datasets with 250 records, 10 event types, and under 5,000 unique subsequences can result in significant wait times for the user. Preliminary timing tests using CoCo indicate such a dataset would take as long as 1.5 seconds to calculate the significance tests for the prevalence of all 5,000 subsequences in both cohorts.

One visual analytics approach to reducing wait times between user operations is given in Progressive Visual Analytics. Stolper et al. [7] give design guidelines which include allowing the user to direct the algorithm by prioritizing subspaces and designing the algorithm to give meaningful, partial feedback. However, with statistical data, there are unique methods for the algorithm to prioritize subspaces automatically. I propose that progressive visual analytics can be improved by self-directed algorithms.

Implementing the design guidelines of Stolper et al. provides users feedback during the computation process and allows them to prioritize or ignore subspaces of interest. Besides user-directed methods, progressive analytics algorithms can "self-direct" in order to maximize efficiency in computing many metrics.

Prioritizing by P-Value Estimation.
Because calculating the exact p-value is time consuming, we can significantly reduce calculation time by prioritizing sequences that are likely to be significant. For example, the χ2 statistic can be calculated in constant time, and from it we can determine the range the p-value will fall in using a look-up table of pre-computed values. If the p-value is above 0.1, it is likely that the user will not care about the exact p-value, and we can skip this subspace. Similarly, for p-values below 0.1, the algorithm should prioritize these results and determine the exact p-value before moving on to potentially less significant results.

Ignoring Subspaces by Split Feature. Not all metrics are applicable to all datasets or sequences. For example, the factor on which the cohorts are formed may call for different types of questions to be asked about the data. Consider a set of medical patient records split by date (e.g., last month's trials versus this month's). A researcher might examine how outcomes for the patients differ between the cohorts, whereas an analysis of a dataset split by the outcome of the record (e.g., patients who die versus those who live) would ignore such a metric. CoCo enables users to split cohorts by factors such as:

• time (patients this month versus last month)
• outcome
• patient attribute (age, gender, location, team, position)
• event occurrence (treatment A versus treatment B)

If the algorithm knows what the cohorts are split by, we can eliminate some metrics completely. Users can specify the split factor, or the algorithm can automatically detect it. For example, if the cohorts are split by a binary patient attribute, we can expect 100% of patients in cohort α to have one value and 100% of those in cohort β to have the other.

7.2.4 Database Backend

CoCo currently stores all data in memory. As a result, the size of datasets and results is limited by the size of the user's machine memory and by browser data transfer limits. Using a database would allow for more scalable data storage and more seamless integration with users' existing tools and databases.

Tables. Such a database would require tables in three major areas: (1) raw data storage for each cohort, (2) intermediary sequence counts and information, and (3) hypothesis results.

Cohort Data. The current file input format lends itself nicely to a relational database. The raw cohort data can be stored in a single Events table, with columns that match the current input file schema:

Event ID | Cohort | Record ID | Event | Time

The Event ID would be a key that is automatically assigned. If interval events were introduced, a column could be added for "end time" with an imposed constraint that it must occur after the "start time." Event attributes could be stored in a separate three-column table, keyed by the Event ID (corresponding to an entry in the Events table):

Event ID | Attribute | Value

A Record Attributes table would similarly include a Record ID, Attribute, and Value:

Record ID | Attribute | Value

Intermediate Sequence Value Tables. In calculating the metrics for the event sequences in CoCo, the intermediate results should be stored in tables because they are accessed by many parts of the system. First, a Sequence table should store all the sequences found in the dataset and the number of times each occurs in each cohort:

Sequence ID | Sequence | Consecutive | Occurrences in A | Occurrences in B

The Sequence ID would be automatically assigned by the database. Sequence would be the sequence of events, and Consecutive would be a true/false boolean value. Tables for other metrics would include the number of records containing a sequence and the duration of a sequence (from first to last event).
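A minimal sketch of these tables (using SQLite via Python's standard library; the table and column names are illustrative, not a finalized schema) might look like the following:

```python
import sqlite3

conn = sqlite3.connect("coco.db")
conn.executescript("""
-- Raw cohort data: one row per point event.
CREATE TABLE IF NOT EXISTS events (
    event_id  INTEGER PRIMARY KEY,   -- assigned automatically
    cohort    TEXT NOT NULL,         -- e.g., 'A' or 'B'
    record_id TEXT NOT NULL,
    event     TEXT NOT NULL,
    time      TEXT NOT NULL
);

-- Record-level attributes: one row per (record, attribute) pair.
CREATE TABLE IF NOT EXISTS record_attributes (
    record_id TEXT NOT NULL,
    attribute TEXT NOT NULL,
    value     TEXT
);

-- Intermediate sequence counts used by the metrics.
CREATE TABLE IF NOT EXISTS sequences (
    sequence_id      INTEGER PRIMARY KEY,
    sequence         TEXT NOT NULL,     -- e.g., 'A > B > D'
    consecutive      INTEGER NOT NULL,  -- boolean: 1 = consecutive
    occurrences_in_a INTEGER NOT NULL,
    occurrences_in_b INTEGER NOT NULL
);
""")
conn.commit()
```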
7.2.5 Interval Events Extending to interval data requires considerations on both the backend and the frontend. Monroe [63] covers these challenges in great detail, from input processing to data structures, storage, and display methods, and many of these issues and their solutions can be applied to CoCo. For example, the inclusion of interval events would require additional information about the granularity of timestamps and consistency checks during file processing. Regarding visualizing hypotheses and sequences, adaptations would need to be made to account for the 13 temporal relationships between two intervals and the 5 relationships between intervals and points [63]. A method similar to EventFlow could be employed, where the start and end events are represented as points and the interval between them is shown as a shaded region. Aside from the practical issues of storing and representing intervals, the task of cohort comparison specifically would require the addition of new metrics. While point events have the notion of "before" and "after," interval events introduce the concept of "during." More work would need to be done to extend the taxonomy of metrics to new hypotheses involving intervals, but potential metrics could include: • Duration of an interval event. Does one interval tend to last longer in one cohort than in the other? • Duration of overlap between interval events. Aside from how often certain intervals overlap, how long do they overlap for? • Prevalence of events and sequences that occur during an interval. 7.2.6 Extending to Other Data Types Though my dissertation focuses on comparing cohorts of event sequences, this work can be extended to other data types, such as network graphs, time series, or multivariate data. In each case, the metrics, visualizations, and interactions would have to be adjusted for the appropriate data type. Take, for example, extending to network graph data. Metrics can be similarly divided into "summary metrics," which summarize the networks as a whole (e.g., node and edge counts, degree of connectedness, and reciprocity). Additionally, there are node-level metrics, such as degree (including in- and out-degrees for directed networks), node centrality, and node closeness. The metrics could also include metrics about specific subgraphs within the graph, for example the distance between pairs of nodes, the number of unique paths between two nodes, and the nodes that lie on the path between two other nodes. Metrics would be applied similarly, by mining for all nodes and subgraphs and applying the metrics to each in order to find significant differences in the structure and values of the metrics, and the results could be distilled similarly into two values (one for each cohort) and a p-value.
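As a brief illustration of how the same pipeline could carry over, the sketch below compares the node-degree distributions of two cohorts of graphs and distills the comparison into two summary values and a p-value. It assumes the NetworkX and SciPy libraries; the function name and the choice of a Mann-Whitney U test are illustrative, not a prescribed design.

import networkx as nx
from scipy.stats import mannwhitneyu

def compare_degree_distributions(cohort_a_graphs, cohort_b_graphs):
    """Distill a node-level metric (degree) into two cohort values and a p-value."""
    degrees_a = [d for g in cohort_a_graphs for _, d in g.degree()]
    degrees_b = [d for g in cohort_b_graphs for _, d in g.degree()]
    _, p_value = mannwhitneyu(degrees_a, degrees_b, alternative="two-sided")
    mean_a = sum(degrees_a) / len(degrees_a)
    mean_b = sum(degrees_b) / len(degrees_b)
    return mean_a, mean_b, p_value

# Tiny hard-coded cohorts, standing in for graphs mined from two groups of records.
cohort_a = [nx.path_graph(5), nx.cycle_graph(6)]
cohort_b = [nx.complete_graph(5), nx.star_graph(4)]
print(compare_degree_distributions(cohort_a, cohort_b))

The same pattern would repeat for each summary, node-level, or subgraph metric, so the results table and p-value filtering described earlier would apply unchanged.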
Regarding visualization, the same guidelines would stand: it would be necessary to have an overview of both cohorts and a dedicated results view for reviewing statistical test results in detail, which could remain very similar to the existing work. However, new representations for displaying the networks and hypotheses would be necessary. For example, because each node is unique, it would not be possible to encode the nodes with color. Color could potentially be used to represent node attributes or additional metric values. 7.2.7 Journaling Some users expressed needing a method to keep track of their progress when exploring a result set. For example, it is important to be able to mark certain results as "reviewed" versus not, and to mark whether reviewed results should be kept. Users also requested a way to annotate results that they found interesting, to indicate possible factors behind the result. Lastly, users requested a way to export annotated results and results marked as important. 7.3 Conclusion This dissertation aims to bridge the gap between the unique needs of event sequence cohort comparison and the limitations of existing tools by providing an understanding of the complex task of event sequence comparison and a visual analytics tool that combines statistics with an interactive visualization to enable more rapid data exploration, hypothesis generation, and insight discovery. Through the implementation of these scalability, design, and interaction principles in a visual analytics tool, CoCo, I present a ready-to-use tool to support this type of comparison, and work with real-world analysts using CoCo shows its utility. This chapter summarizes the work of my dissertation and discusses the opportunities for future work that it opens. Appendix 8: Evolution of CoCo This appendix includes detailed descriptions of the previous versions of CoCo and highlights the changes made during its evolution. 8.1 Version 1 Figure 8.1: The first version of CoCo was largely textual, with results grouped by metric type. Analysts could select the results they wished to view using the metric list in the middle panel. The initial version of CoCo (Figure 8.1) focused on how to implement statistics and how to organize and display the result set. It was largely textual, with five panels. Summary statistics were shown on their own, with side-by-side values for each cohort α and β. The event legend displayed counts for each event and allowed users to filter with a checkbox. The middle panel was the main form of navigation for results: a four-tiered list displayed all possible metrics and the number of hypotheses. The list was static and included all metrics, even those not yet implemented. Users could click on a metric in the list to view the corresponding results. There were no methods of filtering, though results that were subsets of another were grouped together and expanded when a user clicked them. These aggregated results were denoted by a shaded bar to the left of the sequence. The result display was based on the type of metric, and the axes changed depending on which metric was selected. For the prevalence metric, the axes were scaled to the largest percentage, and grey bars grew from the middle to either side to indicate the value in each cohort. A circle was placed to indicate the difference in value between the cohorts, on the side where the value was greater.
8.2 Version 2 The changes in the second version of CoCo were primarily usability based: Organized metric list to suggest analysis order. The metrics list was reorganized based on our observations in the Children's ATLS case study (Section 8): within each category, analysts needed to first look at the prevalence of single events, then whole sequences, then subsequences. Figure 8.2: CoCo version two brought a variety of usability fixes. Custom cohort names. Analysts are able to rename the cohorts. Hover tooltips for contextual information. It was noted that it was hard to remember which colors corresponded to which events, so hover tooltips were introduced to ease this. Exact values for the hypothesis result are also shown. Additional display options allowed all tooltips to be shown or hidden (regardless of hovering). Filtering by p-value. Filtering by p-value was introduced. 8.3 Version 3 Version 3 added more utility for parsing the result set. Figure 8.3: Version 3 added more utility for parsing the result set through methods for filtering and sorting, layout changes, and explicit difference encodings. Methods for filtering and sorting. Users were able to sort by sequence length, in addition to p-value. Methods for sorting the result display were also introduced. Previous versions defaulted to an ordering based on p-value and difference size, which remained the default sorting method, but more controls were provided so users could sort by p-value only, difference size only, or the raw value of the result for either cohort A or B. Usability changes to layout. The layout was rearranged and lightened to provide more detail in the result view. Explicit differences in overview statistics. A third column was added to the summary statistics and event legend to show the percent difference between the two cohorts, colored by which cohort was larger (green = Cohort A, red = Cohort B). 8.4 Version 4 Figure 8.4: CoCo v4 introduced important changes in the way sequences and hypothesis results were displayed. In previous versions of CoCo, analysts commented on the similarity of CoCo's result display to statistical error visualizations. Because of the case study partners' familiarity with statistical error bars, I explored alternatives for displaying hypothesis results. The fourth version of CoCo displayed the common percentage along the middle of the result row; a bar (colored by p-value group) then grew to the left or right to indicate the size and significance of the difference. In this version, analysts were more clearly able to see where the major differences in their datasets were. This version also introduced iconography for differentiating the different types of sequences, because users expressed wanting to see all results of a metric regardless of sequence type. Thus, icons were developed to indicate which sequences were consecutive versus non-consecutive and whole records versus partial. 8.5 Version 5 Figure 8.5: The fifth version of CoCo introduced the most major changes: removing the metrics list, redesigning hypothesis results, adding a sequence scatterplot, and details on demand. The fifth version of CoCo introduced the most major changes. Removal of metrics list. Through case studies with analysts, it became clear that analysts did not necessarily care which metric a result came from. That is, instead of checking each metric result group individually, they wished to answer the question "what are the top 10 differences," regardless of metric type.
In lieu of the metrics list, filters were added so users could still filter by metric or sequence type. Redesign of hypothesis result visualization. The most major change resulting from listing all hypothesis results together was the challenge of displaying results that require different units (percentages versus frequencies versus elapsed times). As a result, the axes were changed to ratios and the absolute values of the results were removed from the visual encoding. The center (where the absolute values had been displayed) was replaced with a visual representation of the hypothesis. Sequence scatterplot. A sequence scatterplot was added to provide an overview of how individual sequences occur throughout the dataset. Details on demand. Clicking a result provided details on demand, including a histogram for results with a distribution of values and statistics about sample size, minimum, maximum, and standard deviation. Event icons converted to rectangles. The event icons were converted to rectangles in order to save space. 8.6 Version 6 Figure 8.6: The final version of CoCo (v6) streamlined the process model observed through the case studies. The final version of CoCo streamlined the process model observed through the case studies: the layout was rearranged to provide an overview first and details about specific results last. A stacked EventFlow chart was added to provide high-level overviews of each dataset. Summary and event statistics were removed to focus the analysis on event metric results, since those statistics were only looked at once at the beginning of the analysis and could easily be duplicated by other tools. Appendix 9: Case Study Questionnaires Figure 9.1: Entry questionnaire, page 1. Figure 9.2: Entry questionnaire, page 2. Figure 9.3: Exit questionnaire, page 1. Figure 9.4: Exit questionnaire, page 2. Figure 9.5: Exit questionnaire, page 3. Bibliography [1] Megan Monroe, Rongjian Lan, Hanseung Lee, Catherine Plaisant, and Ben Shneiderman. Temporal event sequence simplification. IEEE Transactions on Visualization and Computer Graphics, 19(12):2227–2236, December 2013. [2] Megan Monroe, Tamra E. Meyer, Catherine Plaisant, Rongjian Lan, Krist Wongsuphasawat, Trinka S. Coster, Sigfried Gold, Jeff Millstein, and Ben Shneiderman. Visualizing patterns of drug prescriptions with EventFlow: A pilot study of asthma medications in the military health system. 2013. [3] Elizabeth Carter, Randall Burd, Megan Monroe, Catherine Plaisant, and Ben Shneiderman. Using EventFlow to analyze task performance during trauma resuscitation. Proceedings of the Workshop on Interactive Systems in Healthcare (WISH 2013), 2013. [4] John Alexis Guerra-Gómez, Krist Wongsuphasawat, Taowei David Wang, Michael L. Pack, and Catherine Plaisant. Analyzing incident management event sequences with interactive visualization. 2011. [5] Krist Wongsuphasawat and David Gotz. Exploring flow, factors, and outcomes of temporal event sequences with the Outflow visualization. IEEE Transactions on Visualization and Computer Graphics, 18(12):2659–2668, 2012. [6] Michael Gleicher, Danielle Albers, Rick Walker, Ilir Jusufi, Charles D. Hansen, and Jonathan C. Roberts. Visual comparison for information visualization. Information Visualization, 10(4):289–309, September 2011. [7] Charles D. Stolper, Adam Perer, and David Gotz. Progressive visual analytics: User-driven visual exploration of in-progress analytics. IEEE Transactions on Visualization and Computer Graphics, 20(12):1653–1662, 2014.
[8] Jian Zhao, Zhicheng Liu, Mira Dontcheva, Aaron Hertzmann, and Alan Wilson. MatrixWave: Visual comparison of event sequence data. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI '15, pages 259–268, New York, NY, USA, 2015. ACM. [9] Katerina Vrotsou, Anders Ynnerman, and Matthew Cooper. Are we what we do? Exploring group behaviour through user-defined event-sequence similarity. Information Visualization, 13(3):232–247, 2014. [10] Paul D. Allison. Discrete-time methods for the analysis of event histories. Sociological Methodology, 13(1):61–98, 1982. [11] Srivatsan Laxman and P. S. Sastry. A survey of temporal data mining. Sadhana, 31(2):173–198, April 2006. [12] Yuan Chen, Fiona Cunningham, Daniel Rios, William M. McLaren, James Smith, Bethan Pritchard, Giulietta M. Spudich, Simon Brent, Eugene Kulesha, Pablo Marin-Garcia, Damian Smedley, Ewan Birney, and Paul Flicek. Ensembl variation resources. BMC Genomics, 11(1):293, January 2010. [13] Marc Fiume, Vanessa Williams, Andrew Brook, and Michael Brudno. Savant: Genome browser for high-throughput sequencing data. Bioinformatics, 26(16):1938–44, August 2010. [14] W. J. Kent, C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. The Human Genome Browser at UCSC. Genome Research, 12(6):996–1006, May 2002. [15] Helga Thorvaldsdóttir, James T. Robinson, and Jill P. Mesirov. Integrative Genomics Viewer (IGV): High-performance genomics data visualization and exploration. Briefings in Bioinformatics, 14(2):178–92, March 2013. [16] Jun Wang, Lei Kong, Ge Gao, and Jingchu Luo. A brief introduction to web-based genome browsers. Briefings in Bioinformatics, 14(2):131–43, March 2013. [17] Florin Chelaru, Llewellyn Smith, Naomi Goldstein, and Hector Corrada Bravo. Epiviz: Interactive visual analytics for functional genomics data. Nature Methods, 11(9):938–940, September 2014. [18] Miriah Meyer, Tamara Munzner, and Hanspeter Pfister. MizBee: A multiscale synteny browser. IEEE Transactions on Visualization and Computer Graphics, 15(6):897–904, January 2009. [19] Joel A. Ferstay, Cydney B. Nielsen, and Tamara Munzner. Variant View: Visualizing sequence variants in their gene context. IEEE Transactions on Visualization and Computer Graphics, 19(12):2546–55, December 2013. [20] Ethan Cerami, Jianjiong Gao, Ugur Dogrusoz, Benjamin E. Gross, Selcuk Onur Sumer, Bülent Arman Aksoy, Anders Jacobsen, Caitlin J. Byrne, Michael L. Heuer, Erik Larsson, Yevgeniy Antipin, Boris Reva, Arthur P. Goldberg, Chris Sander, and Nikolaus Schultz. The cBio cancer genomics portal: An open platform for exploring multidimensional cancer genomics data. Cancer Discovery, 2(5):401–4, May 2012. [21] Nathan D. Dees, Qunyuan Zhang, Cyriac Kandoth, Michael C. Wendl, William Schierding, Daniel C. Koboldt, Thomas B. Mooney, Matthew B. Callaway, David Dooling, Elaine R. Mardis, Richard K. Wilson, and Li Ding. MuSiC: Identifying mutational significance in cancer genomes. Genome Research, 22(8):1589–98, August 2012. [22] Peter F. Brown, Peter V. DeSouza, Robert L. Mercer, Vincent J. Della Pietra, and Jenifer C. Lai. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, December 1992. [23] Anthony Don, Elena Zheleva, Machon Gregory, Sureyya Tarkan, Loretta Auvil, Tanya Clement, Ben Shneiderman, and Catherine Plaisant. Discovering interesting usage patterns in text collections. In Proc.
16th ACM Conference on Information and Knowledge Management (CIKM '07), page 213, New York, NY, USA, November 2007. ACM Press. [24] Magdalena Jankowska, Vlado Keselj, and Evangelos Milios. Relative N-gram signatures: Document visualization at the level of character N-grams. In 2012 IEEE Conference on Visual Analytics Science and Technology (VAST), pages 103–112. IEEE, October 2012. [25] Fernanda B. Viégas, Martin Wattenberg, and Kushal Dave. Studying cooperation and conflict between authors with history flow visualizations. In Proc. 2004 Conference on Human Factors in Computing Systems (CHI '04), pages 575–582, New York, NY, USA, April 2004. ACM Press. [26] Tamara Munzner, François Guimbretière, Serdar Tasiran, Li Zhang, and Yunhong Zhou. TreeJuxtaposer: Scalable tree comparison using Focus+Context with guaranteed visibility. In ACM SIGGRAPH 2003, page 453, New York, NY, USA, 2003. ACM Press. [27] S. Bremm, T. von Landesberger, M. Hess, T. Schreck, P. Weil, and K. Hamacher. Interactive visual comparison of multiple trees. In Proc. 2011 IEEE Conference on Visual Analytics Science and Technology (VAST), pages 31–40, 2011. [28] Danny Holten and Jarke J. van Wijk. Visual comparison of hierarchically organized data. Computer Graphics Forum, 27(3):759–766, May 2008. [29] John Alexis Guerra-Gómez, Michael L. Pack, Catherine Plaisant, and Ben Shneiderman. Visualizing changes over time in datasets using dynamic hierarchies. IEEE Transactions on Visualization and Computer Graphics, 19(12):2566–2575, 2013. [30] Viv Bewick, Liz Cheek, and Jonathan Ball. Statistics review 12: Survival analysis. Critical Care, 8(5):389–94, 2004. [31] David Collett. Modelling Survival Data in Medical Research (2nd ed.). Chapman and Hall/CRC Press, 2003. [32] Mathieu Dupont, Arnaud Gacouin, Hervé Lena, Sylvain Lavoué, Graziella Brinchault, Philippe Delaval, and Rémi Thomas. Survival of patients with bronchiectasis after the first ICU stay for respiratory failure. Chest, 125(5):1815–20, May 2004. [33] Manish K. Goel, Pardeep Khanna, and Jugal Kishore. Understanding survival analysis: Kaplan-Meier estimate. International Journal of Ayurveda Research, 1(4):274–278, October 2010. [34] Zhiyuan Zhang, David Gotz, and Adam Perer. Iterative cohort analysis and exploration. Information Visualization, March 2014. [35] Oracle. Oracle Health Sciences Cohort Explorer User's Guide. Technical report, Oracle, 2011. [36] John W. Tukey. Exploratory Data Analysis. Pearson, 1st edition, 1977. [37] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188, August 2001. [38] Juliet P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46(1):561–584, 1995. [39] Guimei Liu, Mengling Feng, Yue Wang, Limsoon Wong, See-Kiong Ng, Tzia Liang Mah, and Edmund Jon Deoon Lee. Towards exploratory hypothesis testing and analysis. In Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE '11, pages 745–756, Washington, DC, USA, 2011. IEEE Computer Society. [40] M. Gupta, Jing Gao, C. C. Aggarwal, and Jiawei Han. Outlier detection for temporal data: A survey. IEEE Transactions on Knowledge and Data Engineering, 26(9):2250–2267, September 2014. [41] Nizar R. Mabroukeh and Christie I. Ezeife. A taxonomy of sequential pattern mining algorithms. ACM Computing Surveys, 43(1):3:1–3:41, November 2010. [42] Jiawei Han, Hong Cheng, Dong Xin, and Xifeng Yan.
Frequent pattern mining: Current status and future directions. Data Mining and Knowledge Discovery, 15(1):55–86, 2007. [43] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD '93, pages 207–216, New York, NY, USA, June 1993. ACM. [44] Stephen D. Bay and Michael J. Pazzani. Detecting group differences: Mining contrast sets. Data Mining and Knowledge Discovery, 5(3):213–246, 2001. [45] Miguel R. Álvarez, Paulo Félix, and Purificación Cariñena. Discovering metric temporal constraint networks on temporal databases. Artificial Intelligence in Medicine, 58(3):139–54, July 2013. [46] Riccardo Bellazzi, Lucia Sacchi, and Stefano Concaro. Methods and tools for mining multivariate temporal data in clinical and biomedical applications. Proc. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2009:5629–32, January 2009. [47] S. Concaro, L. Sacchi, C. Cerra, P. Fratino, and R. Bellazzi. Mining health care administrative data with temporal association rules on hybrid events. Methods of Information in Medicine, 50(2):166–79, January 2011. [48] Yong Joon Lee, Jun Wook Lee, Duck Jin Chai, Bu Hyun Hwang, and Keun Ho Ryu. Mining temporal interval relational rules from temporal data. Journal of Systems and Software, 82(1):155–167, 2009. [49] Philippe Fournier-Viger, Usef Faghihi, Roger Nkambou, and Engelbert Mephu Nguifo. CMRules: Mining sequential rules common to several sequences. Knowledge-Based Systems, 25(1):63–76, February 2012. [50] R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. 11th International Conference on Data Engineering, pages 3–14. IEEE Computer Society Press, 1995. [51] Heikki Mannila, Hannu Toivonen, and A. Inkeri Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, September 1997. [52] Adam Perer and Fei Wang. Frequence: Interactive mining and visualization of temporal frequent event sequences. In Proceedings of the 19th International Conference on Intelligent User Interfaces, IUI '14, pages 153–162, New York, NY, USA, 2014. ACM. [53] G. Niklas Norén, Johan Hopstadius, Andrew Bate, Kristina Star, and I. Ralph Edwards. Temporal pattern discovery in longitudinal electronic patient records. Data Mining and Knowledge Discovery, 20(3):361–387, November 2009. [54] Robert Moskovitch and Yuval Shahar. Medical temporal-knowledge discovery via temporal abstraction. Proc. AMIA Annual Symposium, 2009:452–6, January 2009. [55] Denis Klimov, Yuval Shahar, and Meirav Taieb-Maimon. Intelligent visualization and exploration of time-oriented data of multiple patients. Artificial Intelligence in Medicine, 49(1):11–31, May 2010. [56] Iyad Batal, Lucia Sacchi, Riccardo Bellazzi, and Milos Hauskrecht. A temporal abstraction framework for classifying clinical temporal data. Proc. AMIA Annual Symposium, 2009:29–33, January 2009. [57] Katerina Vrotsou and Aida Nordman. Interactive visual sequence mining based on pattern-growth. In 2014 IEEE Conference on Visual Analytics Science and Technology (VAST '14), pages 285–286, October 2014. [58] Tim Lammarsch, Wolfgang Aigner, Alessio Bertone, Silvia Miksch, and Alexander Rind. Mind the time: Unleashing temporal aspects in pattern discovery. Computers & Graphics, 38:38–50, February 2014.
[59] Paolo Federico, Jürgen Unger, Albert Amor-Amorós, Lucia Sacchi, Denis Klimov, and Silvia Miksch. Gnaeus: Utilizing clinical guidelines for knowledge-assisted visualisation of EHR cohorts. In Enrico Bertini and Jonathan C. Roberts, editors, Proceedings of the EuroVis Workshop on Visual Analytics (EuroVA '15). The Eurographics Association, 2015. [60] Danyel Fisher, Igor Popov, Steven Drucker, and m.c. schraefel. Trust me, I'm partially right: Incremental visualization lets analysts explore large datasets faster. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 1673–1682, New York, NY, USA, 2012. ACM. [61] Zhicheng Liu, Biye Jiang, and Jeffrey Heer. imMens: Real-time visual querying of big data, 2013. [62] Timos K. Sellis. Multiple-query optimization. ACM Transactions on Database Systems, 13(1):23–52, March 1988. [63] Megan Monroe. Interactive Event Sequence Query and Transformation. PhD thesis, University of Maryland, College Park, MD, USA, 2014. [64] TIBCO. Spotfire. http://spotfire.tibco.com/, March 2014. [65] Tableau Software. Tableau. http://www.tableausoftware.com/, March 2014. [66] Armin Ronacher. Flask (a Python microframework), April 2016. [67] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. [Online; accessed 2016-04-07]. [68] Olive Jean Dunn. Multiple comparisons among means. Journal of the American Statistical Association, 56(293):52–64, 1961. [69] Taowei David Wang, Krist Wongsuphasawat, Catherine Plaisant, and Ben Shneiderman. Visual information seeking in multiple electronic health records: Design recommendations and a process model. In Proceedings of the 1st ACM International Health Informatics Symposium, IHI '10, pages 46–55, New York, NY, USA, 2010. ACM. [70] Heidi Lam, Enrico Bertini, Petra Isenberg, Catherine Plaisant, and Sheelagh Carpendale. Empirical studies in information visualization: Seven scenarios. IEEE Transactions on Visualization and Computer Graphics, 18(9):1520–1536, November 2011. [71] Ben Shneiderman and Catherine Plaisant. Strategies for evaluating information visualization tools: Multi-dimensional in-depth long-term case studies. In Proceedings of the 2006 AVI Workshop on BEyond Time and Errors: Novel Evaluation Methods for Information Visualization, BELIV '06, pages 1–7, New York, NY, USA, 2006. ACM. [72] Susan E. Andrade, Kristijan H. Kahler, Feride Frech, and K. Arnold Chan. Methods for evaluation of medication adherence and persistence using automated databases. Pharmacoepidemiology and Drug Safety, 15(8):565–574, 2006. [73] Sana Malik and Eunyee Koh. High-volume hypothesis testing for large-scale web log analysis. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA '16, pages 1583–1590, New York, NY, USA, 2016. ACM. [74] Michelle K. Smith, Francis H. M. Jones, Sarah L. Gilbert, and Carl E. Wieman. The Classroom Observation Protocol for Undergraduate STEM (COPUS): A new instrument to characterize university STEM classroom practices. CBE-Life Sciences Education, 12(4):618–627, 2013. [75] Using EventFlow and CoCo to explore classroom activity patterns and learner performance, 2016. [76] Nalini Sathiakumar, Elizabeth Delzell, Michael Morrisey, Carla Falkson, Mellissa Yong, Victoria Chia, Justin Blackburn, Tarun Arora, and Meredith Kilgore.
Mortality following bone metastasis and skeletal-related events among patients 65 years and above with lung cancer: A population-based analysis of U.S. Medicare beneficiaries, 1999–2006. Lung India, 30(1):20–26, 2013. [77] Mette Nørgaard, Annette Østergaard Jensen, Jacob Bonde Jacobsen, Kara Cetin, Jon P. Fryzek, and Henrik Toft Sørensen. Skeletal related events, bone metastasis and survival of prostate cancer: A population based cohort study in Denmark (1999 to 2007). The Journal of Urology, 184(1):162–167, 2010. [78] M. J. Lage, B. L. Barber, D. J. Harrison, and S. Jun. The cost of treating skeletal-related events in patients with prostate cancer. American Journal of Managed Care, 14(5):317–22, May 2008. [79] William F. Hartsell, Charles B. Scott, Deborah Watkins Bruner, Charles W. Scarantino, Robert A. Ivker, Mack Roach, John H. Suh, William F. Demas, Benjamin Movsas, Ivy A. Petersen, Andre A. Konski, Charles S. Cleeland, Nora A. Janjan, and Michelle DeSilvio. Randomized trial of short- versus long-course radiotherapy for palliation of painful bone metastases. Journal of the National Cancer Institute, 97(11):798–804, 2005. [80] Stephen T. Lutz, Joshua Jones, and Edward Chow. Role of radiation therapy in palliative care of the patient with cancer. Journal of Clinical Oncology, 2014. [81] F. Du, B. Shneiderman, C. Plaisant, S. Malik, and A. Perer. Coping with volume and variety in temporal event sequences: Strategies for sharpening analytic focus. IEEE Transactions on Visualization and Computer Graphics, PP(99):1–1, 2016. [82] Tanushree Mitra and Eric Gilbert. CREDBANK: A large-scale social media corpus with associated credibility annotations, 2015. [83] Cody Buntain, Jennifer Golbeck, Brooke Liu, and Gary LaFree. Evaluating public response to the Boston Marathon bombing and other acts of terrorism through Twitter, 2016. [84] Sports Reference, LLC. WAR explained. [85] Ronald Fisher. Statistical Methods for Research Workers. Oliver and Boyd, 1925. [86] Matthew Louis Mauriello, Ben Shneiderman, Fan Du, Sana Malik, and Catherine Plaisant. Simplifying overviews of temporal event sequences. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA '16, pages 2217–2224, New York, NY, USA, 2016. ACM.