ABSTRACT

Title of Dissertation: EVALUATING MACHINE INTELLIGENCE WITH QUESTION ANSWERING

Pedro Rodriguez, Doctor of Philosophy, 2021

Dissertation directed by: Professor Jordan Boyd-Graber
Department of Computer Science, College of Information Studies, Language Science Center, Institute for Advanced Computer Studies

Humans ask questions to learn about the world and to test knowledge understanding. The ability to ask questions combines aspects of intelligence unique to humans: language understanding, knowledge representation, and reasoning. Thus, building systems capable of intelligent question answering (qa) is a grand goal of natural language processing (nlp). To measure progress in nlp, we create "exams" for computer systems and compare their effectiveness against a reference point, often based on humans. How precisely we measure progress depends on whether our goal is to build computer systems that optimize human satisfaction in information-seeking tasks or to measure progress towards intelligent qa. In the first part of this dissertation, we explore each goal in turn, show how they differ, and describe their relationship to qa formats. As an example of an information-seeking evaluation, we introduce a new dialog qa task paired with a new evaluation method. Afterward, we turn our attention to using qa to evaluate machine intelligence.

A good evaluation should be able to discriminate between lesser and more capable qa models. This dissertation explores three ways to improve the discriminative power of qa evaluations: (1) dynamic weighting of test questions, (2) a format that by construction tests multiple levels of knowledge, and (3) evaluation data that is created through human-computer collaboration.

By dynamically weighting test questions, we challenge a foundational assumption of the de facto standard in qa evaluation: the leaderboard. Namely, we contend that, contrary to nearly all qa and nlp evaluations, which implicitly assign equal weights to examples by averaging scores, examples are not equally useful for estimating machine (or human) qa ability. As any student may tell you, not all questions on an exam are equally difficult, and in the worst case, questions are unsolvable. Drawing on decades of research in educational testing, we propose adopting an alternative evaluation methodology, Item Response Theory, that is widely used to score human exams (e.g., the sat). We show that dynamically weighting questions improves the reliability of leaderboards in discriminating between models of differing qa ability while also helping in the construction of new evaluation datasets.

Having improved the scoring of models, we next turn to improving the format and data in qa evaluations. Our idea is simple. In most qa tasks (e.g., Jeopardy!), each question tests a single level of knowledge; in our task (the trivia game Quizbowl), each question tests multiple levels of knowledge. Since each question tests multiple levels of knowledge, this decreases the likelihood that we learn nothing about the difference between two models (i.e., that they are both correct or both wrong), which substantially increases discriminative power. Despite the improved format, we next show that while our qa models defeat accomplished trivia players, they are overly reliant on brittle pattern matching, which indicates a failure to intelligently answer questions.
To mitigate this problem, we introduce a new framework for building evaluation data where humans and machines cooperatively craft trivia questions that are difficult to answer through clever pattern matching tricks alone, while being no harder for humans. We conclude by sketching a broader vision for qa evaluation that combines the three components of evaluation we improve (scoring, format, and data) to create living evaluations and re-imagine the role of leaderboards.

EVALUATING MACHINE INTELLIGENCE WITH QUESTION ANSWERING

by Pedro Rodriguez

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2021

Advisory Committee:
Professor Jordan Boyd-Graber, Chair
Professor Douglas W. Oard (Dean's Representative)
Professor Leilani Battle
Professor Leo Zhicheng Liu
Professor John P. Lalor

© Copyright by Pedro Rodriguez, 2021

Dedication

In memory of the Apollo 1, Challenger, and Columbia crews. Their bravery and dedication to exploration, discovery, and science has inspired me, and always will.

Apollo 1: Virgil I. Grissom, Edward H. White II, Roger B. Chaffee
Challenger: Francis R. Scobee, Michael J. Smith, Ronald McNair, Ellison Onizuka, Judith Resnik, Gregory Jarvis, Christa McAuliffe
Columbia: Rick Husband, William C. McCool, Michael P. Anderson, Kalpana Chawla, David M. Brown, Laurel Clark, Ilan Ramon

"From our orbital vantage point, we observe an Earth without borders, full of peace, beauty and magnificence, and we pray that humanity as a whole can imagine a borderless world as we see it, and strive to live as one in peace."
- William C. McCool, January 29, 2003

Acknowledgments

I am deeply grateful to all my mentors, colleagues, friends, and family for their nurturing, love, and encouragement over the years. Their support helped me grow as a researcher, craft this dissertation, and achieve my dream.

First, I would like to thank my advisor, Jordan Boyd-Graber, for his steady guidance through graduate school. Although at first things like abiding by a book-length style guide seemed arcane, through time I came to appreciate and (attempt to) adopt the same attention to detail in all my work. Among the many things Jordan taught me, I am especially grateful for his guidance in improving as a writer, and by extension as a critical thinker, and his emphasis on digging deep into problems and literature. I am also grateful for the freedom Jordan gave me to try research ideas, the patience to let me learn from my mistakes and failed projects, and the confidence in me when I tried a new idea. This helped me discover and explore my interests while building an understanding of how to conduct research. For this and the countless other things I learned via osmosis, thank you.

Thank you as well to the members of my dissertation committee. Doug's uncanny ability to ask simple, yet deep-cutting fundamental questions was crucial in helping me develop and improve several ideas. John's expertise and guidance in item response theory helped me apply it to question answering evaluation, and I'm proud of the resulting paper. Both Leilani's proposal feedback and Leo's visualization class helped inspire the user-centered perspective I take in laying out future work.

Being part of the clip community is among the best parts of my PhD experience. Through clip, I met many collaborators, friends, and colleagues.
Among the wonderful individuals I'd like to thank are: Yogarshi Vyas, Joe Barrow, and Han-Chin Shing for all the board game nights, bottomless boba tea, and reflections on being a PhD student; Joe Barrow for introducing me to the wonderful biking around College Park, even that one time we accidentally went mountain biking on road bikes; Shi Feng for being a friend and collaborator who always seems to ask the right questions; and many former and current clip members including Mohit Iyyer, Alvin Grissom II, Eric Wallace, Kianté Brantley, Pranav Goel, Peter Rankel, Alexander Hoyle, Matthew Shu, and Fenfei Guo. Finally, thank you to the clip faculty for creating and nurturing such an amazing community that I am proud to be a part of.

Although I completed my PhD at umd, my adventure began at cu in beautiful Boulder, Colorado. In those early years, I fell in love with the boundless outdoors and am grateful for the time I had to explore Colorado's skiing, rock climbing, and hiking. In my time there, I met several friends and mentors whom I'd like to thank: Dirk Grunwald for long discussions on anything from hardware to research to backcountry skiing; Rafael Frongillo for supporting the Data Science Team which, in retrospect, likely influenced my later interest in leaderboards; Davis Yoshida for intense and insightful research discussions; and Alvin Grissom II for further opening my eyes to racial injustices and giving me the opportunity to become involved in service to the acl community. At both cu Boulder and umd, I had amazing graduate program advisors in Rajshree Shrestha and Tom Hurst.

Through internships, I was fortunate to work with fantastic industry mentors whose diverse perspectives taught me different ways to frame, approach, think about, and solve research problems. At Riot Games, I am grateful to Xiaoyang Yang, Wes Kerr, Ben Kasper, and Kimberly Voll for pushing me to think hard about the expected impacts of a potential project before diving head-first into it. At Google, I had many insightful discussions with Jannis Bulian and Benjamin Börschinger. When working in the engineering trenches of a research project, a mistake I often struggled with, especially early on, was losing sight of the science. Although I'm sure he wasn't the first or only person to point this out to me, Paul Crook (Facebook Research) gave me a firm but gentle reminder, at the right moment, of something I once knew: that a critical part of science lies in making worthwhile, testable hypotheses and then devising ways to test them. This small nudge eventually blossomed into the best parts of my emnlp internship paper. In addition to Paul, thank you to Shane Moon, Stephen Wang, and Becka Silvert.

Throughout this journey, close friends and family have been an ever-present source of loving support through the best and worst of times. To Stan and Kasandra Bessey, thank you for all the ways, small and large, you've unconditionally supported me. To my god-mother Pearline, thank you for your unconditional love and for sharing your joy for life with not just me, but generations of my family. To my dad Chago, thank you for nurturing my curiosity and love for science while kindling my desire for exploration through the outdoors. To my mom Lourdes, thank you for your boundless and unconditional love, for always believing in me, and for being there when I needed you most. To my brother Fritz, you've always (lovingly) been my most honest critic; more importantly though, you've always been there for me.
Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures

1 Introduction
  1.1 Overview
  1.2 Conversational Information-Seeking
  1.3 Questions are Not Equally Informative of Ability
  1.4 Incremental QA for Polytomous Evaluation of Knowledge
  1.5 Crafting Robust Questions Through Cooperative Machine-Human Authoring
  1.6 Roadmap

2 Background
  2.1 The Epistemological Heritage of Question Answering
    2.1.1 The Cranfield Paradigm
    2.1.2 The Manchester Paradigm
    2.1.3 Common Ground
  2.2 Question Answering Dataset Characteristics
    2.2.1 Answer Formats
    2.2.2 Information Context
    2.2.3 Generative Story
  2.3 Question Answering Datasets
  2.4 Methods for Question Answering
    2.4.1 A Brief History of Symbolic Question Answering Models
    2.4.2 Modern Methods for Question Answering
    2.4.3 Neural Text Encoders
    2.4.4 Document Retrieval
    2.4.5 Neural Document Retrieval
    2.4.6 Neural Answering Module
  2.5 Prior and Related Work in Quizbowl
  2.6 The Harrowing Past of Measuring Human Intelligence

3 Information Seeking in the Spirit of Learning: A Dataset for Conversational Curiosity
  3.1 Motivation for Conversational Information-Seeking
  3.2 Building the Curiosity Dataset
    3.2.1 Choosing the Geographic Topics, Aspects, and Facts for the Dataset
    3.2.2 User and Assistant Dialog Interfaces
    3.2.3 Dialog Act Annotation
    3.2.4 Validating Data Quality
  3.3 Dataset Analysis
    3.3.1 What Facts do User Prefer?
      3.3.1.1 Likes for Explicit Preference Elicitation
      3.3.1.2 Mining Acts for Implicit Preferences
      3.3.1.3 Paraphrase Analysis
    3.3.2 Dataset Statistics
  3.4 Models
    3.4.1 Text Representation
    3.4.2 Dialog Representation
    3.4.3 Tasks and Loss Functions
    3.4.4 Model Implementation and Training Details
  3.5 Modeling Experiments
    3.5.1 Baselines
    3.5.2 Discussion
  3.6 Related Work
  3.7 Future Work and Conclusion

4 Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards?
  4.1 Leaderboards are Shiny
    4.1.1 How to Direct Leaderboards' Light
  4.2 A Generative Story for Leaderboards
    4.2.1 Examples are Not Equally Useful
    4.2.2 Inference
  4.3 Ranking and Comparing Subjects
  4.4 IRT for Leaderboards
    4.4.1 Why a Linear Model Baseline
    4.4.2 Response Prediction is Accurate
      4.4.2.1 SQuaD Leaderboard Data
      4.4.2.2 Evaluation Scheme
      4.4.2.3 IRT Response Prediction is Accurate
      4.4.2.4 What Model Features are Predictive?
    4.4.3 Ranking with IRT
      4.4.3.1 IRT Rankings Have Better Reliability
    4.4.4 Statistical Significance of Difference in Kendall Tau Coefficients
    4.4.5 IRT Improves Cold Start Reliability
  4.5 Qualitative Insights on Leaderboards
    4.5.1 Guiding Analysis with IRT
    4.5.2 Identifying Annotation Error
  4.6 Related Work
  4.7 Conclusion
  4.8 Limitations
  4.9 Future Work

5 The Case for Incremental Question Answering Evaluation
  5.1 An Introduction to Quizbowl
  5.2 Why Quizbowl?
    5.2.1 What is a Buzzer Race?
    5.2.2 Pyramidality and Buzzers
    5.2.3 The Craft of Question Writing
    5.2.4 Quizbowl for Natural Language Processing Research
    5.2.5 Quizbowl for Machine Learning Research
  5.3 The QANTA Dataset
    5.3.1 Sources of Quizbowl Questions
    5.3.2 QANTA Questions
    5.3.3 Gameplay Records: Recording Humans Playing Quizbowl Online
    5.3.4 Preprocessing
  5.4 Deciding When and What to Answer
  5.5 Guessing QB Answers
    5.5.1 Explicit Pattern Matching with Information Retrieval
    5.5.2 Trainable Pattern Matching with Linear Models
    5.5.3 Neural Network Models
  5.6 Buzzing
    5.6.1 A Classification Approach to Buzzing
  5.7 Offline Evaluation
    5.7.1 Evaluating the Guesser
    5.7.2 Identifying Sources of Error
    5.7.3 Evaluating the Buzzer
  5.8 Live Exhibition Events
  5.9 Related Work
    5.9.1 Question Answering Datasets
    5.9.2 Answer Triggering and Model Calibration
    5.9.3 Opponent Modeling
  5.10 Future Work
    5.10.1 Generalization in Factoid Question Answering
    5.10.2 Robust, Trustable, and Explainable Machine Learning
  5.11 Conclusion

6 Centaur Authoring of Adversarial Questions
  6.1 Introduction
  6.2 Adversarial Evaluation for nlp
    6.2.1 Putting a Human in the Loop
  6.3 Our QA Testbed: Quizbowl
    6.3.1 Known Exploits of Quizbowl Questions
    6.3.2 Models and Datasets
    6.3.3 Interpreting Quizbowl Models
    6.3.4 Adversarial Writing Interface
    6.3.5 Question Authors
    6.3.6 How an Author Writes a Question
  6.4 A New Adversarially-Authored Dataset
    6.4.1 Validating Questions with Quizbowlers
  6.5 Computer Experiments
    6.5.1 First Round Attacks: IR Adversarial Questions Transfer To All Models
    6.5.2 Second Round Attacks: RNN Adversarial Questions are Brittle
    6.5.3 Humans vs. Computer, Live (again)!
  6.6 What Makes Adversarially-authored Questions Hard?
    6.6.1 Quantitative Differences in Questions
    6.6.2 Categorizing Adversarial Phenomena
    6.6.3 Adversarial Category 1: Reasoning
    6.6.4 Adversarial Category 2: Distracting Clues
  6.7 How Do Interpretations Help?
    6.7.1 Interviews With Adversarial Authors
  6.8 Related Work
  6.9 Conclusion

7 Conclusion
  7.1 Future Work: Living Evaluations
    7.1.1 Incorporating Content Models into irt Models
    7.1.2 Guiding Example Creation
    7.1.3 Tender Loving Care for Underperforming Examples
    7.1.4 Multi-Examples
    7.1.5 Effects of Continually Updating Evaluations
    7.1.6 Model Cards
  7.2 What More Should Leaderboards Do?
    7.2.1 Stakeholders
    7.2.2 Ranking View
    7.2.3 Model View
    7.2.4 Example View
  7.3 Future Directions in Item Response Theory for NLP
    7.3.1 Multidimensional Clustering
    7.3.2 Statistical Testing
  7.4 Reflections and Synthesis

A Curiosity
  A.1 Components of Dialog Interfaces
    A.1.1 User's Interface
    A.1.2 Assistant Interface
  A.2 Dialog Act Annotation
  A.3 Sample Dialogs
  A.4 Paraphrase Analysis and Samples
  A.5 Like Prediction Comparison
  A.6 Model Training, Implementation, and Computation
  A.7 MS Marco Conversational Sample Queries

B Leaderboard
  B.1 SQuAD Item Examples
  B.2 Logistic Regression Features
  B.3 IRT Model Type Correlation
  B.4 Ranking Stability Experiments
    B.4.1 Development and Test Set Correlations
  B.5 The IRT Statistical Test
  B.6 Multidimensional IRT Clustering
  B.7 Reproducibility Checklist
    B.7.1 Software and Parameters
    B.7.2 Hyperparameters
    B.7.3 Computational Resources

C Quizbowl
  C.1 Preprocessing
    C.1.1 Aligning and De-duplicating Questions
    C.1.2 Textual Preprocessing
    C.1.3 Fold Assignment
    C.1.4 Matching QB Answers to Wikipedia Pages
    C.1.5 Buzzer features
  C.2 Natural Questions Categories

D Centaur Authorship of Quizbowl Questions
  D.1 Failure of Syntactically Controlled Paraphrase Networks
  D.2 Studio Ousia qb Model

List of Tables

2.1 The table categorizes datasets by the task's paradigm, the Author (human, machine, or a mix), and topic domain.
To be categorized as human-generated, it is not enough for the dataset to consist of naturally occurring text; it must have questions that are authored by humans instead of being automatically generated. In the table, paired human-and-machine symbols distinguish three mixed cases: human-generated questions that are non-trivially checked or modified by a machine system, machine-generated questions that are paraphrased by humans, and question creation that incorporates interactive human-machine cooperation.

3.1 Counts, descriptions, and examples of the dataset's dialog acts.

3.2 Consider a task where each utterance has labels A and B. In the single-label version, each utterance is labeled as either A or B. The table shows the outcome of converting the multi-label version to single-label by creating a row for each example-label combination. Cell values are binary indicators.

3.3 We analyze the paraphrases annotators use through manual categorization. The "Copy" category includes cherry-picked verbatim phrases, verbatim copies, and contextualized copies (e.g., changing a named entity to "it"). The majority of paraphrases are correct and only incorporate the provided fact, but a few weave in other information. 7.2% of paraphrases are either unrelated to the selected facts or paraphrase the fact incorrectly. Overall, 51.2% of messages have valid paraphrases.

3.4 Curiosity has 14,048 dialogs. On average, dialogs have 12.9 utterances. 60% of the assistants' 90,534 utterances were liked.

3.5 The charm model outperforms end-to-end bert on most tasks. We compare fact selection with mrr, dialog act prediction with micro-averaged F1, and like prediction with accuracy. Ablating dialog history degrades context-dependent tasks (fact selection and policy act prediction), but not tasks more dependent on one message.

3.6 A check mark (✓) indicates a dataset has the feature, a qualified check that it does with a caveat, and a cross (✗) that it does not. Conversational ms marco is a search dataset but has inquiry chains we want assistants to induce (exemplar in Appendix A.7). Topical Chat and Search as a Conversation are motivationally similar. While our dataset's combination of (human) annotation is unique, all three datasets are steps forward in resources for conversational information-seeking.

5.1 The qanta dataset is larger than most question answering datasets in qa pairs (120K). However, for most qb instances each sentence in a question can be considered a qa pair, so the true size of the dataset is closer to 650K qa pairs. In Section 5.5, using sentence-level qa pairs for training greatly improves model accuracy. The qanta dataset has more tokens than all other qa datasets. Statistics for qanta 2012 and 2013 only include publicly available data.

5.2 An entry from the gameplay dataset where the player correctly guesses "Atlanta" at word 47. The entry qid matches the proto_id field in the question dataset, where additional information is stored such as the source tournament and year.

5.3 We assign each question in our dataset to either the train, development, or test fold. Questions in the development and test folds come from national championship tournaments, which typically have the highest quality questions.
The development and test folds are temporally separated from the train and development folds to avoid leakage. Questions in each fold are assigned a "guess" or "buzz" association depending on whether they have gameplay data. Unassigned refers to questions for which we could not map their answer strings to Wikipedia titles or for which there did not exist an appropriate page to match to.

5.4 We compare several models by accuracy at start-of-question, end-of-question, and ew. In the table, models are sorted by start-of-question development set accuracy. Standard deviations for non-ir models are derived from five trials; standard deviation is not reported for the ir model since it is deterministic.

5.5 The table is an error breakdown for questions with at least twenty-five training examples. To analyze errors at the start of questions, we randomly sampled fifty errors, and for the end of questions we took all thirty-six errors. End-of-question errors are primarily wrong-country errors, as in Figure 5.16 where the model answers United States instead of Spain. Errors at the start of the question, though, are more diverse. The most common error is guessing the correct answer type, but not the specific member of that type; examples of this error class include answering Albert Einstein instead of Alan Turing, or Iowa instead of Idaho.

5.6 The accuracy (acc), expected wins (ew), and qb score (Score) of each buzzer on the validation set. Both mlp and rnn outperform the static threshold baseline by a large margin, but there is still a considerable gap from the optimal buzzer.

6.1 The topical diversity of the questions in the adversarially authored dataset based on a random sample of 100 questions.

6.2 The adversarially authored questions have similar n-gram overlap to the regular test questions. However, the overlap of the named entities (ne) decreases for ir Adversarial questions.

6.3 A breakdown of the phenomena in the adversarially authored dataset.

6.4 The first category of adversarially authored questions consists of examples that require reasoning. Answer displays the correct answer (all models were incorrect). For these examples, connecting the training and adversarially authored clues is simple for humans but difficult for models.

6.5 The second category of adversarial questions consists of clues that are present in the training data but are written in a distracting manner. Training shows relevant snippets from the training data. Prediction displays the rnn model's answer prediction (always correct on Training, always incorrect on Adversarial).

A.1 Example dialog #1 from Curiosity. (U: User, A: Assistant)

A.2 Example dialog #2 from Curiosity. (U: User, A: Assistant). After mentioning the Green Party, the user asks a specific followup question; we use these interactions to estimate implicit preference.

A.3 A random sample of ten manually labeled paraphrases from the assistant. The top row indicates the label we (the authors) annotated, the middle row the message, and the bottom row the original fact from Wikipedia. The original fact is shown as displayed to crowd-workers, including punctuation tokenization.
A.4 To compare like prediction between models, we randomly sample thirty dialogs and obtain predictions from charm and bert. The table only shows messages where the model predictions disagree and indicates which model was correct. Dialogs are delineated by horizontal lines. Unfortunately, from only these examples we cannot determine why the charm model errs in most of these predictions.

A.5 An exemplar query chain from the conversational variant of ms marco. An ideal assistant should answer these questions and inspire these types of followup questions.

B.1 The linear model integrates a variety of features to determine which are most predictive of a subject responding correctly to an item.

B.2 Table entries are Kendall's τ rank correlation of irt subject ability between rows and columns. Generally, the models agree on the ranking, with irt-feas and irt-disc having the strongest correlation.

B.3 Entries are Kendall's rank correlation between rows and columns. Scores are squad Exact Match (EM) and irt-disc ability.

C.1 A random sample of qb answer strings and their matched Wikipedia pages. Answer mappings are easy to obtain accurately since most failures in exact matching are due to qb-specific syntax that can be accounted for by rule-based matching. Combined with manual annotation to find common non-exact matches, this process succeeds on 119,093 of 132,849.

C.2 A breakdown of Natural Questions example topics using qb categories. Most questions are about pop culture and the distribution includes many fewer questions about Literature and Fine Arts.

D.1 Failure and success cases for scpn. The model fails to create a valid paraphrase of the sentence for 97% of questions.

List of Figures

1.1 An example of information-seeking dialog that the Curiosity dataset aims to support. Assistants should answer user questions and convey information that inspires meaningful followup questions.

1.2 qb questions are multi-sentence and reference uniquely identifiable answers (e.g., the Apollo Program). They begin with difficult clues and steadily progress to easy "giveaway" clues by the last sentence. In this question, the early clues are unknown to most while the final clue is known to most school-aged children. Here we show human gameplay data where players answered incorrectly (✗) and correctly (✓). Having above-average knowledge of this topic, my (correct) guess is also shown.

2.1 The Winograd Schema Challenge (Levesque et al., 2011) examples have two main ingredients. First, a statement containing two noun phrases and a pronoun which could refer to either ("trophy," "suitcase," and "it" in this example). Second, a single trigger word that, when changed between two alternatives ("big" and "small"), changes which noun phrase the pronoun should resolve to. Well-crafted examples should expose models that do not "understand" text.

2.2 A sample question from trec-8. Questions are contributed by participants or FAQFinder logs. Answers are text spans selected by nist annotators from news articles.
2.3 In squad (Rajpurkar et al., 2016), annotators read a passage from Wikipedia and write several questions. Answers to squad questions are text spans such as "Colorado Desert." Despite the qa in its name, squad is better thought of as a rc dataset since questions are context-dependent.

2.4 Examples from nq (Kwiatkowski et al., 2019) use real user queries from Google search and annotate them with long and short answers.

2.5 squad's questions are influenced by the distribution of Wikipedia pages, the set chosen to be part of squad, and the text of the question. We call this generative process type context-first to emphasize that the context (paragraph in this case) directly influences the question.

2.6 In question-first qa, the question is created independently of any particular context (paragraph) or answer. This type of qa is most prevalent in information-seeking tasks like web search.

2.7 Answer-first questions are created by first thinking of a desired answer (or at least topic area), then crafting a question with that answer.

2.8 Deep averaging networks, recurrent neural networks (e.g., lstms and grus), and transformer networks (e.g., bert) are common architectures in neural nlp models. To produce text representations of a fixed size (i.e., not a sequence of vectors), each architecture takes a different approach. Deep averaging networks average word vectors and then pass the subsequent vector through one or more feedforward layers; recurrent models often concatenate the last hidden state in each direction; transformer networks usually have a special token associated with the input's general representation (in bert, the cls token).

2.9 The original qb interface and a popular modern interface for playing qb online. Both interfaces reveal questions word-by-word until a player interrupts the system and makes a guess.

3.1 An example of information-seeking dialog that the Curiosity dataset aims to support. Assistants should answer user questions and convey information that inspires meaningful followup questions.

3.2 We sample pre-existing knowledge by asking users to indicate which topically related entities they already know. The assistant paraphrases facts related to either known entities (rooted facts), an aspect (aspect facts), or the topic generally (general facts). The user expresses engagement through a like button. Dialog acts are annotated in a separate crowd-source task.

3.3 In this example, the user is assigned to learn about Lesotho, specifically its culture and history. Given their topic, users indicate which related entities (here, entities related to Lesotho) are familiar. Related entities range from relatively common, like the United States, to lesser known, like Basutoland. We also provide guidelines and videos before crowd-workers start working on the task.

3.4 The user expresses the "interestingness" of the assistant's messages through a "like" button (right of message). This is one of the two ways that we estimate user satisfaction.

3.5 The assistant can incorporate any number of facts into their reply to the user.
Their goal is to answer the user's immediate questions and anticipate what information they would be most interested in.

3.6 User engagement is measured by dialog act followups (left) and like button usage (right). We compare reactions to messages that use a fact mentioning an entity the user knew about (rooted) and whether the fact is general or aspect-specific. Pairwise differences are statistically significant (99%+) with a two-proportion z-test, except for dialog act followups between rooted and non-rooted general facts. Overall, users prefer on-aspect, rooted facts.

3.7 Architecture: charm builds a dialog context up to t = i - 1 to predict the current message's dialog acts (policy prediction) and the best facts to use. The model uses this, combined with the current utterance, to classify its dialog acts and whether it will be liked.

4.1 Difficulty and Ability Discriminating (dad) leaderboards infer the difficulty, discriminativeness, and feasibility of examples. Negative discriminability suggests an annotation error; for example, the question with most negative discriminability asks "Why did demand for rentals decrease?" when the answer is "demand for higher quality housing increased."

4.2 A dad leaderboard uses irt to jointly infer item difficulty β_i, discriminability γ_i, feasibility λ_i, and subject skill θ_j. These predict the likelihood p_ij(r_ij = 1) of a correct response r_ij.

4.3 In irt, the Item Characteristic Curve describes the probability that a specific item with difficulty β will be answered correctly as a function of skill θ. Visualizing the parameters is helpful in a few ways. First, it shows that high discriminability leads to larger differences in the maximal and minimal correctness probability (for a given skill range), and the tangent line visually links discriminability to the slope. Second, it clearly shows the maximal probability when feasibility is greater than zero. Lastly, it demonstrates that the inflection point is at the point where difficulty and skill equal out (in this case, zero).

4.4 We compare each irt and linear model (lm) by how well they predict subject responses. We focus on roc auc since predicting responses is an imbalanced classification problem (most subjects are correct). Under that metric, all irt models improve over the best lm, and the strongest lm ablation only uses irt features. That textual features are predictive in the lm suggests they could improve future models.

4.5 Compared to the final ranking over a large test set, how well does a small test set correlate? The left shows correlation between mutually exclusive development set samples and the right between development samples and the full test set. In both experiments (panes), ranking systems by irt ability is more stable, across all sample sizes, than mean accuracy and thus more reliable (Kendall's rank correlation is higher). Bands show 95% confidence intervals of rank correlations across ten trials per sample size.

4.6 P-values of the rank correlation difference for each sample size and trial in Figure 4.5. The inherent noise in dev set sampling makes inferring significance difficult (left); test-set-driven results (right) are more significant.
4.7 Suppose we need to cold start (collect annotations for a new subject): what order would most rapidly increase correlation to the test data? As we expect, the correlations eventually converge, but with little data, irt has better correlation than other methods. We suspect that the irt information underperforms early on when the subject ability estimate is unstable.

4.8 This question is regarded as infeasible by the irt model. Upon further inspection, the answer omits five acceptable answers but, more importantly, does not permit all combinations of Turing machines.

4.9 We partition evaluation data by irt difficulty and discriminability with accuracy in each quartile. Most improvements in high-accuracy systems come from getting high-difficulty questions right. Items with low discriminability (and thus prone to annotation errors) are difficult for all subjects except the overfit args-bert model. We include top-performing squad subjects, several notable subjects (systems), and a pair from the bottom of the leaderboard.

4.10 We annotate squad items by discriminability, difficulty, and irt prediction errors. For example, one question with negative discriminability was classified as "Wrong" with the explanation that the annotated answer indicates it is not answerable, but the question actually is answerable. Items with negative discriminability or where irt's prediction is wrong have a much higher rate of annotation error ("Flawed" or "Wrong"). Using similar methodology, errors in datasets could be more rapidly identified.

5.1 qb is a trivia game where questions begin with clues that are initially difficult, but become progressively easier until a giveaway at the end of the question. Players answer as soon as they know the answer, so the earlier they answer, the more knowledgeable they are. For example, answering after the first sentence indicates the player recognizes the librettist (Emanuel Schikaneder) and knows that they played Papageno in The Magic Flute (Die Zauberflöte). Answering at the end of the question only requires surface knowledge of Mozart's opera works.

5.2 Trivia has gone from a laid-back pastime to an organized, semi-professional competition format. The qb framework in particular, which arose from College Bowl (us) and University Challenge (uk), emphasizes fairness and the ability to discover the better question answerer. As organizations such as the Academic Competition Federation and National Academic Quiz Tournaments emerged, the format has focused on academic, well-run tournaments.

5.3 Our interface and a popular modern interface for playing qb online. Both interfaces reveal questions word-by-word until a player interrupts the system and makes a guess.

5.4 Size of question answering datasets. Questions in the qanta dataset have longer sentences than any other dataset. The instances from SimpleQuestions, SQuAD, and triviaqa are comparatively short, which makes it less likely that they are as diverse as qb or Jeopardy!. For each dataset, we compare the lengths of questions rather than paired context paragraphs; to avoid the histogram being overly skewed, we remove the top 5% of examples by length from each dataset.
5.5 Questions in qb cover most if not all academic topics taught in school, such as history, literature, science, the fine arts, and social sciences. Even within a single category, questions cover a range of topics. Topically, the dataset is biased towards American and European topics in literature and history.

5.6 Distribution of wikidata.org answer types ("instance of" relation) further broken down by category. Most answers have matching types and reference a person, literary work, or geographic entity. Among these types, there is a good balance of answers spread across literature, history, fine arts, and science. Answer types with only one category are largely self-explanatory (e.g., mythological answer types to the mythology category). The special category "NOMATCH" contains answers without a matched type, and similar types are merged into larger categories.

5.7 Left: each protobowl user is represented by a dot, positioned by average accuracy and buzzing position; size and color indicate the number of questions answered by each user. Right: distribution of number of questions answered, accuracy, and buzzing position of all users. An average player buzzes with 65% of the question shown, and achieves about 60% accuracy.

5.8 The qanta framework for playing Quiz Bowl with semi-independent guesser and buzzer models. After each word in the input is revealed, the guesser model outputs its best guesses. The buzzer uses these in combination with positional and gameplay features to decide whether to take the buzz or wait action. The guesser is trained as a question answering system that provides guesses given the input text seen so far. Buzzers take on dual roles as calibrators of the guesser confidence scores and cost-sensitive decision classifiers by using the guesser's score, positional features, and human gameplay data.

5.9 All our neural models feed their input to an embedding function, then a composition function, and finally a classification function. The primary variation across our models is the choice of composition function used to compute a fixed, example-level representation from its variable-length input.

5.10 We plot the expected wins score with respect to buzzing position (solid dark blue). For the ten most played questions in the buzztest fold, we show the empirical distribution for each individual question (dotted lines) and when aggregated together (solid light blue). Among the most played questions, expected wins over-rewards early buzzes, but appropriately rewards end-of-question buzzes.

5.11 The bert and ir models are mostly wrong or correct on the same subset of questions. At the end of the question, most of the questions the bert model is correct on, the ir model is also correct on.

5.12 A test question that was answered correctly by all models after the first sentence, a normally very difficult task for both humans and machines. A very similar training example allows all models to answer the question through trivial pattern matching.

5.13 Only the rnn model answers this question correctly.
To test the robustness of the model to semantically equivalent input modifications, we use sears-based (Ribeiro et al., 2018) synonym attacks and cause the model prediction to become incorrect. Although this exposes a flaw of the model, it is likely that the low confidence score would lead a buzzer model to abstain; this highlights one benefit of implicitly incorporating confidence estimation into the evaluation.

5.14 The distribution of training examples per unique answer is heavily skewed. The most frequent answer (Japan) occurs about 100 times. Nearly half of the questions have one training example and just over sixty percent have either one or two training examples.

5.15 The more an answer is asked in the training set, the easier it is for all models, both at the start and end of the question. This is a significant source of errors since accuracy on at least 50% of test questions (those with seven or fewer training examples) is significantly lower for all models.

5.16 Although the answer to this question is Spain, many of the terms and phrases mentioned are correlated with the United States. Thus, the rnn model answers United States instead of the correct answer Spain. This is one of many examples where the model answers with the correct answer type (country), but the incorrect member of that type.

5.17 In this question, the Threshold and mlp buzzers are too aggressive and buzz before the guesser's answer is correct. In contrast, the rnn is more conservative and buzzes shortly after the optimal point, which is, by a wide margin, still earlier than the earliest (correct) human buzz.

5.18 Comparing buzzers' behavior over time against the optimal buzzer. The red crossed area and dotted blue area combined indicate when the buzzer thinks that the guesser is correct; the other two combined, when the buzzer thinks the guesser is wrong. The red (crossed) and orange (unhatched) areas combined indicate when the buzzer matches the optimal buzzer. Our goal is to maximize the red areas and minimize the blue areas. The static threshold baseline is overly aggressive, especially at earlier positions in the question (large dotted blue area); mlp and rnn both behave reasonably well, and the aggressiveness of rnn is slightly more balanced early on in the question.

5.19 Breaking down the buzzer's performance at the individual question level. Impossible question means there is nothing the buzzer can do to beat the opponent. It is clearer here that rnn performs better than mlp, making fewer overly aggressive mistakes.

5.20 The growth of the qanta dataset in number of questions and number of distinct answers over the past twenty years, starting in 1997. The dataset has grown by at least 5,000 questions every year since 2010. All questions with matched answers are included, and we construct the plot by using the tournament year of each question. Independently, participation in qb (and thus the number of students writing questions) has roughly doubled every year since 2008.

6.1 Adversarial evaluation in nlp typically focuses on a specific phenomenon (e.g., word replacements) and then generates the corresponding examples (top).
Consequently, adversarial examples are limited to the diversity of what the underlying generative model or perturbation rule can produce and also require downstream human evaluation to ensure validity. Our setup (bottom) instead has human-authored examples, using human-computer collaboration to craft adversarial examples with greater diversity.

6.2 This question is easily answered by a model after seeing the reference to "Un Bel Di." Our adversarial writing process highlights terms like these, which humans can then modify to make clues more challenging for computers.

6.3 The author writes a question (top right), and the qa system provides guesses (left) and explains why it makes those guesses (bottom right). The author can then adapt their question to "trick" the model.

6.4 The first round of adversarial writing attacks the ir model. Like regular test questions, adversarially authored questions begin with difficult clues that trick the model. However, the adversarial questions are significantly harder during the crucial middle third of the question.

6.5 The second round of adversarial writing attacks the ir and rnn models. The questions targeted against the ir system degrade the performance of all models. However, the reverse does not hold: the ir model is robust to the questions written to fool the rnn.

6.6 Humans find adversarially authored questions about as difficult as normal questions: rusty weekend warriors (Intermediate), active players (Expert), or the best trivia players in the world (National).

6.7 The accuracy of the state-of-the-art Studio Ousia model degrades on the adversarially authored questions despite never being directly targeted. This verifies that our findings generalize beyond the rnn and ir models.

6.8 The interpretation successfully aids an attack against the ir system. The author removes the phrase containing the words "ellipse" and "parabola", which are highlighted in the interface (shown in bold). In its place, they add a phrase which the model associates with the answer Sphere.

6.9 The Question Length and the position where the model is first correct (Buzzing Position, lower is better) are shown as a question is written. In (1), the author makes a mistake by removing a sentence that makes the question easier for the ir model. In (2), the author uses the interpretation, replacing the highlighted word (shown in bold) "molecules" with "species" to trick the rnn model.

6.10 A failed attempt to trick the neural model. The author modifies the question multiple times, replacing words suggested by the interpretation, but is unable to break the system.

7.1 The ranking view and landing page of leaderboards should convey that the community values multiple types of progress. By highlighting the models according to different metrics, the ranking view de-emphasizes the importance of any single metric and encourages more thought about what the concept of "best model" means. Lastly, rather than highlight only the highest-scoring models, we shift towards highlighting clusters of comparable models; for example, grouping all models whose scores are not, statistically speaking, different.
7.2 In mrqa, tsne shows a relationship between whether the task is NarrativeQA and the multidimensional difficulty and discriminability. The multidimensional irt model uses six dimensions to match the six mrqa tasks.

A.1 The user is assigned two aspects about their topic. After they are satisfied with what they have learned about the first aspect, they click a button and switch to the next aspect. While the button click is not communicated to the assistant (the user must send a corresponding message), it resets the fact contextualizer; we observe that without this, too many facts were related to the previous aspect.

A.2 A short topic description is always visible to the assistant. The goal is to ensure the assistant always has a general understanding of the dialog topic.

A.3 To annotate dialog acts, we develop an interface that shows each utterance on a separate line. Annotators assign zero or more dialog acts to each utterance using grouped dropdowns.

B.1 The example from squad with the lowest discriminability. Surprisingly, it had a negative discriminability, implying that the less skilled a subject is, the more likely their response is to be correct.

B.2 This example shows that the answer span is likely too large, causing models to fail in both squad's exact match and F1 metrics.

B.3 This highly discriminative question succeeds because there are many plausible answers. For example, although only "Turkish forces" is correct, some models answer "the Armenian state."

B.4 The feasibility parameter λ of our irt model represents the probability that an example is unsolvable. For example, annotation error could lead to an example always being scored incorrectly, regardless of how good the model is. In squad 2.0, λ < .434 at the 5% percentile, λ < .698 at the 7.5% percentile, and λ < .931 at the 10% percentile.

B.5 In squad, tsne shows a relationship between mean exact match (item accuracy) and answerability with respect to multidimensional difficulty and discriminability.

Chapter 1: Introduction

1.1 Overview

Asking questions, as in education and trivia games, is a powerful means of both learning and testing for understanding. The ability to ask questions combines aspects of intelligence unique to humans: language understanding, knowledge representation, and reasoning. Thus, building systems capable of intelligent question answering (qa) is a grand goal of natural language processing (nlp).

Humans primarily ask or answer questions for one of two purposes: to learn new information or to test the understanding of the answerer. These two purposes are the foundations of two (mostly) dichotomous, dual, and dueling perspectives in the evaluation of machine qa: the Cranfield paradigm (§2.1.1) and the Manchester paradigm (§2.1.2). In the Cranfield paradigm (Voorhees, 2002b), satisfying the user of a machine qa system is the central goal, regardless of how that is achieved. In contrast, the Manchester paradigm (Winograd, 1972; Levesque et al., 2011) emphasizes that evaluations should specifically test for intelligent behavior, like its progenitor, the Turing Test (Turing, 1950).
Although the choice of goal is rarely explicitly stated, that choice deeply shapes qa evaluations. This thesis begins by exploring how this choice influences evaluation and then turns to improving qa evaluations that test for intelligent behavior.

In qa, the dominant evaluation approach collects a large set of questions, pairs questions with correct answers, and automatically scores machine responses against reference answers (Voorhees, 2000b). The Cranfield experiments (Cleverdon, 1967) laid the foundation for modern information retrieval (ir) research by demonstrating that systems scoring higher on automatic evaluations correspond to systems with higher user satisfaction. Consequently, the critical link in information-seeking scenarios is to ensure that reference answers satisfy users. This evaluation approach evolved to include comparisons to human effectiveness (Najberg, 2018) and was augmented by the construction of qa datasets with hundreds of thousands of questions. When combined with machine learning and eventually deep learning, this led to feats like ibm Watson's 2011 Jeopardy! victory over multi-time champions Brad Rutter and Ken Jennings (Ferrucci et al., 2010). However, an abundance of evidence has shown that while we can build "systems with very impressive performance that are nonetheless idiot-savants" (Levesque, 2014). In essence, the fact that one correctly answers "who discovered Hawking radiation?" does not necessarily imply intelligent qa (for example, answering with the person whose Wikipedia entry has the highest word overlap with the question). The crucial oversight is that the same approach taken to evaluate user satisfaction cannot be used to draw conclusions about intelligent behavior, because the two have fundamentally different, although not unrelated, goals. Just as teachers ask questions to test student comprehension (Mehan, 1979), if our goal is to quantify progress towards intelligent qa, we should create tests that better measure that progress. A central idea throughout this thesis is improving discriminative power: the ability to distinguish between better and worse systems. We improve the discriminative power of qa benchmarks by introducing dynamic elements into their scoring, format, and data. Combined, our methods improve qa benchmarks so that they better measure progress towards intelligent qa.

1.2 Conversational Information-Seeking

Figure 1.1: An example of information-seeking dialog that the Curiosity dataset aims to support. Assistants should answer user questions and convey information that inspires meaningful followup questions. (U: "[...], tell me about Tahiti." A: "It's the largest island in French Polynesia, near the center of the Pacific." U: "What is its history with France?")

We begin our exploration of the Cranfield paradigm in qa with perhaps the most natural form of information-seeking: conversation (Solomon, 1997). Information-seeking is naturally an interactive and iterative evolution of both the user's information need and the expression of that need (Belkin et al., 1995). As partners in this information dance, machines should guide users toward resolving their information need (Radlinski and Craswell, 2017), even if they may not know precisely what they are looking for a priori (Raman et al., 2014). Towards the goal of satisfying information needs, the machine may be thought of as striving to be a clever and helpful librarian who converses with humans to develop and address information needs (Culpepper et al., 2018).
Ultimately, the machine will be evaluated by whether, in the course of its dialog, it met the user's information need. Conversational information-seeking sits at a convergence point between ir and nlp: ir researchers seek to make their systems more conversational, while nlp researchers seek to make dialog systems more informative (Gopalakrishnan et al., 2019). Chapter 3 introduces a large resource for conversational information-seeking, an area that until recently lacked large resources despite great interest (Dalton et al., 2020). We introduce the Curiosity dataset (Figure 1.1), which has four features that combined make it unique: (1) it is designed for asymmetric conversational information-seeking (i.e., user-assistant interaction), (2) each dialog message is (possibly) paired with supporting documents, (3) each message has explicit user feedback, and (4) messages are annotated with dialog acts which are subsequently used to infer implicit user feedback. We validate both forms of user feedback by re-discovering a well-known principle in human learning: that novel information should be rooted in pre-existing knowledge (Chaiklin, 2003). In the spirit of the Cranfield paradigm, this work devises multiple ways to measure user satisfaction. Having seen one way to evaluate answers to questions, we shift to evaluation under the Manchester paradigm.

1.3 Questions are Not Equally Informative of Ability

A major goal of comparative evaluations is to reliably distinguish between better and worse models. In nearly all qa evaluations, examples are weighted equally (most evaluations average scores, which implicitly assigns equal weight to each individual score), so they are implicitly assumed to be equally helpful in discriminating between models of disparate ability. However, this implicit foundational assumption is trivially falsifiable: datasets have erroneous questions that should not be weighted the same as good questions; even in a dataset free of errors, questions vary in their capability to discriminate between the knowledge of two players. For example, if the final sentence of Figure 1.2 were used to quiz world-class trivia players, it would do a poor job of discriminating between the players since the clue is too easy; in contrast, if only the first sentence were used, it would do a far better job. Similarly, if we tested novice players, the situation would be reversed: the more difficult clue would likely not be answered by either player. The capability to infer skill therefore depends on both the true skill of players and the intrinsic difficulty of questions.

Thus, questions are not equally informative, and we should change our evaluations to account for this. This precise problem was first encountered during the development of educational and psychological tests in the 1950s and 1960s (Traub, 1997). In the terms of the educational testing literature, current methods for qa evaluation (e.g., averaging scores) correspond to Classical Testing Theory (Edgeworth, 1888, ctt), which has largely been replaced by Item Response Theory (Lord et al., 1968, irt) for the reasons discussed. Chapter 4 applies irt to the de facto standard method for measuring progress in qa and many nlp tasks (Wang et al., 2019a): the leaderboard. We show that jointly modeling the abilities of models and the difficulties of examples produces more reliable rankings by dynamically down-weighting the importance of less discriminative questions.
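To make the contrast with equal weighting concrete, the sketch below shows a two-parameter logistic (2PL) item response model in Python. It is a toy illustration with hypothetical abilities and item parameters, not the exact model fit in Chapter 4.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL item response model: probability that a subject with ability `theta`
    answers an item with discrimination `a` and difficulty `b` correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# Two hypothetical models of differing ability answering two items of equal
# difficulty but different discrimination.
weaker, stronger = -0.5, 0.5
flat_item = p_correct(np.array([weaker, stronger]), a=0.25, b=0.0)
sharp_item = p_correct(np.array([weaker, stronger]), a=2.0, b=0.0)
print(flat_item)   # ~[0.47, 0.53]: barely separates the two models
print(sharp_item)  # ~[0.27, 0.73]: the discriminative item separates them clearly
```

Fitting abilities and item parameters jointly from a leaderboard's full response matrix, rather than averaging, is what allows the evaluation to down-weight uninformative items.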
Just as importantly, the joint model identifies poor test questions and efficiently guides annotation efforts, which we incorporate into our discussion of future work (Chapter 7). Next, instead of increasing discriminative power through dynamic scoring, we do so with an intrinsically more discriminative qa format.

1.4 Incremental QA for Polytomous Evaluation of Knowledge

Figure 1.2: qb questions are multi-sentence and reference uniquely identifiable answers (e.g., the Apollo Program). They begin with difficult clues and steadily progress to easy "giveaway" clues by the last sentence. In this question, the early clues are unknown to most while the final clue is known to most school-aged children. Here we show human gameplay data where players answered incorrectly (✗) and correctly (✓); my (correct) guess, made with above-average knowledge of this topic, is also marked. Question: "One part of this program included Stuart Roosa carrying redwood seeds. A part of this program failed in its mission to reach (✗) Fra Mauro, and the final part of this program brought geologist Harrison Schmitt to the Taurus-Littrow site. There was no second or third (*) mission in the commonly used numerical sequence for this program, which began with a mission that, in a test, killed Roger Chaffee, Ed White, and Gus Grissom in a (✗) fire. James Lovell was the hero (✓) of another failed mission in this program. Saturn (✓) V rockets were used to propel modules into space (✓) in, for 10 points, what (✓) NASA program whose eleventh mission landed Neil Armstrong on the moon?" Answer: Apollo Program.

Trivia games are a popular means of testing knowledge and qa ability in humans and machines alike. This thesis considers two types of tests: dichotomous evaluations, where answers are entirely correct or wrong, and polytomous evaluations, where partial credit is given (e.g., one point out of five). Like most qa evaluations, the trivia game Jeopardy! dichotomously tests the knowledge of participants: only one player is awarded points, and players are only tested at a single point, the end of the question. As a consequence, when multiple players know the answer, Jeopardy! can be characterized as "kind of . . . a crappy video game where it's, like, light goes on, press button, that's it" (Malone, 2019). Thus, even if players have differing knowledge of a question's topic, they often have no way to display their expertise aside from impressively quick (jedi) reflexes (Harris, 2006); in the Watson match, the reflexes of mere mortals could not match a computer's, contributing to their demise. Putting the buzzer aside, the problem is that the only outcome which distinguishes between players of different ability is when one answers correctly and the other answers incorrectly. In all other cases, we have no information about which player knows more. If we aim to distinguish between differing levels of qa ability, then dichotomous tests are an inefficient means of doing so.

Chapter 5 addresses this oversight while simultaneously incorporating two mechanisms that bring it in line with the Manchester paradigm. For this task, we turn to Quizbowl (qb), a decades-old trivia game still played annually by over fifty thousand students across the world (National Academic Quiz Tournaments, 2020). In qb, each question tests knowledge incrementally by checking for knowledge of difficult clues first and easier clues last.
This property of questions, called pyramidality, is implemented by creating multi-sentence questions where each subsequent sentence contains clues easier than the previous ones (Figure 1.2). As in Jeopardy!, there is also a buzzer, but with one critical difference: it can be used at any point instead of only after the full question has been shown. In effect, qb is a polytomous evaluation (responses can earn partial credit rather than being scored only correct or wrong), with points deducted for every additional word needed to answer. This design, combined with the pyramidality of qb questions, makes it easier to distinguish between novices, dilettantes, weekend warriors, and goats (the Greatest of All Time).

After introducing qb, Chapter 5 builds a framework that uses the trivia game in a Manchester-style qa evaluation. Specifically, we argue that evaluation for the purpose of evaluating intelligent behavior is improved by the pyramidality of questions, by implicitly requiring models to "know what you don't know", and by a metric that requires stability in guesses. To test this approach, we create a (deep-learning based) computer system for playing qb where one sub-system provides candidate answers while another decides when to buzz with that answer. We put this system (and our evaluation) to the test through a multi-year set of exhibition matches against accomplished trivia players; as in Jeopardy!, our system defeats even the most accomplished players despite using only statistical pattern matching. Upon closer inspection, although the format of qb is an improvement over prior qa evaluations, properties of the data still make it easy for clever, yet cheap, statistical tricks to win the day.

1.5 Crafting Robust Questions Through Cooperative Machine-Human Authoring

Although answering a question to the satisfaction of a user may be sufficient in the Cranfield paradigm, if that question was answered through flawed reasoning, then in the Manchester paradigm the test taker is not displaying intelligent behavior and should fail the test. For example, answering "who discovered Hawking radiation?" with Stephen Hawking on the basis of word co-occurrence statistics on Wikipedia is not sound; similarly, answering Figure 1.2's question on the basis that the answer to any question with the phrase "in this program" is the Apollo Program is also flawed and not how humans answer questions. The challenge then is "can [we] find questions where cheap tricks like this will not be sufficient to produce the desired behavior?" (Levesque, 2014). Levesque proposes three ways to make questions (more) robust to these tricks: make questions "Google-proof, . . . avoid questions with common patterns, . . . [and] watch for unintended bias" like word order betraying the answer. While following these recommendations is no panacea, it increases the likelihood that a machine that correctly answers these questions is doing so through intelligent means.

In Chapter 6, we are the first to show that the same machines that used these tricks to best accomplished qb players are also useful for helping humans find and eliminate these shortcuts. At first, it may seem paradoxical that the "idiot-savant" models are useful for making questions more robust tests of intelligent qa. However, if we know, a priori, that a model explicitly uses pattern matching to
answer, then we can use it to make it more difficult for pattern matching alone to derive the correct answer. Concretely, Chapter 6 designs an interface where humans and machines cooperatively and iteratively craft questions together: the machine identifies key phrases that are statistically predictive of the answer while the human creatively rephrases them. The output of this human-machine collaboration is a new set of questions that, by construction, are more difficult to answer through naive pattern matching alone and therefore, if solved, are more likely to indicate that the answer was derived through intelligent means.

1.6 Roadmap

In this thesis, we create methods that improve the format, data, and scoring frameworks in qa evaluations. Prior to discussing our methods, we first identify the Cranfield and Manchester paradigms as two influential perspectives in qa evaluations and show how they affect the form and data of evaluations (Chapter 2). In the subsequent chapter, as an example of the Cranfield paradigm, we introduce the Curiosity dataset along with a new idea for dialog qa evaluation (Chapter 3). From here, we turn to Manchester paradigm evaluation and begin by showing how lessons from educational testing can help us dynamically change the influence of individual examples (Chapter 4). Where the prior chapter held the format and data static, Chapter 5 changes the qa format to one that is intrinsically more discriminative. We use this same qa format in Chapter 6 to introduce a framework for cooperative human-computer question authorship that results in more robust evaluation data. Finally, Chapter 7 takes stock of the current state of qa evaluation and builds a broader vision that combines the ideas of this thesis.

Chapter 2: Background

We can only see a short distance ahead, but we can see plenty there that needs to be done.
Alan M. Turing, 1950

This chapter provides background on the epistemological heritage of question answering research (§2.1) and datasets (§2.3), and on modern methods for question answering (§2.4). In discussing qa's heritage, we more precisely define the origin and definition of the Cranfield and Manchester paradigms. Afterward, we compare several common qa formats. Using this new vocabulary, we analyze and categorize qa datasets by their purpose, how they were created, and other attributes that will help contextualize our dataset contributions in Chapters 3, 5, and 6. Following this, we describe modern (neural) methods for qa that we use throughout this thesis.

2.1 The Epistemological Heritage of Question Answering

Although machine qa is a comparatively young area of study, reviewing its history is helpful for understanding where the field has been, where it is going, and how to avoid repeating mistakes. In this section, we first compare the Cranfield paradigm (§2.1.1) and the Manchester paradigm (§2.1.2), which are each motivated by related yet independent goals. The Cranfield paradigm is driven by the "goal of [information retrieval] research [which] is to improve access to information" (Croft, 2019). From this perspective, qa is another way to satisfy human information needs. The Manchester paradigm's goal is to create qa systems that exhibit intelligent behavior. Neither is inherently better, but regardless, the choice shapes the type of systems built. Far before the rise of big data, the exponential growth of computational resources, and deep learning, Schwitter et al.
(2000) argue that: There is general agreement that these competitive evaluations had a striking and beneficial effect on the performance of the various systems tested over the years. However, it is also recognized (albeit less generally) that these evaluation experiments also had the,[sic] less beneficial,[sic] effect that the participating systems focussed increasingly more narrowly on those few parameters that were measured in the evaluation, to the detriment of more general properties. Arguably, this phenomenon has repeated itself (Dotan and Milli, 2020) with deep 7 learning, and it is only recently that other goals like robustness, fairness, or efficiency have been given attention (Paullada et al., 2020). Thus, we begin our review with the Cranfield paradigm, which paved the way for using large-scale static evaluations as a proxy for user satisfaction. 2.1.1 The Cranfield Paradigm In information retrieval, the main task is to use a user?s stated information need (query) to select relevant information (documents) to show them. The most natural method to evaluate ir systems then would be to simply ask the user if the returned documents satisfy their information need. However, much like anno- tation in nlp this is expensive and time-consuming, so Cleverdon (1967) propose an alternative in the Cranfield experiments.1 Rather than have users interact with systems, test collections are built, and all systems are evaluated on this collection. The collection has four parts: (1) a representative list of user queries, (2) a large set of documents, (3) per-query relevance judgments indicating which documents were relevant to each query, and (4) a means to aggregate per-query scores. Instead of putting humans in the loop with ir systems, ?in the Cranfield paradigm, researchers perform experiments on test collections to compare the relative effectiveness of dif- ferent retrieval approaches? (Voorhees, 2002b). The Cranfield paradigm makes three key assumptions: the relevance of one document is independent of all the others, a single set of judgments is representative of the user population, and all relevant documents are annotated (Voorhees, 2019). At first glance, it would seem to be that every assumption is violated with real users: they do not like repeated information, users may differ in judging relevance, and it is infeasible to annotate every query-document pair unless the document collection is small (Cleverdon, 1991). Indeed, in the time of the Cranfield experiments these arguments made test collections seem like a radical proposal (Taube, 1965).2 Despite this, part of the success of the Cranfield paradigm and subsequently trec?s success is attributable to great care in aligning the core task with the evaluation (Jones, 2001); for example, repeated information does not matter if the user only requires at least one relevant document. As ir systems became more effective at retrieving documents to sort queries, researchers started working on extensions that returned short answers instead of whole documents (Sanderson and Croft, 2012). Despite the change in output for- mat, the core goal mirrors that of ir: return relevant information?in the form of a short answer?to the user. For example, in the first large-scale evaluation of domain-independent qa systems (trec-8) ?human assessors read each string and made a binary decision as to whether the string actually did contain an answer to the question? (Voorhees, 2000b). 
With this history lesson in mind, it should be clear by now that although this is a reasonable means to evaluate how effective a 1The Cranfield experiments and thus the Cranfield paradigm are named after where they were conducted: Cranfield University in Cranfield, Bedfordshire, United Kingdom. 2This was in part due to its controversial but ultimately correct finding that word-based indexing of documents was effective. 8 qa system is for users, there is (rightfully) no focus on evaluating whether the sys- tem is behaving intelligently. Despite this, system effectiveness on qa benchmarks have been repeatedly used to claim super-human?and presumably ?intelligent?? capabilities (Najberg, 2018). Next, we formalize an alternate paradigm that better aligns with the goal of testing for intelligent behavior than the Cranfield paradigm. 2.1.2 The Manchester Paradigm The Manchester paradigm is an intellectual descendant of the Turing Test (Tur- ing, 1950). Just as the Cranfield paradigm is named for Cranfield University, we name the Manchester paradigm after where Alan Turing created the Turing Test: the University of Manchester. Rather than consider the ill-defined question of ?can machines think?? Turing proposed that we instead consider the imitation game: a test of whether a machine could fool an interrogator into thinking it was a human. The eliza (Weizenbaum, 1966) dialog system is the first well known, qa- like system to have ?passed? a weak form of the Turing Test; related systems like parry (Colby, 1981) eventually followed. eliza was intentionally built to ?dazzle even the most experienced observer? while still being a ?mere collection of proce- dures.?3 Consequently, the lesson from eliza is that ?we often read into a program?s behavior our own ideas of what it understands? (Winograd, 1977) which is some- times referred to naive psychology (Watt, 1996); Caporael (1986) also point out this anthropomorphization of machines. In the paper describing eliza, Weizenbaum makes the point that there are limits to eliza?s ?understanding,? since a crucial component of understanding ?is not the subject?s ability to continue a conversation, but to draw valid conclusions from what is being told.? We posit that Terry Winograd?s shrdlu qa system (Winograd, 1971) is a response to this criticism. Winograd viewed shrdlu as a ?system [that] attempts to integrate all the aspects of language, in combining linguistic knowledge with a command of the world being discussed? (Winograd, 1977). This system builds an internal model of a toy block world that is manipulated through natural language; in this way, shrdlu draws valid conclusions about the toy block world. In a similar spirit, a core goal of the Manchester paradigm is to test?through a natural language interface?if qa systems can manipulate a model of the real world to derive correct answers. There is, however, a crucial difference in the feasibility of testing for user satisfaction in the Cranfield paradigm and testing for intelligent behavior in the Manchester paradigm. Namely, the main challenge that the Cranfield paradigm addresses is one of cost: it is simply not practical to evaluate every new system with humans. 
In contrast, the Manchester paradigm must contend with the reason for the Turing Test in the first place: that we do not know how to define whether 3Other early qa systems included baseball which manipulated a knowledge based of foot- ball games through natural language (Green et al., 1961) which inspired later work in querying databases (Copestake and Jones, 1990), lunar which helped Apollo Program scientists find spe- cific Moon samples matching certain traits (Woods, 1972), and student that solved simple word algebra problems by parsing numerical expressions (Bobrow, 1964). 9 a system is intelligent, in large part because defining intelligence is at minimum exceptionally challenging. Computer scientists like Turing were not the first to struggle with defining, understanding, and measuring intelligence: this is a long-standing challenge in fields like psychology. While the nature of intelligence is highly contested, prominent theories include the g factor theory which postulates that there is a single underlying general intelligence (Spearman, 1904), theories that postulate multiple independent axes of intelligence such as primary mental abilities (Thurstone, 1973) or multiple intelligences (Gardner, 2011, p. 277),4 and a triarchic theory based on the idea that human intelligence is a combination of componential, experiential and practical components such as ?mental activity directed toward purposive adaptation to, and selection and shaping of, real-world environments relevant to one?s life? (Sternberg, 1985). Of these, the g factory theory took strongest hold and was arguably central in the creation of psychometrics: a field studying the measurement of intelligence in humans (Section 2.6). While we will not opine as to which theory is correct,5 it is not necessary to do so to discuss why Sternberg?s triarchic theory best lights the way when cre- ating evaluations of machine intelligence.6 At its core, the g factor theory posits that manifestations of intelligence?such as demonstrating mathematical or spa- tial reasoning?emerge from a general mental ability (the g factor). The validity of measuring the g factor (i.e., iq) in humans is questionable (Section 2.6),7 and makes even less sense for machines: every system discussed thus far?by construc- tion (e.g., eliza)?does not have an underlying ?intelligent? core. While modern statistical learning methods have certainly pushed state-of-the-art, they do little to change this state of affairs (Bender and Koller, 2020). Thus, at best, approach- ing the evaluation of machine intelligence from this perspective implies measuring a quantity (g factor) that we know does not exist in machines built with current technology. Another family of intelligence theories?such as primary mental abilities or multiple intelligences?posit that intelligence is the combination of several abilities such as numerical, reasoning, musical, and bodily-kinesthetic ability. While these abilities map onto machines with varying degrees of success (e.g., bodily-kinesthetic ability is irrelevant to most qa systems), these theories are nonethless helpful for creating a taxonomy of general skills that we should endeavor to make machines capable of. For example, developing machines capable of numerical ability and test- ing for that ability is tractable and an example of what we have referred to in this thesis as an intelligent behavior. However, theories of multiple intelligences still leave much on the table. 
These theories take a narrower skill-based view that omits 4Gardner?s multiple intelligences can be understood as multiple independent abilities like lin- guistic or numerical reasoning. 5This is an empirical question which we do not investigate in this thesis. 6Furthermore, the nature of human intelligence may be different from artificial general intelli- gence (provided of course that agi is achievable). 7Although multiple studies have demonstrated correlation of iq test results with positive out- comes like job success, as the old adage goes, correlation is not necessarily causation. 10 several aspects of intelligence included in the triarchic theory such as creativity (e.g., composing novel questions as in Chapter 6), generalization from experience (e.g., few-shot learning as in Section 5.10.1), adapting and influencing environments (e.g., reinforcement learning), and the ability to learn efficiently. While developing ma- chines capable of achieving particular skills?such as reasoning from known facts?is of definite interest in nlp, this broader view covers these skills8 while describing ad- ditional aspects of intelligence that are already of active scientific interest in nlp, machine learning, and ai. In this spirit, an alternative interpretation of the Turing Test is that it is ?not asking whether all digital computers would do well in the game nor whether the computers at present available would do well, but whether there are imaginable computers which would do well? (Turing, 1950). One way to build towards this goal is to iteratively imagine tasks that a machine should be able to do at least as well as a human, prototyping a machine ostensibly capable of that task, and then testing for that capability. Likewise, we can iteratively imagine aspects of intelligence?such as sample-efficient learning or generalization to novel contexts? that humans regularly exhibit, develop ways to build those in machines, and measure our progress. In this thesis, we refer to both of these manifestations of intelligence as intelligent behaviors. Towards the goal of building machines exhibiting increasingly many intelligent behaviors, descriptive theories and definitions of intelligence are helpful for identifying intelligent behaviors that we can work towards developing in machines. In testing for capabilities, we introduce the second element of Manchester paradigm: we seek not to prove that a machine has a capability representative of a specific intelligent behavior, but to know when it does not so we can work towards building that capability. Through this mechanism, the Turing Test ?represents what it is that ai must endeavor eventually to accomplish scientifically? (Harnad, 1992). For example, one intelligent behavior humans exhibition is being able to answer a question regardless of how it is phrased (Chapter 6). The challenge however comes in testing for this: we cannot tractably test all possible rephrasings. Theory also suggests this intractability: if the imitator model and interrogator are Turing Machines (Turing, 1937), then determining whether the imitator is a machine is undecidable (Sato and Ikegami, 2004). Thus, we avoid positively proving ability or passing the Turing Test since it is likely impossible. A wider view of this point partially explains the ?moving goalposts? phenomenon in ai research where once a grand challenge is solved?like Chess (hsiung Hsu et al., 1995) or Go (Silver et al., 2016)?it is no longer considered ?ai.? 
Instead, under the Manchester paradigm, the process of creating tests, improving tests, and improving models based on test failures guides research. The best and most famous example of the Manchester paradigm is the Wino- grad schema challenge (Levesque et al., 2011). In this qa task, machines answer specially crafted questions with binary answers (Figure 2.1). Examples in the Wino- grad schema challenge always mention two parties (noun phrases), one is referred 8These academically-oriented skills reside in the componential analytical subtheory. 11 Questions Answer The trophy would not fit in the brown suitcase because it was too big. What was too big? trophy/suitcase The trophy would not fit in the brown suitcase because it was too small. What was too small? trophy/suitcase Figure 2.1: The Winograd Schema Challenge (Levesque et al., 2011) examples have two main ingredients. First, a statement containing two noun phrases and a pronoun which could refer to either (?trophy,? ?suitcase,? and ?it? in this example). Second, a single trigger word that when changed between two alternatives (?big? and ?small?) changes which noun phrase the pronoun should resolve to. Well-crafted examples should expose models that do not ?understand? text. to by a pronoun or possessive adjective, the question involves determining the re- ferred noun party, and crucially there is a special word that when replaced by an alternative, flips the answer. Should a machine fail to answer questions like this, we conclude it is not displaying intelligent behavior and is thus not intelligent in at least the same way humans are. The opposite is not necessarily true though, since we only test for a single case and would further need to make the logical inference that the Turing Test tries to avoid: to define intelligence itself as displaying a specific list of intelligent behaviors. The key feature of the Winograd schema challenge, its suc- cessor the Winogrande challenge (Sakaguchi et al., 2020), and Manchester paradigm qa evaluation is to identify specific intelligent behaviors that the evaluation should test for. This distinction of paradigms is a generalization of a point made by Gard- ner et al. (2020b)?that the three broad motivations for qa are to (1) fill human information needs, (2) probe a system?s understanding of some context (Weston et al., 2016), and (3) to transfer learned parameters.9 Specifically, filling human in- formation needs aligns with the Cranfield paradigm and the Manchester paradigm generalizes the idea of probing a system?s understanding of a context to the idea of testing for intelligent behavior which may not necessarily require explicit context at all.10 While we have introduced these as opposing paradigms by contrasting their main goals, there is a wide swath of research that advances the goals of both. 2.1.3 Common Ground Although we present these paradigms as a dichotomy, the world is not black and white, and progress is often shared through common sub-goals. Making models 9Parameter transfer is based on the idea that if the end task follows a qa format, that training on qa-like data may be helpful even if the task itself is not qa. 10For example, answering many semantically equivalent phrasings of the same question. 12 more robust is?to an extent?one of these areas of overlap. For example, Microsoft Research and Bing created the Web Scale Speller Challenge in 2011 (Wang and Pedersen, 2011) to motivate research in making search engines more robust through better handling of spelling errors. 
In concurrence with the Cranfield paradigm, mak- ing search engines robust to mis-spellings ultimately helps users find information faster. From the Manchester paradigm perspective, humans demonstrate an im- pressive robustness to poor spelling, albeit at speed penalty (Rayner et al., 2006),11 so expecting this of machines is quite reasonable. Along similar lines, Contrast Sets (Gardner et al., 2020a) test robustness to syntactic variants of the original questions in multiple datasets. However, there is a limit to this: when robustness works towards the goals of one paradigm, but against the other. Suppose a user wanted to find a particular email, but only could recall a men- tioned word that is otherwise unrelated to the email?s contents. In the abstract, what we really ask is whether we should allow a system to use information that although helpful in solving the task, is not related to any conventional sense of nat- ural language understanding or reasoning. The Cranfield paradigm would certainly welcome this information; in contrast, the Manchester paradigm would reject this usage as one of a larger class of correlations that models should not rely on (Feng et al., 2018). Beyond robustness, these paradigms share interests in areas like few- shot and zero-shot qa: responding to infrequently asked questions is important in practice (Baeza-Yates et al., 2007), and building models that learn more from less certainly qualifies as intelligent behavior (Linzen, 2020). Next, we introduce com- mon qa formats and use this as a means to survey of qa and reading comprehension (rc) datasets. 2.2 Question Answering Dataset Characteristics In addition to a qa dataset?s goals, they are also characterized by the qa format (?2.2.1), how questions are created (?2.2.3), and the context that they assume (?2.2.2). Questions based on specific text passages (e.g., from Wikipedia) will be different from questions that you might ask a digital assistant or search engine; likewise, the form of the answer also influences the data (e.g., multiple-choice versus free response). Ultimately, each of these factors changes the data distribution and the types of models built for the task. 11Davis (2003) summarizes human robustness by saying that: Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn?t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe. 13 2.2.1 Answer Formats All qa formats share three components: the question, an implicit or explicit context, and a desired answer.12 We begin by first looking at the forms that answers take. Although it may seem easy to check the correctness of answers, this is a major challenge in qa and practitioners frequently realize upon inspecting the data that ?there is no such thing as a question with an obvious answer.? (Voorhees and Tice, 2000). Along similar lines, the creators of NaturalQuestions?a qa dataset derived from Google search queries?estimate that one third of (short) answers to questions are ?debatable? (Kwiatkowski et al., 2019).13 In the early days of large-scale qa evaluation with trec (Voorhees, 2000b), human annotators graded each system?s answers, but this was ultimately not feasible as the numbers of questions and systems grew. 
Much like in ir, qa moved to (mostly) automatic methods to evaluate answers from qa systems.

Free Response  Although free response is conceptually the simplest format and relatively straightforward for humans to evaluate, it is the most challenging format to automatically grade. For example, is "Turing" a good enough answer to "who proposed the imitation game?" compared to "Alan Mathison Turing"? Would this change if, instead of a uniquely identifying name like Turing, the answer were a much more common name like "Pedro Rodriguez", which could refer to no less than nine famous athletes, several in the same sports? One approach, taken by the Bing-based qa dataset ms-marco (Nguyen et al., 2016), is to compute textual overlap between reference answers and proposed answers with bleu-1 (Papineni et al., 2002) and rouge-l (Lin, 2004). Unfortunately, these metrics often have unimpressive correlation with human judgments (Callison-Burch et al., 2006; Chen et al., 2019), as in SemEval-2018 Task 11 (Ostermann et al., 2018), especially when single-word differences have significant effects on correctness (Yang et al., 2018a).14 While evaluating free response answers remains challenging, the expressiveness of the format continues to encourage the improvement of scoring metrics through the use of learned metrics (Nema and Khapra, 2018; Chen et al., 2020).

Span Selection  One step back from free response is answering questions using text spans from documents. The first large-scale examples of this evaluation format are trec-qa-8 (Voorhees, 2000b) and trec-qa-9 (Voorhees, 2000a). In this task (Figure 2.2),15 systems answer questions with spans of text from a document in a pre-specified collection.16 This format was later adopted by both versions of the Stanford qa dataset (Rajpurkar et al., 2016, 2018, squad) (Figure 2.3), with the second iteration incorporating abstention as in trec 2001 (Voorhees, 2001). The format has proved popular and is also used by NaturalQuestions (Figure 2.4) and many others.

Figure 2.2: A sample question from trec-8. Questions are contributed by participants or FAQFinder logs. Answers are text spans selected by nist annotators from news articles. (Question: "When will the Voyagers lose contact with Earth?" Answer: "about 2015 or 2020")

Figure 2.3: In squad (Rajpurkar et al., 2016), annotators read a passage from Wikipedia and write several questions. Answers to squad questions are text spans such as "Colorado Desert." Despite the qa in its name, squad is better thought of as an rc dataset since questions are context-dependent. (Page: Southern California. Passage: "To the east is the Colorado Desert and the Colorado River at the border with Arizona, and the Mojave Desert at the border with the state of Nevada. To the south is the Mexico-United States border." Question 1: "What is the name of the water body that is found to the east?" Answer 1: Colorado River. Question 2: "What is the name of the desert on the border of Arizona?" Answer 2: Colorado Desert)

12 In this case, "answer" is more accurately described as a desired response, whether it be multiple answers or abstention.
13 They define correct (but debatable) as: "a reasonable person could be satisfied by the answer; however, a reasonable person could raise a reasonable doubt about the answer."
14 Answers like "the radiation of wireless routers has [an/no] impact on people" have high lexical overlap despite being contradictory.
15 As of writing, we have not lost contact with the Voyagers, although that is expected in the 2020s or early 2030s.
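To make the lexical-overlap scoring used for free-response and span answers concrete, below is a minimal sketch of a token-level F1 in the spirit of the squad-style scorers discussed above. It is an illustrative simplification (no lowercasing, article stripping, or punctuation normalization), not the official evaluation script.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted answer and a reference answer:
    the harmonic mean of token precision and recall."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("Alan Mathison Turing", "Turing"))  # 0.5: partial credit
print(token_f1("Turing", "Turing"))                # 1.0: full credit
```

An exact-match metric instead checks string equality (after normalization), which is why single-word differences can swing scores so sharply.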
Beyond having some of the lexical matching problems as free response answers, the primary disadvantage of span selection is that qa models are limited to using only text present in the provided documents. Multiclass Classification Although free-response and span selection are flexible, one drawback is that sometimes it is unclear which semantic concept is intended (e.g., Francis Bacon the philosopher versus the artist). In human trivia games like Jeopardy! or Quizbowl (Chapter 5), this is easily solved by prompting humans to be more specific. The equivalent for machines is to answer from a large, but closed set of answers, each of which refer to a distinct concept such as named entities. By construction, answers to examples in WebQuestions (Berant et al., 2013) and SimpleQuestions (Bordes et al., 2015) are named entities from knowledge bases like Freebase (Bast et al., 2014). Classification is also the format we use in qb evaluation since most answers correspond to Wikipedia page entities (?5.3.4). The primary tradeoff of this evaluation scheme is that although answers are never ambiguous, if the set of candidate answers does not encompass all possible answers, then the 16In trec-8, since the requirement that the document support the span was not originally stated, nist allowed these ?unsupported? answers, but adopted this requirement for trec-9. 15 Page: Sun Question: what stage of the star life cycle is the sun in Long Answer: The Sun is about halfway through its main-sequence stage, during which nuclear fusion reactions in its core fuse hydrogen into helium. Each second, more than four million tonnes of matter . . . Short Answer: about halfway through its main-sequence stage Figure 2.4: Examples from nq (Kwiatkowski et al., 2019) use real user queries from Google search and annotate them with long and short answers. evaluation is less representative of true model effectiveness. Multiple Choice The final common answer format is multiple-choice, much like one might find on standardized education tests. This format is attractive from the modeling perspective since answering from a small set of options is a more tractable challenge. Along these lines, there are multiple-choice tasks for story (Richardson et al., 2013, MCTest), science (Welbl et al., 2017; Mihaylov et al., 2018; Clark et al., 2018), and commonsense reasoning (Talmor et al., 2019) questions. The main drawback to multiple-choice though is that creating plausible false answers is challenging and requires substantial conscious effort (Welbl et al., 2017; Talmor et al., 2019) or frameworks like the Winograd schema challenge or templates of understanding (Dunietz et al., 2020). Abstention Another common feature of tasks since trec-qa 2001 (Voorhees, 2001) and qa4mre at clef (Pe?as et al., 2013) is the possibility of a question having no answer and penalties for wrong answers. Trecqa 2002 took this a step farther and factored in the confidence scores instead of a simple binary decision (Voorhees, 2002a). The motivation behind this is that models should know what they do not know (Rajpurkar et al., 2018); Quizbowl takes this even farther by requiring this decision to be made many times per question, instead of only once. In other terms, models should have calibrated estimates of their confidence (Kamath et al., 2020) to help decide when there is legitimately no correct answer or when the model?s best guess is wrong. 
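To ground the abstention mechanics described above, here is a toy sketch of the simplest confidence-thresholded decision rule; the candidate probabilities and threshold are hypothetical, and a real system would calibrate the threshold on held-out data as in the calibration work cited above.

```python
def answer_or_abstain(answer_probs, threshold=0.5):
    """Toy abstention rule: return the highest-probability candidate answer
    only if its (ideally calibrated) confidence clears `threshold`;
    otherwise abstain by returning None."""
    best = max(answer_probs, key=answer_probs.get)
    return best if answer_probs[best] >= threshold else None

# Hypothetical model confidences for a question with a clear answer...
print(answer_or_abstain({"Apollo Program": 0.82, "Gemini Program": 0.18}))  # answers
# ...and for one where the model should admit it does not know.
print(answer_or_abstain({"Apollo Program": 0.41, "Gemini Program": 0.39}))  # None
```

In Quizbowl, the same kind of decision is made after every revealed word rather than once per question.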
Having covered the most common forms that answers take,17 we next discuss the information contexts used in various datasets before bringing everything together to characterize common generative stories of qa datasets. 2.2.2 Information Context We?as humans?draw on our knowledge and (sometimes) external sources (e.g., webpages or books) to answer questions. Although these sources are varied, when we evaluate qa models they are limited to the information in the training 17Although extensive, our list does not cover formats like lists of answers. 16 data or passed in as input. Generally, the format of qa examples either assumes a global context and passes only the question to the model, or it provides a specific context?such as a paragraph?for the model to use.18 Global Context Examples of global context qa include tasks like Jeopardy! (Dunn et al., 2017), Quizbowl, and ComplexWebQuestions (Talmor and Berant, 2018). In these tasks, there is no provided context, and any knowledge can answer the question. Thus, tasks based on search queries like early iterations of trec-qa, ms-marco, and NaturalQuestions are all global context questions.19 Although fac- toid qa is the most common type, some common sense reasoning tasks also fit this description. Local Context In contrast, other tasks require that models use a specific, local context to answer the question. The most obvious examples of this are story-based datasets that ask questions about what occurred in the story (context) to test read- ing comprehension (Ko?isk? et al., 2018; Tafjord et al., 2019). This is also common in factoid qa tasks like squad or drop (Dua et al., 2019) where (frequently) the question requires a paragraph (context) for it to be answered unambiguously (e.g., in Figure 2.3). In either case, the differentiator is not whether a certain local context is helpful in answering a question, but if it?in general?is required. Another example of local context are conversational qa datasets which inten- tionally make the dialog history important to answering questions. The earliest large sequential qa dataset is Trecqa 2004 (Voorhees, 2004), but it has been followed up on by many other likes coqa (Reddy et al., 2019), quac (Choi et al., 2018),20 and the Curiosity dataset (Chapter 3). 2.2.3 Generative Story Finally, having discussed the format and context of questions, we are prepared to treat them as part of the generative process in creating a question. If we view the creation of questions as sampling from a probability distribution, then specifying its generative process is helpful in identifying latent factors and observed factors that influence the types of questions asked. Although there are several latent factors we could define, such as the difficulty of questions as we explore in Chapter 4, for now we focus on observed factors. In particular, we model the data distribution of questions as having three factors: the question, the context, and the answer. Under this generative model, the defining aspect of the question distribution is the choice of which factor is selected or sampled first. For example, if we se- lect a context paragraph first, then the question will be influenced and perhaps 18Datasets also typically assume a particular temporal context (i.e., when it was created); an assumption tested in trec-qa 2006 (Dang et al., 2006) with questions like ?who is the president of the U.S.? whose answer is temporally dependent. 
explicitly depend on that context; on the other hand, if the question is created first (e.g., a search engine question), then it does not depend on any particular context. Although the Cranfield and Manchester paradigms are a separate concept, identifying the generative story is often helpful in identifying the core motivation of a qa dataset. When the answer to a question is chosen first, such as a teacher writing a test question, the task almost surely falls under the Manchester paradigm since by construction it does not originate from a genuine information need. Next, we describe these three generative stories for questions.

Context-First QA  Context-first questions are created when (optionally) a topic is selected, then a paragraph discussing the topic, and then a question is written that uses the context. At every step along the way, there are multiple points of influence, none of which are necessarily good or bad. As an example, let's consider squad 1.0 (Rajpurkar et al., 2016), where questions are written based on paragraphs from one of 490 Wikipedia pages (Figure 2.5).21 Cultural artifacts like Wikipedia, which many qa datasets such as squad depend on, are reflective of society and carry its biases (Reagle and Rhue, 2011); in Wikidata, which is partially derived from Wikipedia, 77% of people entities are male, which in turn influences who is asked about.22 Similarly, while it is obvious that the question depends on the context, a somewhat less predictable side effect is that questions are often lexically similar to the context, which makes them easier to answer through pattern matching (Sugawara et al., 2018; Rondeau and Hazen, 2018). This exact effect was observed years earlier in Trecqa-8,23 where questions "were often back-formulations of statements in the documents, which made the questions somewhat unnatural and also made the task easier since the target document contained most of the question words" (Voorhees, 2000a). In response, the next year's track used search queries from Microsoft's Encarta and Excite instead, which induces a different generative story.

19 Although NaturalQuestions provides gold passages with answers, questions are created independently of the document.
20 Elgohary et al. (2019) create a de-contextualized version of quac called canard.
21 442 in the training set and 48 in the development set.
22 We used the Wikidata query interface to derive this number.
23 Track participants submitted their own questions and were evaluated on questions from other participants.

Figure 2.5 (diagram; nodes: Distribution of Wikipedia Pages, Wiki Page, Paragraph, Question, Answer): squad's questions are influenced by the distribution of Wikipedia pages, the set chosen to be part of squad, and the text of the question. We call this generative process type context-first to emphasize that the context (a paragraph in this case) directly influences the question.

Figure 2.6 (diagram; examples: trec-9, NaturalQuestions, and ms marco; nodes: Search Queries, Question, Paragraph, Answer): In question-first qa, the question is created independently of any particular context (paragraph) or answer. This type of qa is most prevalent in information-seeking tasks like web search.

Figure 2.7 (diagram; examples: Quizbowl and Jeopardy!; nodes: Distribution of Wikipedia Pages, Answer, Question, Wiki Page): Answer-first questions are created by first thinking of a desired answer (or at least a topic area), then crafting a question with that answer.

Aside from
factoid qa, context-first qa is common in story or narrative qa questions that frequently tests if models make specific inferences from the story.24 Question-First QA In this generative process, the question is the first compo- nent created?datasets based on search query logs are an excellent example of this process. In question-first qa, questions often emerge from an information need and therefore presumably the answer?if there is one?is not known, so does not directly influence the question. This would naturally include many of the Trecqa tasks, ms marco, and NaturalQuestions. While some of these tasks provide supporting doc- uments, their inclusion does not directly influence the questions themselves since even questions with no answer are included. In these tasks, although providing a question-derived context is common, it is not a necessary component. 24Consider an example from Dunietz et al. (2020): Why did the dog come inside? Because: (1) it was raining outside, (2) dogs prefer not getting wet to getting wet, and (3) going outside while it is raining would get you wet as opposed to going outside when it is not raining and being dry. They propose to continue this questioning until the resulting facts are obviously true. 19 Answer-First QA The final major category of generative stories is answer-first qa where the answer is the first component conceived (Figure 2.7). In contrast with question-first qa, the answer or at least the topic of the question is specified a priori. This is most prevalent in trivia tasks like Jeopardy!, Quizbowl, and TriviaQA (Joshi et al., 2017) where the question writer aims to test a subject?s understanding or knowledge of a topic. One drawback of this approach?much as with context-first? is that often the distribution of answers is heavily skewed (?5.7.2). Like question- first, although datasets like TriviaQA also provide supporting documents, they again do not directly influence the question like a priori specification of the answer does. All this said, if the goal is to test knowledge of a particular answer or topic, answer- first is the go-to format. Human and Machine Authorship Although some datasets are entirely human- written, machines increasingly play a larger role in dataset construction. Whether a dataset is human-authored ( / ), machine-authored ( ), or authored by a com- bination also affects the distribution of questions?Chapter 6 investigates how co- operative human-machine authoring affects question difficulty. For example, babi is generated from a set of templates and thus is entirely machine-authored (Weston et al., 2016), WikiHop from a knowledge base by posing slot-filling as qa (Welbl et al., 2018), and WikiReading from aligning WikiData properties to contexts au- tomatically (Hewlett et al., 2016). While this automated approach makes it easy to quickly generate many questions, they tend not to be as diverse as human-authored examples. Still, these approaches show promise in testing for specific capabilities and comparing to more naturally constructed datasets (Liu et al., 2021). SimpleQues- tions (Bordes et al., 2015) and Complex WebQuestions (Talmor and Berant, 2018) take a middle-of-the-road approach and automatically select what relationships a question should ask about and then human annotators phrase them ( ? / ). Un- der this setup, an annotator might receive a knowledge graph triple like (Picard, played by, Patrick Stewart) and write ?who played Picard, the captain of the Enter- prise?? 
In another model for human-machine authorship, humans write questions and machines filter out questions that are too easy ( / ? ) as in drop (Dua et al., 2019) and quoref (Dasigi et al., 2019). Chapter 6 takes this a step farther and interactively constructs questions ( ? / ) like subsequently done in dialog (Nie et al., 2020), natural language inference (Nie et al., 2020), and the Winogrand chal- lenge (Sakaguchi et al., 2020). On the other end are questions authored entirely by humans, be it through crowdsourcing ( ) or some other means ( ). This distinguishes tasks like Quizbowl or Quasar-T (Dhingra et al., 2017) which are created by domain experts from crowd- sourced tasks like squad, coqa (Reddy et al., 2019), and quac (Choi et al., 2018). In the case of trivia games, writers have intrinsic motivations for crafting good ques- tions, players are intrinsically motivated to play them well (von Ahn and Dabbish, 2008; Wang et al., 2013), and this generally results in higher quality data (Kuznetsov, 2006). This distinction can be important since oftentimes crowd-sourced datasets are more vulnerable to effects from annotators (Geva et al., 2019). For example, data 20 quality can suffer if compensation is not appropriate for the task (Fort et al., 2011; Wang et al., 2013), if incentives are misaligned, or in the worst case both (Gneezy and Rustichini, 2000). In our categorization of qa datasets, we include author- ship information as well. While there are certainly other factors that influence the distribution of questions in qa datasets, these are some of the most important. Human-in-the-Loop Adversarial Examples Since the publication of the work in Chapter 6, several new datasets across a variety of nlp tasks have used model-in- the-loop methods to create more challenging datasets. Although it is presumed that datasets like HotPotQA require multi-step reasoning, Min et al. (2019) show that many do not require multi-step reasoning. Dua et al. (2019) and Dasigi et al. (2019) take aim at this problem by not allowing crowd-workers to submit questions answer- able by a strong baseline. These works and ours in Chapter 6 create single-round adversarial questions, but the process can be generalized to multiple rounds. After every round of data collection, models are re-trained on the original data plus the adversarial data which makes future questions even more difficult. This procedure has been done for reading comprehension (Bartolo et al., 2020), natural language inference (Nie et al., 2020), and detecting abusive messages in dialog systems (Dinan et al., 2019a). This human-in-the-loop method is one effective way to reduce lexical artifacts in datasets which in turn forces the development of better models (Kaushik et al., 2021). Dynabench (Kiela et al., 2021) draws these and other tasks together into shared evaluation infrastructure that supports iterative adversarial collection. For an extensive survey of these methods, we refer the reader to Wang et al. (2021). 2.3 Question Answering Datasets A major contributor to the efficacy of recent qa systems is the vast improve- ment in the availability of large-scale datasets. These datasets and corresponding tasks are extremely diverse, but it is nevertheless helpful to categorize them?where possible?by paradigm (?2.1), answer format (?2.2.1), information context (?2.2.2), generative process (?2.2.3), and domain. 
With those datasets in mind, Table 2.1 cat- egorizes the most frequently used qa datasets starting from 1999 to the present.25 For further reading, we refer the reader to a qa tutorial by Chen and Yih (2020) or surveys by Cambazoglu et al. (2020) and Rogers et al. (2021). Dataset Paradigm Author Area Citation Deep Read Manchester Stories Hirschman et al. (1999) trec-8 qa Cranfield News Voorhees (2000b) trec-9 qa Cranfield Search Voorhees (2000a) trec qa 2001 Cranfield Search Voorhees (2001) trec qa 2002 Cranfield Search Voorhees (2002a) trec qa 2003 Cranfield Search Voorhees (2003b) trec qa 2004 Cranfield Search Voorhees (2004) trec qa 2005 Cranfield Search Voorhees and Dang (2005) 25In this thesis, we use qa and reading comprehension interchangeably in reference to datasets. 21 Dataset Paradigm Author Area Citation trec qa 2006 Cranfield Search Dang et al. (2006) trec qa 2007 Cranfield Search Dang et al. (2007) qa4mre 2011-2013 Manchester Multiple Pe?as et al. (2013) MCTest Manchester Stories Richardson et al. (2013) WebQuestions Cranfield + Search Berant et al. (2013) CNN/Daily mail Manchester 26 News Hermann et al. (2015) Simple Questions Manchester ? Freebase Bordes et al. (2015) Children?s Book Test Manchester Stories Hill et al. (2016) babi Manchester Stories Weston et al. (2016) squad 1.0 Manchester Wiki Rajpurkar et al. (2016) WikiReading Manchester Wiki Hewlett et al. (2016) ms-marco Cranfield Search Nguyen et al. (2016) MovieQA Manchester Movies27 Tapaswi et al. (2016) race Manchester Exams Lai et al. (2017) triviaqa Manchester Trivia Joshi et al. (2017) SearchQA Manchester Trivia Dunn et al. (2017) Quasar-T Manchester Trivia Dhingra et al. (2017) SciQ Manchester ? Science Welbl et al. (2017) NewsQA Cranfield News Trischler et al. (2017) cwq Manchester ? 28 Wiki Talmor and Berant (2018) NarrativeQA Manchester Stories Ko?isk? et al. (2018) DuoRC Manchester Movies Saha et al. (2018) MultiRC Manchester Multiple Khashabi et al. (2018) HotpotQA Manchester Wiki Yang et al. (2018b) squad 2.0 Manchester Wiki Rajpurkar et al. (2018) QBLink Manchester Trivia Elgohary et al. (2018) WikiHop29 Manchester Wiki Welbl et al. (2018) OpenBookQA Manchester ? Science Mihaylov et al. (2018) qasc Manchester ? Science Khot et al. (2020) drop Manchester ? Wiki Dua et al. (2019) quoref Manchester ? Wiki Dasigi et al. (2019) quac Cranfield Wiki Choi et al. (2018) BoolQ Cranfield Search Clark et al. (2019) eli-5 Cranfield Reddit Fan et al. (2019) arc Manchester Science Clark et al. (2018) Record Manchester ? News Zhang et al. (2018b) ropes Manchester Wiki/Sci Lin et al. (2019) coqa Cranfield Multiple Reddy et al. (2019) CosmosQA Manchester Narrative Huang et al. (2019) Natural Questions Cranfield Search/Wiki Kwiatkowski et al. (2019) Quizbowl Manchester Trivia Rodriguez et al. (2019) Trickme Manchester ? Trivia Wallace et al. (2019b) MCScript Manchester Commonsense Ostermann et al. (2018) QuaRel Manchester ? Stories Tafjord et al. (2019) CommonSenseQA Manchester ? Commonsense Talmor et al. (2019) AmbigQA Cranfield Search Min et al. (2020) 26Although noun phrases to mask are created automatically, the source of questions?summary bulletpoints?and the articles are human created. 27Multi-modal, pairs video and transcripts 28The original questions are automatically created and then paraphrased by humans. 29MedHop is a second dataset in this publication. 22 Dataset Paradigm Author Area Citation Curiosity Cranfield ? Geopolitical Rodriguez et al. (2020) QuAIL Manchester Multiple Rogers et al. 
(2020a) Table 2.1: The table categorizes datasets by the task?s paradigm, the Author (human , machine , or a mix), and topic domain. To be categorized as human- generated, it is not enough for the dataset to consist of naturally occurring text; it must have questions that are authored by humans instead of being automat- ically generated. ? and ? indicate human-generated questions that are non-trivially checked or modified by a machine system. ? and ? indicate machine-generated questions that are paraphrased by humans. ? and ? indicate question creation incorporates interactive human-machine cooperation. 2.4 Methods for Question Answering This section reviews methods for qa answering, beginning with those from be- fore the statistical machine learning era and proceeding to modern neural methods. 2.4.1 A Brief History of Symbolic Question Answering Models Early qa work focused on building systems for narrower, more well-defined, and thus tractable problem domains. Each of these systems developed means of combining knowledge of syntax (grammar), semantics (the meaning of words), and reasoning (ability to deduce and connect facts). baseball showed the ability to ma- nipulate a knowledge-base of baseball games through natural language (Green et al., 1961) which inspired later work in querying databases with natural language (Copes- take and Jones, 1990). At a larger scale, lunar showed that querying information in natural language?in this case, metadata about Moon mineral samples?was use- ful to Apollo Program scientists (Woods, 1972). Other work like student solved simple word algebra problems by parsing numerical expressions and reasoning about relations between them (Bobrow, 1964). qa was also used to show that shrdlu could manipulate objects in a simulated world (Winograd, 1971). Many other sim- ilarly limited symbolic qa systems were developed, but research eventually shifted to the statistical-based methods with the creation of the question answering track at the Text Retrieval Conference (trec) in 1999. Using the Trecqa track, systems like mulder (Kwok et al., 2001) showed that qa systems significantly reduced user ef- fort when their information-need is satisfied by short answers. Contemporaneously, systems like proverb (Keim et al., 1999) solved crossword problems by combining multiple qa methods with a centralized probabilistic model the decides on the best solution. 23 2.4.2 Modern Methods for Question Answering Modern qa systems use either classical methods, deep learning methods, or combine both. The shared goal of both methods is to represent text numerically in a way that can be used to process questions, retrieve relevant documents, and compute an answer. Both families of methods represent text as numerical vectors, but derive their representations differently. Generally, classical methods encode text with a vector space model (Salton et al., 1975) while deep learning methods use distributed word representations (Mikolov et al., 2013). In both cases, the objective is to represent texts such that those that are semantically similar have high similarity scores while those that are semantically different have low similarity scores. This exploits the intuition that the answer to a question?such as ?who was the first woman to fly to space??is more likely be located in a document that is semantically similar to the question than not. Vector space models compute semantic similarity by considering the overlap between the words mentioned in documents. 
Intuitively, when terms in a question overlap with terms in a document, that document is more likely to contain useful information. The most effective vector space models for qa-motivated ir are term frequency-inverse document frequency (tf-idf) methods (Jones, 1972; Rajaraman and Ullman, 2011) like bm25 (Robertson and Walker, 1994). The intuition behind these methods is two-fold: (1) the number of occurrences of a term in a document is correlated with its importance to the document, and (2) the prevalence of a term across all documents is inversely correlated with its importance. In ir and qa, methods like tf-idf are particularly effective because the words of interest, such as named entities, are precisely those that are up-weighted. However, there are two disadvantages to this approach.30 30Additionally, out-of-vocabulary issues are also amplified in qa due to the prevalence of named entities. First, although local word order can be encoded by using n-grams instead of unigrams, longer-range word order, and thus longer-range dependencies, cannot be modeled with n-grams.31 31The number of unique n-grams, O(|V|^n), grows exponentially with n for a vocabulary V; thus the probability of observing any given n-gram in natural text decays exponentially with n. The second disadvantage is that even with mitigations like stemming, the similarity of related yet unmentioned terms (such as the semantic relevance of "astronaut" to the question above) is zero. These shortcomings motivated the development of distributed word representations.

Distributed word representations aim to find representations of words such that words mentioned in similar contexts have similar representations. For example, "space" and "astronaut" should have representations that are mathematically similar since they occur in similar contexts. Concretely, for a corpus comprised of V unique words, each word is represented as a row in a d \times V-dimensional matrix W_t. Conceptually, one might imagine that each of the d dimensions encodes semantic aspects of each word like tense, plurality, or "spaciness," although none of these are defined or guaranteed a priori. Terms are considered more similar if their dot product is more positive and less similar otherwise. Since the numerical representations of words are not determined a priori, they must be trained. For example, Word2Vec (Mikolov et al., 2013) and glove (Pennington et al., 2014) are algorithms that create representations such that words that appear in similar contexts have similar representations. Word similarity can be extended to sentence similarity by aggregating word representations through operations like summation, averaging, or maxout. There remain at least two drawbacks to distributed word representations: (1) during inference, word order, and more generally context, is still not factored in (e.g., whether "apple" refers to the fruit or the company is context-dependent), and (2) after training, these representations are not further updated (fine-tuned) based on the target downstream task. Order-aware neural networks address both of these issues. These models begin by representing the tokens w = [w^{(1)}, . . . , w^{(n)}] with vector embeddings [v^{(1)}, . . . , v^{(n)}] and then build context-aware representations [h^{(1)}, . . . , h^{(n)}] that the downstream task uses. Section 2.4.3 describes several order-aware models for building these representations. Pre-trained word embeddings are also a powerful mechanism for transfer learning.
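To make the contrast concrete, the following is a minimal sketch that uses toy, hand-built vectors (rather than trained Word2Vec or glove vectors, which a real system would load) to show how averaged distributed representations give a non-zero similarity to related but unmentioned terms, whereas exact term overlap gives none.

```python
import numpy as np

# Toy 4-dimensional "word vectors"; real systems would load trained
# Word2Vec or GloVe vectors instead (these values are made up).
vectors = {
    "woman":     np.array([0.9, 0.1, 0.0, 0.3]),
    "fly":       np.array([0.1, 0.8, 0.2, 0.0]),
    "space":     np.array([0.2, 0.7, 0.9, 0.1]),
    "astronaut": np.array([0.3, 0.6, 0.8, 0.2]),
    "mission":   np.array([0.2, 0.5, 0.7, 0.1]),
}

def encode(tokens):
    """Average the word vectors of in-vocabulary tokens."""
    return np.mean([vectors[t] for t in tokens if t in vectors], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

question = ["woman", "fly", "space"]
passage = ["astronaut", "mission"]

# Exact term overlap (the tf-idf view) is zero: no shared tokens.
print(len(set(question) & set(passage)))
# Averaged embeddings still yield a high similarity.
print(round(cosine(encode(question), encode(passage)), 3))
```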
Initializing order-aware models with parameters from pre-trained neural language models (McCann et al., 2017; Peters et al., 2017, 2018; Devlin et al., 2018) is similarly effective at improving downstream tasks. Lastly, since these models are fully differentiable, as long as the end task (or a suitable proxy) is expressible as a loss function, the full model can be fine-tuned to the end task. Although these models have downsides, such as being difficult to train and computationally costly, the improvements in metrics like accuracy and recall are well worth the tradeoff.

Modern qa systems combine some or all of these textual representations to find documents relevant to the question and to answer it. The most common, general architecture for qa breaks the problem into two components: (1) an optional document retrieval system that finds passages relevant to the question, and (2) a system that (possibly using the retrieved documents) derives a final answer. Answers can take multiple forms, such as free response, selection from a closed set, multiple choice, and span selection. These answers are used along with an evaluation measure to give the system feedback during training.

This architecture is advantageous for three reasons. First, it logically separates two related but distinct problems: collecting documents likely to contain information relevant to the question and reasoning about their contents to arrive at an answer. This architectural inductive bias mimics one human approach to information-seeking: we rapidly collect seemingly relevant documents (e.g., via Google search or by asking a librarian) and then read those documents more carefully to find the answer. Thus, these sub-problems can be studied in isolation first and then composed and improved together later. Second, this approach makes qa tractable even when there are millions or billions of documents. For both humans and machines, reading a document with a specific question in mind is computationally expensive; doing so for every possible document is intractable and computationally wasteful since only a small fraction of documents are relevant for any given question. By separating these problems, the document retrieval step can be designed to be scalable and efficient through classical approaches like inverted indices (Zobel and Moffat, 2006) or efficient nearest neighbor search over numerical vector representations (Lee et al., 2019).

Figure 2.8: Deep averaging networks, recurrent neural networks (e.g., lstms and grus), and transformer networks (e.g., bert) are common architectures in neural nlp models. To produce text representations of a fixed size (i.e., not a sequence of vectors), each architecture takes a different approach. Deep averaging networks average word vectors and then pass the subsequent vector through one or more feedforward layers; recurrent models often concatenate the last hidden state in each direction; transformer networks usually have a special token associated with the input's general representation (in bert, the cls token). (The example input in the figure is "Captained by George Nares, the HMS Challenger lay the foundation of oceanography.")
Third, this format fits a variety of tasks without additional work: qb uses only step two,32 squad uses both steps (but the document is provided), and Open Domain qa (Chen et al., 2017) uses both. Combined, these benefits have made this two-step approach common for modern qa (Chen and Yih, 2020). Next we describe specific neural methods for text representation (Section 2.4.3), classical and differentiable document retrievers (Section 2.4.4), and neural answering models (Section 2.4.6).

2.4.3 Neural Text Encoders

This section reviews several common neural text encoding methods used in ir and qa models. These encoders create question and document representations that models then combine to derive an answer. At a high level, these methods aim to represent text (questions, documents, or answers) so that semantically similar concepts have similar encodings.

Given text t with tokens [w^{(1)}, . . . , w^{(n)}], a text encoder should return a sequence of representations [h^{(1)}, . . . , h^{(n)}], one for each token.33 For tasks where predictions are not per-token (e.g., predicting the start of a span), encoders should also define an aggregated representation h^{(*)}. Despite compressing the meaning of a text to a fixed-length vector, the aggregated representation still captures substantial semantic information (Conneau et al., 2018). We assume that the per-token representations, the aggregate representation, or both are ultimately incorporated into the model's loss function, which is then optimized via backpropagation and stochastic gradient descent (Rumelhart et al., 1986; Goodfellow et al., 2016). Figure 2.8 shows several common text encoders in nlp: deep averaging networks, recurrent architectures, and transformer architectures (§5.5.3).34 32Finding relevant documents could aid prediction, but we do not provide human-approved annotations. 33We use the term token embedding since it can combine the word and character embeddings for a specific token.

Word embeddings are the simplest approach, where each word in a vocabulary of size V is represented by a d-dimensional vector v_i of trainable parameters. In addition to representing distinct terms, token embeddings can be replaced by character embeddings (Kim et al., 2016a) or used in addition to them (Joulin et al., 2017). While the parameters can be trained from scratch on the target task's training data, it is also common to pre-initialize them with weights from Word2Vec or glove and then update these representations; this is particularly effective when there is scarce training data for the target task. Like "frozen" word embeddings, word sequences can be aggregated to h^{(*)} through operations like average pooling

h^{(*)} = \frac{1}{n} \sum_{i=1}^{n} v^{(i)}    (2.1)

and max pooling

h^{(*)} = \max_{i=1}^{n} v^{(i)}    (2.2)

of the word vectors. Alternatively, the word vectors v^{(1)}, . . . , v^{(n)} can be passed through one or more feedforward layers

h_0 = \frac{1}{n} \sum_{i=1}^{n} v^{(i)}    (2.3)

h_n = f(W_{n-1} \cdot h_{n-1} + b_{n-1})    (2.4)

with parameters W_n, b_n and a non-linearity f, as in a deep averaging network (Iyyer et al., 2015, dan). Although word embeddings alone and dans are unordered representations, they are still effective for some qa and sentiment analysis tasks where word order is less important, and they are attractive in terms of speed and computational budget. Next, we consider other composition functions.

Throughout this thesis, we adopt the notation that given a sequence of tokens w = [w_1, . . . , w_n] and a composition function encoder,

encoder^{(i)}(w)    (2.5)
34We do not cover convolutional architectures since they are less common for nlp. 27 denotes the encoder?s output representation of word w(i) and encoder(w) (2.6) denotes the aggregated representation?as defined by the model. For example, dan(w) = hn is the aggregated dan representation and lstm(i)(w) is an lstm?s contextualized representation of the ith input token (?5.5.3). In bert-based mod- els, we amend this slightly and will refer to the representation of the cls token as bertcls(w). The most common order-aware neural composition functions are simple recur- rent networks (Rumelhart et al., 1986), long short-term memory networks (Hochre- iter and Schmidhuber, 1997, lstm), gated recurrent networks (Cho et al., 2014b, gru), and transformer networks (Vaswani et al., 2017). Recurrent representations (rnns, lstms, and grus) represent sequences (in the first layer) sequentially: each word?s representation is conditioned on all previous words. For tasks where repre- sentation in both directions is important?such as text span selection?the opposite direction can be modeled by concatenating the original representation with the rep- resentation of the reversed sequence (bidirectional) (Schuster and Paliwal, 1997). Additionally, these layers can be stacked on top of each other; empirically, lay- ers deeper in the network have more abstract representations (Peters et al., 2018). Transformer networks?especially large-pretrained language models like bert (De- vlin et al., 2018)?are another powerful composition function. Unlike recurrent net- works, these transformers are comprised of alternating attention and feedforward layers, and as a consequence the representation of every word is (symmetrically) conditioned on the representation of every word in the prior layer. While these models can be challenging to train due to training dynamics, computation speed, and memory consumption, they are the foundations of most state-of-the-art mod- els in qa. For a survey of bert-related work see Rogers et al. (2020b). Next, we describe how these text encoders are used to construct document retrievers and question answering modules. 2.4.4 Document Retrieval The first component of a two-stage qa system is the document retriever. The responsibility of the document retrieval module is to return?from a corpus of doc- uments D = {d1, . . . , dn}?a small list of nk documents Dk or even only one that is highly relevant to a question q. In qa systems, document retrievers play the role of librarians. It would be infeasible for a librarian to carefully read every book in a library every time a patron asked a question; instead, librarians categorize and index libraries so that they can quickly provide general suggestions. The objective of these systems is to design a document encoding Ed(di), a question encoding Eq(q), and scoring function f such that a scalar score si = f(eq(q),ed(di)) (2.7) is correlated with labeled relevance judgments. For either a human or machine li- brarian, the primary constraint in retrieving documents is the computational cost 28 incurred for resolving a query q as the number of documents nd in the corpus grows. In practice, document collections can include hundreds of thousands of news arti- cles (Voorhees, 2000b), millions of Wikipedia pages, or billions of webpages (Crawl). Thus, commonly used document retrievers first pre-index D and then at inference time use that index to efficiently return a list of the nk documents with highest scores si. 
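As a deliberately simplified sketch of this setup, the snippet below uses a deep averaging network (Equations 2.3 and 2.4) as both E_q and E_d, scores documents with a dot product, and pre-computes the document index once; the dimensions, names, and random stand-in data are illustrative only, not the systems used later in this thesis.

```python
import torch
import torch.nn as nn

class DanEncoder(nn.Module):
    """Deep averaging network: average token embeddings, then feedforward layers."""
    def __init__(self, vocab_size: int, emb_dim: int = 64, hidden_dim: int = 64):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, emb_dim)
        self.feedforward = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); average embeddings (Eq. 2.3), then feedforward (Eq. 2.4).
        averaged = self.embeddings(token_ids).mean(dim=1)
        return self.feedforward(averaged)

vocab_size, n_docs, top_k = 1000, 500, 5
encoder = DanEncoder(vocab_size)

# Pre-index the corpus once: encode every document in D = {d_1, ..., d_n}.
doc_ids = torch.randint(0, vocab_size, (n_docs, 40))   # stand-in tokenized documents
with torch.no_grad():
    doc_matrix = encoder(doc_ids)                      # (n_docs, hidden_dim)

# At inference time, score a question against the index: s_i = f(E_q(q), E_d(d_i)).
question = torch.randint(0, vocab_size, (1, 12))
with torch.no_grad():
    scores = encoder(question) @ doc_matrix.T          # dot-product scoring
print(torch.topk(scores.squeeze(0), k=top_k).indices.tolist())
```

Real retrievers swap in stronger encoders, indexing structures, and scoring functions, as described next.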
Next, we describe the classical and neural approaches used to define the encoders and scoring functions. The predominant method for classical document retrieval indexes documents as bags of words (Jones, 1972) using a tf-idf (Rajaraman and Ullman, 2011) encoder E_tfidf, stores them in an inverted index (Zobel and Moffat, 2006), and uses a bm25 scoring function f_bm25. For qa, the terms valued by tf-idf are usually correlated with relevance judgements. In Chapters 5 and 6, we use bm25 scoring for document retrieval:

bm25(q, D) = \sum_{i} IDF(q^{(i)}) \cdot \frac{g(q^{(i)}, D) \cdot (k_1 + 1)}{g(q^{(i)}, D) + k_1 \cdot (1 - b + b \cdot \frac{fieldLen}{avgFieldLen})}    (2.8)

IDF(q^{(i)}) = \ln\left(1 + \frac{docCount - g(q^{(i)}) + 0.5}{g(q^{(i)}) + 0.5}\right)    (2.9)

The function g(q^{(i)}, D) represents how many times the token q^{(i)} occurs in document D, and g(q^{(i)}) is the number of documents that contain token q^{(i)}.35 35b and k_1 are hyperparameters. Apache Lucene (Foundation) and ElasticSearch (Gormley and Tong, 2015) are two industry-standard, user-friendly, and scalable implementations of this document retriever. DrQA (Chen et al., 2017) shows this approach is effective for open-domain qa, and we show in Chapter 5 that it is a strong baseline for qb.

2.4.5 Neural Document Retrieval

Neural networks have also proven to be effective document retrievers, and their design has evolved from augmenting the output of a classical system to replacing it entirely (Mitra and Craswell, 2018). Initial neural systems do not replace the retrieved document set D_k, but instead re-rank its results (Wang et al., 2018). If the underlying system has high recall at n_k but low precision in the first few documents, then this method is effective. Neural networks can also adaptively determine an appropriate number of documents to retrieve to reduce noise in the results (Kratzwald and Feuerriegel, 2018). Other work uses neural networks to reformulate queries to a classical ir retriever (Buck et al., 2018) or even to iteratively retrieve documents (Das et al., 2018). As effective as they are, these systems are restricted by the accuracy and recall of the underlying, black-box ir system.

The current state of the art in document retrieval replaces classical systems with an end-to-end neural approach. Like classical systems, neural retrievers define a similarity score s_i = f_n(E_d(d_i), E_q(q)) that factorizes the representation of the documents and questions. This factorization allows all the document encodings to be pre-computed once training is complete. The primary challenges are defining the scoring function in a way that is computationally scalable in n_d, defining the form of the encoders, and defining the loss function that encoder parameters are trained with. Most document retrievers encode questions and documents with text encoders from Section 2.4.3. For example, orqa (Lee et al., 2019), realm (Guu et al., 2020), dpr (Karpukhin et al., 2020), rag (Lewis et al., 2020b), and Colbert (Khattab and Zaharia, 2020) use the bert architecture for both E_q and E_d and later fine-tune it with a retrieval-specific loss function. Similarly, all of these models use maximum inner product search (Ram and Gray, 2012, mips), typically just the dot product, as the similarity function. Using dpr as an example, the question encoding bert("what was the first Mars rover?") is compared using the dot product to every Wikipedia passage encoding bert(d), including one mentioning the answer "Sojourner"; the highest scoring passage is returned.
Exhaustive calculation of dot product similarity is efficient even across all documents D (Abuzaid et al., 2019) and can be made more efficient through approximate nearest neighbor search (Johnson et al., 2019). Next, we describe how the retriever?s parameters are trained. Although models vary in exact training procedure, the general approach is to define a metric learning loss based on the score and relevance label of the doc- ument (Kulis, 2012). For example, the loss may be binary cross-entropy using a binary relevance label and a (probability) score ranging from zero to one. In par- ticular, this framing imposes the inductive bias that the document and question encodings should have a high score fn(Ed(di), Eq(q)) when the document is labeled as relevant. The labels for document relevance may be defined in the task itself (e.g., squad or nq) or automatically generated like orqa?s Inverse Cloze Task (Lee et al., 2019). While this general approach is not the only way to train these encoders, it is one effective and general-purpose approach. The next component of the two-stage qa system is the model for deriving answers from the question and/or document. 2.4.6 Neural Answering Module The role of the answering module is to?given the question and (optionally) relevant documents?return the most likely answer. The form of the answers de- pends on the task; for qb it is a Wikipedia entity, for squad and nq a text span, and for open qa free text. Although the documents for the answer module may already have an encoding from the classical or neural retriever, most answer mod- ules re-encode the question, document, or both. This is desirable since the retriever representations of the question and document are independent of one another and therefore not particularly expressive. Here we describe neural architectures and loss functions for (1) models that take only the question as input and (2) that take the question and a document as input. The simplest architecture for qb is classification over the answer classes (dis- tinct Wikipedia titles). As in the retriever, the question can be represented by any of the text encoders in Section 2.4.3. Following this, the representation can be passed to any number of hidden layers before being projected to the classification 30 dimension. This output layer has logits corresponding to each answer label, and the model is trained with cross-entropy loss. For further details on this type of model, we refer the reader to Section 5.5. For tasks like squad and nq (or qb with supporting evidence like triviaqa), the prediction is the shortest text span in the document most likely to contain the answer. Thus, one goal of re-encodings in these tasks is to jointly encode the question and document to identify tokens that are most relevant to the answer. For example, bidaf (Seo et al., 2017) adds question-to-document and document- to-question attention mechanisms; interpreting the document with the question ?in mind? has shown to be important in qa (Weissenborn et al., 2017). More recent architectures like transformers jointly model questions and documents through all layers. Other models may jointly model this and additional information such as graph connections (Zhao et al., 2020a,b). In all of these models, the sequence of outputs is used to classify whether each token is the start or end of the correct answer span. 
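To ground the span-selection objective, here is a minimal sketch of a start/end classification head over contextualized token representations; the encoder is stubbed out with random vectors, and the layer names and dimensions are illustrative rather than those of any specific system described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, seq_len, batch = 128, 50, 2

# Stand-in for contextualized token representations [h^(1), ..., h^(n)]
# from any encoder in Section 2.4.3 (e.g., a bidirectional LSTM or BERT).
token_reps = torch.randn(batch, seq_len, hidden_dim)

# One linear layer produces a start logit and an end logit per token.
span_head = nn.Linear(hidden_dim, 2)
start_logits, end_logits = span_head(token_reps).split(1, dim=-1)
start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)

# Gold answer span boundaries (token indices) for each example in the batch.
gold_start = torch.tensor([5, 17])
gold_end = torch.tensor([9, 21])

# Cross-entropy over token positions, averaged over the start and end boundaries.
loss = 0.5 * (F.cross_entropy(start_logits, gold_start)
              + F.cross_entropy(end_logits, gold_end))
loss.backward()

# At prediction time, take the argmax start/end (real systems also enforce start <= end).
print(start_logits.argmax(dim=-1).tolist(), end_logits.argmax(dim=-1).tolist())
```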
Throughout this thesis we use classical and neural ir and qa as parts of dialog systems (Chapter 3), systems that play qb (Chapter 5), and a human-in-the-loop question writing interface (Chapter 6). Before moving to this work, we briefly review prior work in qb that laid the foundation for Chapters 5 and 6.

2.5 Prior and Related Work in Quizbowl

Although qb has been played for decades, the first online interface was only created in 2012 (Figure 2.9a). After a day online, over 7,000 games were played, and by the end of the two-week data collection period, 43,000 questions had been played by 461 users (Boyd-Graber et al., 2012). After the initial interface was taken offline, enterprising members of the qb community reacted to the sudden deprivation by resurrecting the interface. Over the years, its successor Protobowl (http://protobowl.com) collected orders of magnitude more data and built a competitive experience by adding multiplayer support (He et al., 2016b). On both platforms, users play questions from prior tournaments: words in the question are revealed one by one until the player attempts to answer the question. Every time a question is played, the word that the player buzzed on, their answer, and whether their answer was correct are recorded. In Chapter 5, we discuss how we updated this dataset to include 3.9 million records from over ten thousand users.

The qb-related work in this thesis uses "tossup" questions, but there are other types, such as the bonus questions in qbLink. Each bonus question contains multiple parts, and the player answers each one-by-one. Bonus questions reward multi-step reasoning by making later questions dependent on correctly answering previous questions. qbLink (Elgohary et al., 2018) builds a dataset of these bonus questions. While qb and its educational focus are a distinctly positive example of intellectual competition, the measurement of human intelligence more generally has not always been a net benefit to society.

(a) The 2012 interface was the first way to play qb online (Boyd-Graber et al., 2012). (b) The qb interface from He et al. (2016b) for collecting gameplay. Beyond using modern web frameworks, it also allows for real-time competitive play and chat with other players. Figure 2.9: The original qb interface and a popular modern interface for playing qb online. Both interfaces reveal questions word by word until a player interrupts the system and makes a guess.
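For readers who want a mental model of the gameplay data described above, each play can be thought of as a small record like the sketch below; the field names are hypothetical and simplified, not the actual Protobowl schema.

```python
from dataclasses import dataclass

@dataclass
class GameplayRecord:
    """One play of one tossup question by one user (illustrative fields only)."""
    user_id: str
    question_id: str
    buzz_position: int   # index of the word the player buzzed on
    guess: str           # the answer the player gave
    correct: bool        # whether the guess matched the gold answer

record = GameplayRecord(
    user_id="user-42",
    question_id="acf-nationals-2007-tossup-17",
    buzz_position=58,
    guess="Turing machine",
    correct=True,
)
print(record)
```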
32 2.6 The Harrowing Past of Measuring Human Intelligence This thesis takes inspiration for measuring machine intelligence from how hu- man intelligence has been measured in psychometrics,36 a field concerned with the measurement of intelligence, skill, and abilities in humans (Guilford, 1954). There- fore, it feels necessary to acknowledge and raise awareness of how measurement of human intelligence was used to justify eugenic and racist policies that contributed to horrific human rights violations and discuss how this history is relevant to the present day. At least two pioneers of psychometrics, Francis Galton and J. McKeen Cattell (Jones and Thissen, 2006), were eugenicists that held repugnant views advo- cating that society ?welcome and support the eugenic movement tending to limit the birth of feeble-minded and defective children? (Cattell, 1915). In the United States, the legal implementation of forced sterilization of the ?feeble minded? was endorsed by the U.S. Supreme Court in Buck v. Bell (Court, 1927; Gross, 2016). Similar justifications were employed by 30 states in legislation that led to the sterilization 60,000 U.S. citizens across 30 states (Norrgard, 2008) including Puerto Rico where ?about one third of women of childbearing age, a figure that remained constant through the 1980s,? were sterilized (Briggs, 2003). Eventually, ?America?s eugenic movement spread to Germany as well, where it caught the fascination of Adolf Hitler and the Nazi movement,? directly contributing to the Holocaust (Black, 2012). Faced with the need and opportunity to ?scientifically? support unscientific, eugenic views like scientific racism (Gould, 1981), ?eugenicists labored to devise objective methods of measuring and quantifying valued traits, including intelli- gence? (Reddy, 2007). The Alpha and Beta group intelligence tests of WW1 are the first example of large-scale intelligence tests that directly influenced access to opportunities (for military recruits in this case). These tests ?became models for future group tests? (Spring, 1972) despite the fact that a developer of these tests, the eugenicist Henry H. Goddard, later admitted ?we do not know what intelligence is? (Goddard, 1920). The same intelligence tests were later used to support the eu- genic Immigration Act of 1924 (Gelb et al., 1986; Jacobson, 1999, p. 83?85).37 While the Alpha and Beta intelligence tests?created and promoted by eugenicist Carl C. Brigham?are lesser known to the common person, their immediate intellectual de- scendant is known to nearly every college student as the Scholastic Aptitude Test or sat (Lemann, 2000; Black, 2012).38 Despite this inter-connected history, some still (falsely) claim that ?intelligence tests are not culturally biased against American blacks or other native-born, English-speaking peoples? (Arvey et al., 1994; Gold- stein, 2012). As historian and scholar Ibram X. Kendi writes, Standardized tests became the newest ?objective? method of proving Black intellectual inferiority and justifying discrimination, and a multimillion- 36For example, Item Response Theory (4) was originally developed in psychometrics and is used in Chapter 4. 37The Immigration Act severely restricted immigration to the United States ?to preserve the ideal of U.S. homogeneity? (U.S. Department of State Office of the Historian, 2021). 38While Brigham later recanted his view that the sat could measure intelligence, substantial irreparable damage had already been done by fueling eugenic discourse. 
33 dollar testing industry quickly developed in schools and workplaces (Kendi, 2016, p. 310). Since then, Kendi has called standardized tests ?the most effective racist weapons ever devised to objectively degrade Black and Brown minds? (Kendi, 2020) and some in psychology have argued that ?the validity of iq tests is questionable? (Weiten, 2016, p. 281). In a world where numbers?such as the sat (and other intelligence tests)? and algorithms play crucial roles in structuring society (O?Neil, 2016, p. 134), it is crucial to consider the role that technology plays in society. The evolution of psychometrics and the standardized testing industry is one example of a ?New Jim Code: the employment of new technologies that reflect and reproduce existing inequities but that are promoted and perceived as more objective or progressive than the discriminatory system of a previous era? (Benjamin, 2019, p. 5).39 This and numerous other examples emphasize that, No object or algorithm is ever either good or evil itself. It?s how they?re used that matters. . . Forming an opinion on an algorithm means un- derstanding the relationship between human and machine. . . It?s about asking if an algorithm is having a net benefit on society (Fry, 2018, p. 3). In the context of testing machine intelligence, the goals of this thesis are to use this technology to better serve humans seeking to expand their knowledge (e.g., Chapter 6) and improve how we compare the effectiveness of machines on qa tasks (e.g., Chapters 4). As we highlight in Chapter 7, we should?as a field?critically think about the ways that incentive structures in benchmarks influence the type of algorithms and technology rewarded with recognition and use these structures as a way to reward algorithms that are a net benefit to society. Finally, as we discuss multiple times in this thesis, comparisons between humans and machines on qa tasks should not be used to make claims of the intellectual superiority of one over the other. In the next chapter, we revisit the Cranfield paradigm and introduce a new dataset for conversational information-seeking. Following this, Chapter 5 describes the traditions of qb as a trivia game, frames it as a machine learning task, builds models to play the game, and evaluates them with offline metrics and live exhibition matches. Afterward, Chapter 6 introduces a methodology for centaur authoring of adversarial questions. We conclude by outlining directions for future work. 39Benjamin?s coining of the term a ?New Jim Code draws on The New Jim Crow, Michelle Alexander?s (2012) book that makes the case for how the US carceral system. . . permits legalized discrimination? (Benjamin, 2019, p. 8). 34 Chapter 3: Information Seeking in the Spirit of Learning: A Dataset for Conversational Curiosity So great is our innate love of learning and of knowledge that no one can doubt that man?s nature is strongly attracted to these things even without the lure of any profit. Marcus Tullius Cicero, 45BCE Conversations are a natural form for humans to seek information, and there are decades of study on formal dialogues and interactions of users with reference librarians. The natural next step is to design automated systems that are ?virtual librarians?, eliciting information needs, correcting misconceptions, and providing the right amount of information at the right time across all possible domains. 
Research Frontiers in Information Retrieval: Report from the Third Strategic Workshop on Information Retrieval (Culpepper et al., 2018) With working definitions of the Cranfield and Winograd paradigms in hand, we introduce a dataset for conversational information-seeking that whose evaluation follows the Cranfield paradigm. This work contributes a large resource of 14K di- alogs to the nascent area of conversational information-seeking (Dalton et al., 2020), investigates various means to evaluate the effectiveness of dialog strategies, and pro- poses a baseline model for the dataset?s dialog task.1 At the heart of the task is a dialog between a user and an assistant?role-played by human annotators?where the user is encouraged to engage in curiosity-driven inquiry about a geopolitical entity. We design dialog interfaces so that while they converse, we record infor- mation that characterizes the assistant?s dialog strategies and measure the user?s engagement both directly and indirectly. With this data, we validate that answers to user questions that build on pre-existing knowledge tend to increase engagement as measured by the number of times that the user asks a followup question. In the next chapter, we contrast this dataset?s goal of increasing user engagement with Quizbowl?s goal of testing knowledge. 1This chapter is based on the emnlp publication Rodriguez et al. (2020). 35 U: , tell me about Tahiti. A: It?s the largest island in French Polynesia, near the center of the Pacific U: What is its history with France? Figure 3.1: An example of information-seeking dialog that the Curiosity dataset aims to support. Assistants should answer user questions and convey information that inspires meaningful followup questions. 3.1 Motivation for Conversational Information-Seeking Humans are naturally epistemically curious (Berlyne, 1954) and conversational agents?such as Alexa, Siri, and Google Assistant?should encourage this by helping users discover, learn, and retain novel factual information. More generally, systems for conversational information-seeking should help users develop their information need, be mixed-initiative, incorporate user memory, and reason about the utility of retrieved information as a combined set (Radlinski and Craswell, 2017). We focus on a curiosity-driven, information-seeking scenario where a user starts a conversation with an assistant by asking an open-ended question and then drills down into interest areas (Figure 3.1). In this setting, what policies should assistants pursue to maintain the user?s in- terest in the topic? Theories of human learning, such as Vygotsky?s zone of proximal development, posit that learning novel information should be rooted in pre-existing knowledge and skills of the learner (Chaiklin, 2003). Considering this, a good policy may give general information about Tahiti; a better policy would select information related to the user?s knowledge (e.g., familiarity with France); along similar lines, Jennings (2006, p. 81) articulates that ?if you already know a few little bits. . . the new information has something to cling to, barnacle-like, in your synapses.? We hypothesize that engagement and by proxy satisfaction is correlated with policies that integrate a user?s pre-existing knowledge, and we test this through a large-scale, Wizard-of-Oz (woz) style collection (Kelley, 1984; Wen et al., 2017) that captures assistant policies, user reactions, and topically relevant entities that the user knows about. 
This dataset, the Curiosity dataset, has 14,048 English dialogs annotated with sentence-level knowledge grounding, the user's prior knowledge, dialog acts per message, and binary ratings per message.2 2Dataset and code at curiosity.pedro.ai. In our dialog task (Figure 3.2), one worker takes the role of a curious user learning about a geographic entity and the other of a digital assistant with access to Wikipedia facts (§3.2). At the start of each dialog, the user is assigned an entity as their topic (e.g., Puerto Rico) along with two aspects (e.g., history and demographics) to investigate. Beforehand, we show the user a list of entities related to the topic, and they mark which they know; these entities are a sample of their pre-existing knowledge. The user engages in open-ended discovery while the assistant simultaneously answers the user's questions and proactively introduces facts likely to prompt followup questions. For example, if the assistant knew of a user's familiarity with astronomy when providing information about Puerto Rico, then the user is more likely to engage with and remember facts about the Arecibo Observatory.

Figure 3.2: We sample pre-existing knowledge by asking users to indicate which topically related entities they already know. The assistant paraphrases facts related to either known entities (rooted facts), an aspect (aspect facts), or the topic generally (general facts). The user expresses engagement through a like button. Dialog acts are annotated in a separate crowd-sourced task. (The figure depicts the four stages of collection: (1) quiz completion, (2) relevant facts from Wikipedia shown to the assistant as general, aspect, and rooted facts, (3) human-human role-playing dialog creation, and (4) dialog act and like annotation. Its example exchange: User, "Could you tell me about Puerto Rico's history?" [request aspect]; Assistant, "It was a Spanish colony until 1898 when the U.S. acquired it as part of the Treaty of Paris." [inform response, liked].)

Although our dialog task methodology differs from typical Cranfield evaluations (§2.1.1), we share the same end goal of satisfying users. Section 3.3 uses dialog act annotations combined with explicit and implicit user feedback to compare assistants' content selection and presentation policies. For example, in interactions where the user asks a question and the assistant paraphrases a fact, how often does the user ask a followup question versus trail off in disinterest? Most datasets (Section 3.6) do not have enough annotations to answer these questions since doing so requires message-level dialog act annotations and feedback signals. We compare three assistant policies: using a fact with a rooted entity, a fact from the user's aspect, and a generic fact about the topic. The policies are compared through user "likes" of assistant messages and by the dialog act of their subsequent message (e.g., did they ask a specific followup or change topic).

While the focus of this work is on creating the Curiosity dataset and a methodology for measuring user interest, we also introduce models that attempt to mimic assistant policies. These models predict what type of message to send and which fact to use (§3.4), but do not generate dialog text. Since there are multiple facets to the assistant's actions, we jointly train models with a multi-task objective function.
We compare an end-to-end bert (Devlin et al., 2018) model to our task-specific Hierarchical Recurrent Encoder model (Serban et al., 2015) and show that our model (charm) improves over the baseline.

In summary, this chapter makes three main contributions: (1) we design an experiment to test the efficacy of personalizing conversational information systems through a user's prior knowledge, (2) we introduce the Curiosity dataset, the first dialog dataset combining sentence-level knowledge groundings, per-message ratings, and per-message dialog act annotations, allowing for robust and fine-grained structural learning of dialog policies for similar applications, and (3) we design a multi-task model that incorporates the user's prior knowledge and improves over a bert baseline.

3.2 Building the Curiosity Dataset

This section describes the construction of the Curiosity dataset. Dialog topics consist of prominent world geographic entities. The worldwide spread of entities makes each novel to most users, the consistent topic type makes starting dialogs easier, and their rich histories, demographics, and economics add topical diversity. For example, most people are only vaguely familiar with the history of Puerto Rico, but most know about related concepts such as the United States or Hurricane Maria. Section 3.2.1 describes how we select geographic topics and aspects and derive a set of facts to ground against. We collected the dataset in two steps: (1) collecting dialogs with a custom interface (Section 3.2.2) and (2) after-the-fact dialog act annotation (Section 3.2.3). Sample dialogs from the Curiosity dataset are in Appendix A.3.

3.2.1 Choosing the Geographic Topics, Aspects, and Facts for the Dataset

We select 361 geographic pages from Wikipedia that have separate geography and history pages (e.g., Puerto Rico, Geography of Puerto Rico, and History of Puerto Rico).3 3The existence of a page implies that the topic has ample historical and geographical knowledge to draw from. We use sentences from each page to build a set of 93,845 facts. We run an entity linker over the content (Gupta et al., 2017) and index each fact by its source page (topic), source section (aspect), and mentioned entities. Finally, we fit a tf-idf text matcher (Rajaraman and Ullman, 2011) with Scikit-Learn (Pedregosa et al., 2011). While conversing, assistants are shown facts that are filtered by topic, aspect, or mentioned entities and ranked by textual similarity to the dialog, as sketched below.
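The following is a minimal sketch of this fact indexing and ranking step using scikit-learn's TfidfVectorizer; the toy facts and field names are illustrative rather than the project's actual code, and the real pipeline additionally runs an entity linker over each sentence.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Each fact is indexed by its source page (topic), section (aspect),
# and the entities found in it (toy values here).
facts = [
    {"topic": "Puerto Rico", "aspect": "History",
     "entities": {"United States", "Treaty of Paris"},
     "text": "Puerto Rico was ceded to the United States in 1898."},
    {"topic": "Puerto Rico", "aspect": "Geography",
     "entities": {"Caribbean"},
     "text": "The island sits in the northeastern Caribbean Sea."},
]

vectorizer = TfidfVectorizer().fit([f["text"] for f in facts])
fact_matrix = vectorizer.transform([f["text"] for f in facts])

def rank_facts(dialog_context: str):
    """Rank all facts by tf-idf similarity to the recent dialog text."""
    scores = cosine_similarity(vectorizer.transform([dialog_context]), fact_matrix)[0]
    return [(facts[i], float(scores[i])) for i in scores.argsort()[::-1]]

for fact, score in rank_facts("Could you tell me about its history with the United States?"):
    print(round(score, 3), fact["aspect"], fact["text"])
```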
3.2.2 User and Assistant Dialog Interfaces

To collect dialogs, we build user and assistant interfaces for annotators. The user's interface samples their prior knowledge of a topic, captures which assistant messages interest them, and manages the dialog context. The assistant's interface provides contextually relevant facts. Appendix A.1 has screenshots and details of each interface component.

Figure 3.3: In this example, the user is assigned to learn about Lesotho, specifically its culture and history. Given their topic, users indicate which related entities, ranging from relatively common (the United States) to lesser known (Basutoland), they are familiar with. The interface lists entities such as Pretoria, Sotho people, South Africa, Orange Free State, and the Book of Common Prayer, and asks users to check an entity if they could locate it on a map (geography) or accurately explain it (concept). We also provide guidelines and videos before crowd-workers start working on the task.

Sampling User's Prior Knowledge When deployed, digital assistants can draw from prior interactions (Raman et al., 2014) to estimate what a user knows. However, since we do not have these prior interactions, we collect information about what users know. Instead of exhaustively asking about every entity related to the topic, we sample this knowledge. Before the dialog begins, we show the user fifteen related entities that range from commonplace to obscure (United States versus Taíno). Users mark the entities they could (1) locate on a map or (2) summarize succinctly in one sentence (Figure 3.3).

Figure 3.4: The user expresses the "interestingness" of the assistant's messages through a "like" button (to the right of each message). This is one of the two ways that we estimate user satisfaction.

Like Button for User Interest As part of our collection, we aimed to determine which fact-grounded utterances users found interesting. Users were told to "like" the assistant's message if they found it "interesting, informative, and relevant to their topic" (Figure 3.4). These likes are an explicit means to estimate user satisfaction.

Assistant's Topic Summary and Fact Bank The worldwide spread of Curiosity's entities makes them unfamiliar to most crowd-workers, including the assistants. So that the assistant can still engage the user, the assistant interface provides contextually relevant information. First, the interface shows a topic summary from Wikipedia. Second, the assistant paraphrases facts from a contextually updated fact bank (box 2 in Figure 3.2 and Figure 3.5). To reduce information overload, we use simplified topic descriptions from SimpleWikipedia and show a maximum of nine facts at a time.4 4If a description exists in simple.wikipedia.org, we use that; otherwise, we use the description from en.wikipedia.org. We encourage assistants to "stimulate user interest and relate information to things they already know or have expressed interest in." Assistants are instructed to select relevant facts, click the "use" button, and paraphrase the content into their next utterance.

Like Dinan et al. (2019b), the fact bank selects facts for the assistant using tf-idf textual similarity (§2.4.4) to recent dialog turns, but it differs by incorporating the user's prior knowledge. We show the assistant nine facts: three facts that mention an entity familiar to the user (rooted facts), three facts from their assigned aspects (aspect facts), and three from anywhere on the page (general facts). By construction, rooted facts overlap with the exclusive categories of aspect and general facts. For each category, we find the nine highest-scoring facts by tf-idf and then randomize their order. To avoid biasing the assistant, we do not inform them about the user's known entities or distinguish between types of facts.
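To make the fact-bank policy concrete, here is a small sketch of the rooted/aspect/general selection; a word-overlap scorer stands in for the fitted tf-idf matcher, and the filtering logic and names are illustrative rather than the exact collection code.

```python
import random

facts = [
    {"aspect": "History",   "entities": {"United States"}, "text": "Ceded to the United States in 1898."},
    {"aspect": "History",   "entities": {"Spain"},         "text": "It was a Spanish colony for centuries."},
    {"aspect": "Geography", "entities": {"Caribbean"},     "text": "It lies in the northeastern Caribbean."},
]

def overlap_score(context, text):
    # Stand-in scorer: word overlap instead of the fitted tf-idf matcher.
    return len(set(context.lower().split()) & set(text.lower().split()))

def top_k(candidates, context, k=3):
    return sorted(candidates, key=lambda f: overlap_score(context, f["text"]), reverse=True)[:k]

def build_fact_bank(all_facts, context, aspect, known_entities):
    rooted = top_k([f for f in all_facts if f["entities"] & known_entities], context)
    aspect_facts = top_k([f for f in all_facts if f["aspect"] == aspect], context)
    general = top_k(all_facts, context)
    bank = rooted + aspect_facts + general
    random.shuffle(bank)  # the assistant is not told which facts are rooted
    return bank

bank = build_fact_bank(facts, "tell me about its history", "History", {"United States"})
print([f["text"] for f in bank])
```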
Figure 3.5: The assistant can incorporate any number of facts into their reply to the user. Their goal is to answer the user's immediate questions and anticipate what information the user would be most interested in.

3.2.3 Dialog Act Annotation

Inducing structure on conversations through dialog acts is helpful for analysis and downstream models (Tanaka et al., 2019). We introduce structure, beyond knowledge groundings, into Curiosity by annotating the dialog acts of each message. The dialog act annotation schema is based on iso 24617-2 (Bunt et al., 2010, 2012) with customized sub-categories for our scenario; Table 3.1 shows our schema, descriptions, and examples. The dialog acts are annotated in a separate collection using a custom annotation interface (Appendix A.2). In our case, knowledge groundings paired with dialog acts are helpful for identifying assistant policies (§3.3.1.2).

request topic (10,789): a request primarily about the topic, e.g., "I'd like to know about Puerto Rico."
request aspect (41,701): a request primarily about an aspect, e.g., "Could you tell me about its history?"
request followup (4,463): a request about a mentioned concept, e.g., "Do you know more about the Taínos?"
request other (10,077): a request about an unmentioned concept, e.g., "What is there to know about cuisine?"
inform response (59,269): directly answer an information request, e.g., "Taínos were Caribbean indigenous."
inform related (6,981): not a direct answer, but related information, e.g., "I do not know, but..."
inform unrelated (557): does not answer the question and is not related, e.g., "Politics is tiring!"
feedback positive (26,946): provide positive feedback, e.g., "Thats quite interesting!"
feedback negative (176): provide negative feedback, e.g., "Thats pretty boring."
feedback ask (36): ask for feedback, e.g., "Do you find <info> interesting?"
offer topic (91): offer to discuss the topic, e.g., "Want to learn about Puerto Rico?"
offer aspect (1,440): offer to discuss an aspect, e.g., "How about more on its demographics?"
offer followup (63): offer to discuss a mentioned concept, e.g., "I could say more about the Spanish."
offer other (1,619): offer to discuss an unmentioned concept, e.g., "How about I tell you about its exports."
offer accept (1,727): accept an offer of information, e.g., "I'd love to learn about its history."
offer decline (405): decline an offer of information, e.g., "Sorry, I'm not interested in that."
Table 3.1: Counts, descriptions, and examples of the dataset's dialog acts.

3.2.4 Validating Data Quality

We crowd-sourced conversations in two phases using parlai (Miller et al., 2017). In the first, pilot studies collect feedback from individual workers. Based on feedback, we create task guidelines, sample dialogs, a faq, tutorial videos, and qualification tests. These materials were used to train and qualify crowd-workers for the second phase. During the second phase, we monitored interface usage and removed workers who ignored instructions.

To validate the quality of dialog act annotations, we use Krippendorff's α (Krippendorff, 2004). Dialog acts are multi-class and multi-label: a message can have none, one, or multiple dialog acts (e.g., positive feedback and followup). However, Krippendorff's α is typically computed for single-label tasks from a table where rows represent examples, columns represent annotators, and cells indicate the singular class label. We convert our multi-label problem to a single-label problem by making each combination of example and label class a row in the table (Table 3.2). Since there are few dialog acts per utterance, most annotations agree; however, since Krippendorff's α focuses on disagreement, it is appropriate for this scenario. Using a separate interface (Appendix A.2), we doubly annotate 4,408 dialogs, and the agreement score of 0.834 is higher than the 0.8 threshold recommended by Krippendorff (2004).

Annotator 1 / Annotator 2
Utterance 1, Label A: Yes / No
Utterance 1, Label B: Yes / No
Utterance 2, Label A: Yes / Yes
Utterance 2, Label B: Yes / Yes
Table 3.2: Consider a task where each utterance has labels A and B. In the single-label version, each utterance is labeled as either A or B. The table shows the outcome of converting the multi-label version to single-label by creating a row for each example-label combination. Cell values are binary indicators.
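The conversion in Table 3.2 and the resulting agreement score can be sketched as follows; this assumes the third-party krippendorff Python package, and the toy annotations are illustrative only.

```python
import krippendorff  # third-party package: pip install krippendorff

labels = ["request_topic", "request_aspect", "feedback_positive"]

# Multi-label annotations: each annotator assigns a *set* of dialog acts per utterance.
annotations = [
    {"u1": {"request_topic"},                      "u2": {"request_aspect", "feedback_positive"}},
    {"u1": {"request_topic", "feedback_positive"}, "u2": {"request_aspect", "feedback_positive"}},
]

# Convert to single-label units: one row per annotator, one column per
# (utterance, label) combination, with a binary value per cell (as in Table 3.2).
utterances = sorted(annotations[0])
units = [(u, lab) for u in utterances for lab in labels]
reliability_data = [
    [1 if lab in annotator[u] else 0 for (u, lab) in units]
    for annotator in annotations
]

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="nominal")
print(round(alpha, 3))
```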
Next, we analyze the annotated dialogs and introduce our model.

3.3 Dataset Analysis

Next, we show that users prefer aspect-specific, rooted facts and then report general statistics of the Curiosity dataset.

3.3.1 What Facts Do Users Prefer?

Earlier, we hypothesized that when assistants use facts that mention previously known entities (rooted facts), users will be more likely to engage (§3.1). In our data collection, we incorporate two mechanisms to test this hypothesis. The first mechanism is explicit: we directly ask users, through a like button, to indicate which messages they preferred. The second mechanism is implicit and derived by mining dialogs for specific sequences of dialog acts that suggest engagement with the content. For each of these mechanisms, we compute the likelihood P(Prefer | Fact Source) of a user preferring utterances grounded to each fact source (rooted, aspect, or general). Figure 3.6 shows this likelihood and indicates that users prefer (1) facts relevant to aspects over general ones and (2) rooted facts in three of four scenarios.

Figure 3.6: User engagement is measured by dialog act followups (left) and like button usage (right); the y-axis shows the proportion of messages preferred. We compare reactions to messages that use a fact mentioning an entity the user knew about (rooted) and whether the fact is general or aspect-specific. Pairwise differences are statistically significant (99%+) with a two-proportion z-test, except for dialog act followups between rooted and non-rooted general facts. Overall, users prefer on-aspect, rooted facts.

3.3.1.1 Likes for Explicit Preference Elicitation

Explicit preference is computed directly from like button usage and shown in the right panel of Figure 3.6. Overall, users liked 60% of messages, and they prefer on-aspect, rooted facts.

3.3.1.2 Mining Acts for Implicit Preferences

When users ask specific followup questions, as opposed to generic ones, about an assistant's fact, it shows that the user implicitly prefers these kinds of messages. For example, asking about an entity like the Taínos is more specific than asking about history and therefore indicates engagement. We identify these interactions by mining for assistant-user message pairs where the assistant uses a fact and their message has an "inform" dialog act. With these, we compute the likelihood P(Outcome = request followup | Fact Source) that the user's message has the "request followup" dialog act given the fact source. Similarly to likes, users engage more with aspect-oriented and rooted facts.

3.3.1.3 Paraphrase Analysis

Although our work does not include a paraphrase model, we manually analyze a random sample of two hundred and fifty assistant messages where facts were used (Table 3.3). Of these messages, 51% were acceptable paraphrases, 27% were verbatim copies, 12% were contextualizations of near copies, and the remainder were errors, such as incorrect paraphrases, or did not incorporate the fact. Appendix A.4 shows descriptions, counts, and random examples of each category. This analysis estimates that about half of grounded messages have non-trivial signal for future paraphrase models to use.
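Returning to the implicit signal from §3.3.1.2, the mining step reduces to counting dialog-act transitions. The sketch below assumes each assistant message carries its fact source and each following user message carries its annotated acts; the toy records and field names are illustrative.

```python
from collections import Counter

# Toy (assistant message, next user message) pairs: the fact source used by the
# assistant and the dialog acts of the user's reply.
pairs = [
    {"fact_source": "rooted_aspect", "next_user_acts": {"request_followup"}},
    {"fact_source": "rooted_aspect", "next_user_acts": {"feedback_positive"}},
    {"fact_source": "general",       "next_user_acts": {"request_aspect"}},
    {"fact_source": "general",       "next_user_acts": {"request_followup"}},
]

totals, followups = Counter(), Counter()
for pair in pairs:
    totals[pair["fact_source"]] += 1
    if "request_followup" in pair["next_user_acts"]:
        followups[pair["fact_source"]] += 1

# P(Outcome = request_followup | Fact Source)
for source in totals:
    print(source, round(followups[source] / totals[source], 2))
```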
Appendix A.4 shows descriptions, counts, and random examples of each category. This analysis estimates that about half of grounded messages have non-trivial signal for future paraphrase models to use.

3.3.2 Dataset Statistics

Table 3.4 shows the basic statistics of the Curiosity dataset, which contains 14,048 dialogs with 181,068 utterances. The fact database contains 93,845 facts; of those, 76,120 (81%) were shown to the assistants and 27,486 (29%) were used in at least one message. We randomly split dialogs for training, validation, and testing.

3.4 Models

We design a machine learning model that predicts assistant and user actions. We introduce a multi-task architecture for Curiosity that Hierarchically Models (charm, Figure 3.7) dialogs to: (1) predict the dialog acts of the user message (utterance act prediction), (2) select the best fact (fact prediction), (3) choose the best set of dialog acts for the next message (policy act prediction), and (4) predict if the assistant message will be liked (like prediction).

  Category     Label                 Count   Percent
  Copy         verbatim              68      27.2%
  Copy         cherry-pick           6       2.4%
  Copy         context               30      12.0%
  Copy         Total                 104     41.6%
  Paraphrase   paraphrase-correct    111     44.4%
  Paraphrase   paraphrase-multiple   17      6.8%
  Paraphrase   Total                 128     51.2%
  Error        paraphrase-error      5       2.0%
  Unrelated    unrelated             13      5.2%
               Total                 250     100%

Table 3.3: We analyze the paraphrases annotators use through manual categorization. The "Copy" category includes cherry-picked verbatim phrases, verbatim copies, and contextualized copies (e.g., changing a named entity to "it"). The majority of paraphrases are correct and only incorporate the provided fact, but a few weave in other information. 7.2% of paraphrases are either unrelated to the selected facts or paraphrase the fact incorrectly. Overall, 51.2% of messages have valid paraphrases.

3.4.1 Text Representation

charm jointly encodes the text of utterances and facts with one encoder enc. Following our prior encoder notation (§2.4.3), enc is a bi-directional lstm (Sutskever et al., 2014) over concatenated glove (Pennington et al., 2014) word embeddings and Wikipedia2Vec (Yamada et al., 2018a) entity embeddings.[5] The text t_{u_i} of utterance u_i in dialog D is represented as enc(t_{u_i}). Similarly, fact f_j on turn i is represented as enc(t_{i,j}^f), where j indexes facts shown on that turn.

[5] In charm, bert was not as effective an encoder.

3.4.2 Dialog Representation

In our models, we use a hierarchical recurrent encoder (hre) architecture (Sordoni et al., 2015; Serban et al., 2015) where a forward lstm contextualizes each utterance to the full dialog. We modify the hre model by adding additional inputs beyond the utterance's textual representation. First, we represent the user's known entities as the average of entity embeddings

    k = avg(enc_entity(e_1), ..., enc_entity(e_k)).    (3.1)

An entity embedding also represents the topic of the dialog,

    t = enc_entity(topic).    (3.2)

  Metric (# of)   Total     Train     Val      Test     Zero
  Dialogs         14,048    10,287    1,287    1,287    1,187
  Utterances      181,068   131,394   17,186   17,187   15,301
  Likes           57,607    41,015    5,928    5,846    4,818
  Topics          361       331       318      316      30
  Facts Total     93,845    NA        NA       NA       NA
  Facts Shown     76,120    66,913    29,785   30,162   6,043
  Facts Used      27,486    21,669    4,950    4,952    2,290

Table 3.4: Curiosity has 14,048 dialogs. On average, dialogs have 12.9 utterances. 60% of the assistants' 90,534 utterances were liked.
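To make the shared encoder concrete, the following is a minimal PyTorch sketch of enc from Section 3.4.1. It is an illustration rather than the dissertation's AllenNLP implementation: the hidden size and mean pooling are assumptions, and the pretrained GloVe and Wikipedia2Vec matrices are taken as given.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of enc: a BiLSTM over concatenated word and entity embeddings."""

    def __init__(self, word_vectors, entity_vectors, hidden_size=256):
        super().__init__()
        # Pretrained GloVe word vectors and Wikipedia2Vec entity vectors.
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=True)
        self.ent_emb = nn.Embedding.from_pretrained(entity_vectors, freeze=True)
        d_in = word_vectors.size(1) + entity_vectors.size(1)
        self.bilstm = nn.LSTM(d_in, hidden_size,
                              bidirectional=True, batch_first=True)

    def forward(self, word_ids, entity_ids):
        # (batch, tokens, d_word + d_entity) concatenated token representations.
        x = torch.cat([self.word_emb(word_ids), self.ent_emb(entity_ids)], dim=-1)
        out, _ = self.bilstm(x)          # (batch, tokens, 2 * hidden_size)
        return out.mean(dim=1)           # one vector per utterance or fact
```

Because utterances and facts share this encoder and its parameters, the same module produces both enc(t_{u_i}) and enc(t_{i,j}^f).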
Figure 3.7: [Model diagram: per-turn inputs (utterance text via a BiLSTM/BERT text encoder, focus and rooted facts, speaker, and dialog acts) are concatenated, contextualized by a hierarchical recurrent encoder, and fed to jointly optimized multi-task decoders for policy act prediction, utterance act prediction, like prediction, and per-fact scoring; a contextual paraphrase decoder is left as future work.] Architecture: charm builds a dialog context up to t = i - 1 to predict the current message's dialog acts (policy prediction) and the best facts to use. The model uses this combined with the current utterance to classify its dialog acts and whether it will be liked.

Next, we create trained speaker embeddings v_s for the user and v_t for the assistant. Given the set of all dialog acts A, each utterance has a set of dialog acts A_u ∈ P(A), where P(X) denotes the set of all subsets of X. Finally, we use an act embedder act to compute an act representation

    a^{(i)} = (1 / |A_u|) Σ_{a_k ∈ A_u} act(a_k)    (3.3)

by averaging embeddings at each turn. The input at each step is the concatenation

    c^{(i)} = [enc(t_{u_i}); a^{(i)}; t; k; v]    (3.4)

of the representations for text, speaker, topic, known entities, and utterance dialog acts.[6] With this joint representation, the contextualized dialog representation

    h^{(i-1)} = lstm(c^{(1)}, ..., c^{(i-1)})    (3.5)

is the final lstm state and includes time step t = i - 1. The dialog up to and including time i is

    d^{(i)} = [h^{(i-1)}; c^{(i)}],    (3.6)

which emphasizes the current utterance and makes multi-task training straightforward to implement.

[6] The speaker embedding v alternates between v_s and v_t.

3.4.3 Tasks and Loss Functions

In our model, we jointly learn to predict fact usage, user likes, utterance acts, and policy acts.

Fact Prediction For every assistant turn, the model predicts which fact(s) from {f_1, ..., f_k} = F^{(i)}, F^{(i)} ∈ P(F), the assistant marked as "used," where F is the set of all facts. We frame this task as pointwise learning to rank (Li et al., 2008). A fact prediction network

    s_j^{f,(i)} = gelu(W^f · [h^{(i-1)}; enc(t_j^f)] + b^f)    (3.7)

with parameters W^f and b^f using a Gaussian Error Linear Unit (Hendrycks and Gimpel, 2017b) outputs salience scores for each fact. The network does not use utterance u_i since it contains signal from the choice of fact. The predictions

    ŷ_j^{f,(i)} = softmax(s_j^{f,(i)})    (3.8)

are converted to probabilities by the softmax

    softmax(q) = exp(q) / Σ_{j=1}^{k} exp(q_j)    (3.9)

over k labels. Using this, we compute the fact loss

    L_f = (1 / |F^{(i)}|) Σ_{i,j} ℓ_ce(ŷ_{i,j}^f, y_{i,j}),    (3.10)

where labels y_j^{f,(i)} indicate if the fact from utterance i in position j was used and

    ℓ_ce(ŷ, y) = -Σ_{p=1}^{k} y_p log(ŷ_p)    (3.11)

is the cross entropy loss. To mitigate class imbalance, we also scale positive classes by nine (Japkowicz and Stephen, 2002).

Policy Act and Utterance Act Prediction Each utterance may have multiple dialog acts, so we treat policy and utterance act prediction as a multi-label task. The goal of policy prediction is to choose the best act for the next utterance; the utterance act classifies the last message's acts. To predict these acts, we create a policy act network

    s^{p,(i)} = gelu(W^p · h^{(i-1)} + b^p)    (3.12)

and an utterance act network

    s^{u,(i)} = gelu(W^u · d^{(i)} + b^u),    (3.13)

where the probability of act a_k is p_k^{·,(i)} = σ(s_k^{·,(i)}).
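The following PyTorch sketch ties Equations 3.3-3.7 and 3.12-3.13 together. It is a simplified illustration, not the dissertation's AllenNLP implementation: dimensions are placeholders, the per-turn inputs c^{(i)} and the candidate fact encodings are assumed to be precomputed (e.g., by the encoder sketched above), and batching and padding details are omitted.

```python
import torch
import torch.nn as nn

class CharmStep(nn.Module):
    """Sketch of Equations 3.3-3.7 and 3.12-3.13: a dialog LSTM over per-turn
    inputs c^(i), plus gelu score heads for facts, policy acts, and utterance
    acts. Sizes are illustrative, not the dissertation's settings."""

    def __init__(self, d_input, d_text, d_hidden, n_acts):
        super().__init__()
        self.dialog_lstm = nn.LSTM(d_input, d_hidden, batch_first=True)
        self.fact_head = nn.Linear(d_hidden + d_text, 1)        # Eq. 3.7
        self.policy_head = nn.Linear(d_hidden, n_acts)          # Eq. 3.12
        self.utt_head = nn.Linear(d_hidden + d_input, n_acts)   # Eq. 3.13
        self.gelu = nn.GELU()

    def forward(self, c, fact_encs):
        # c: (batch, turns, d_input), concatenations [enc(t); a; t; k; v] (Eq. 3.4)
        # fact_encs: (batch, n_facts, d_text), enc() of the candidate facts.
        h, _ = self.dialog_lstm(c)              # hidden state per turn (Eq. 3.5)
        h_prev = h[:, -2]                       # context up to i-1 (needs >= 2 turns)
        d_i = torch.cat([h_prev, c[:, -1]], dim=-1)  # d^(i) = [h^(i-1); c^(i)] (Eq. 3.6)

        expanded = h_prev.unsqueeze(1).expand(-1, fact_encs.size(1), -1)
        fact_scores = self.gelu(
            self.fact_head(torch.cat([expanded, fact_encs], dim=-1))).squeeze(-1)
        policy_scores = self.gelu(self.policy_head(h_prev))     # next-turn acts
        utt_scores = self.gelu(self.utt_head(d_i))              # current-turn acts
        return fact_scores, policy_scores, utt_scores
```

In training, the fact scores feed the pointwise ranking loss (Equation 3.10), while the sigmoid of the act scores feeds the multi-label losses defined next.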
From these act scores, we derive the policy act loss

    L_p = -Σ_{k}^{|A|} [ y_{a_i,k} log p_k^{p,(i)} + (1 - y_{a_i,k}) log(1 - p_k^{p,(i)}) ]    (3.14)

and utterance act loss

    L_u = -Σ_{k}^{|A|} [ y_{a_i,k} log p_k^{u,(i)} + (1 - y_{a_i,k}) log(1 - p_k^{u,(i)}) ]    (3.15)

for an utterance at t = i with act labels y_{a_i,k}.

Like Prediction For every assistant message, the model predicts the likelihood of the user "liking" the message. We treat this as binary classification, predict the "like" likelihood

    ŷ_i^l = softmax(gelu(W^l · h^{(i)} + b^l)),    (3.16)

and use it to compute the like loss

    L_l = ℓ_ce(ŷ_i^l, y_i^l),    (3.17)

where y_i^l indicates if the message was liked. We train the model jointly and optimize the loss

    L = L_f + L_l + L_p + L_u.    (3.18)

3.4.4 Model Implementation and Training Details

We implement all models with PyTorch (Paszke et al., 2019) and Allennlp (Gardner et al., 2018). The learning rates for models are set using the built-in learning rate finder in Allennlp. Model losses were optimized with Adam (Kingma and Ba, 2015); the bert model uses a learning rate of 0.0001 and charm a learning rate of 0.001 with otherwise default parameters. We train for a maximum of forty epochs and early stop based on the sum of validation losses. The charm model uses batch size 64 and the bert model batch size 4. In our models, text encoders for utterances and facts share parameters. Additional hardware and runtime details are in Appendix A.6.

                   Fact Rank (mrr)   Utt. Act (F1)   Policy Act (F1)   Like (Accuracy)
  Model            Val     Test      Val     Test    Val     Test      Val     Test
  Majority Class   n/a     n/a       0.602   0.604   0.491   0.494     0.690   0.681
  e2e bert         0.420   0.418     0.794   0.795   0.635   0.631     0.829   0.822
  charm            0.546   0.546     0.845   0.847   0.682   0.682     0.826   0.815
  - context        0.516   0.506     0.838   0.842   0.664   0.664     0.824   0.820

Table 3.5: The charm model outperforms end-to-end bert on most tasks. We compare fact selection with mrr, dialog act prediction with micro-averaged F1, and like prediction with accuracy. Ablating dialog history degrades context-dependent tasks (fact selection and policy act prediction), but not tasks more dependent on one message.

3.5 Modeling Experiments

charm improves over a bert model in most tasks. We evaluate each subtask with separate metrics. Fact selection is evaluated with mean reciprocal rank (mrr). For utterances with at least one selected fact, we compute the mrr using the selected facts as relevant documents. We compare like prediction with binary classification accuracy. For utterance and policy act prediction, we compare models with micro-averaged F1 scores so that frequent classes are weighted more heavily. For each metric, we report validation and test set scores.

3.5.1 Baselines

bert (Devlin et al., 2018) is a standard baseline for many nlp tasks. We use a multi-task extension of an uncased bert model as our primary baseline and fine-tune it for our unique set of tasks (e2e bert). Specifically, we use the cls representation of each utterance to replace the hre representation as a time-distributed input to the same multi-task decoders (Section 3.4.3). The context-less charm ablation replaces the dialog contextualizer lstm with a per-timestep projection layer. Lastly, we report majority class accuracy for classification tasks.

3.5.2 Discussion

The proposed charm model for conversational curiosity is more effective than e2e bert for most of the tasks in Curiosity (Table 3.5). Specifically, charm improves significantly in fact prediction (13 mrr points) and both dialog act prediction tasks (5 F1 points), demonstrating the efficacy of the structural encoding of the various input modalities.
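As an aside on the metrics reported in Table 3.5, here is a minimal sketch of the mean reciprocal rank computation for fact selection. It is illustrative only: taking the best-ranked used fact when a turn has several is an assumption, and the real evaluation runs over the dataset's per-turn candidate fact lists.

```python
def mean_reciprocal_rank(fact_scores, relevant):
    """MRR for fact selection: for each assistant turn with at least one used
    fact, rank the candidate facts by model score and take the reciprocal rank
    of the highest-ranked used ("relevant") fact."""
    reciprocal_ranks = []
    for scores, gold in zip(fact_scores, relevant):
        if not gold:                      # skip turns with no selected fact
            continue
        ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
        best_rank = min(ranked.index(j) for j in gold) + 1
        reciprocal_ranks.append(1.0 / best_rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example: the first turn's used fact is ranked first, the second turn's is
# ranked second, so MRR = (1.0 + 0.5) / 2 = 0.75.
print(mean_reciprocal_rank([[0.9, 0.1, 0.3], [0.2, 0.8, 0.5]], [{0}, {2}]))
```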
Generally, models accurately predict utterance acts and likes, but their mrr and F1 scores on fact selection and policy act prediction are comparatively worse. To a degree, this is expected since there is not always one best fact or one best action to take as the assistant; there may be various reasonable choices, which is common in information retrieval tasks. Nonetheless, models that specifically factor in the relationship between prior knowledge and entities should yield improvements. For example, Liu et al. (2018) predict the most relevant unmentioned entity while Lian et al. (2019) model a posterior distribution over knowledge. We leave these improvements to future work.

3.6 Related Work

Our work builds on knowledge-grounded conversational datasets and modeling.

Datasets Although there are numerous grounded datasets, we did not find one for conversational information seeking that contained fine-grained knowledge groundings, message-level feedback from the user, and dialog acts. Table 3.6 compares the Curiosity dataset to several others according to six factors: (1) is the goal of the task information seeking, (2) is the dataset collected from natural dialog with one participant always taking the role of an assistant, (3) are dialog responses constrained, (4) are document groundings annotated (as opposed to distantly supervised) and fine-grained, (5) is there message-level feedback for the assistant, and (6) is the dataset annotated with dialog acts.

Our dataset is most similar to those for information-seeking such as quac (Choi et al., 2018), Wizard of Wikipedia (Dinan et al., 2019b, wow), cmu dog (Zhou et al., 2018b), ms marco (Nguyen et al., 2016), Topical Chat (Gopalakrishnan et al., 2019), the trec Conversational Assistance track (Dalton et al., 2019, cast), and Search as a Conversation (Ren et al., 2020, saac). quac constrains assistant responses to spans from Wikipedia, which makes it better for conversational question answering, but prevents more sophisticated assistant policies. quac also provides dialog acts, but they exist so that the assistant can inform the user of valid actions; we annotate dialog acts after-the-fact so that we can compare freely chosen user responses. Like quac, Topical Chat, saac, and wow have annotated knowledge-groundings for each message, but responses are free-form. saac is a contemporaneous, cast-based dataset that shares our motivation to make conversation a medium for information-seeking. Topical Chat includes user feedback, but instead of explicitly defined roles, workers implicitly take dual and alternating roles as the user and assistant through knowledge asymmetry; followup work added automatically annotated dialog acts to Topical Chat (Hedayatnia et al., 2020).

Many tasks instruct annotators to take on a specific role in the dialog. For example, in Wizard of Wikipedia, annotators assume an assigned persona (Zhang et al., 2018a) in addition to being the user or assistant. Consequently, many dialogs revolve around personal discussions instead of teaching about a topic. Additionally, annotators may not have the background to play their role. In contrast, we ask annotators to take roles that, as humans, they already know how to do: read about and convey interesting information on a topic (assistant) and engage in inquiry about a novel topic (user).

Our work is one of many in knowledge-grounded conversational datasets. For example, Moghe et al.
(2018) have workers discuss movies and ground messages to plot descriptions, reviews, comments, and factoids; however, one worker plays both roles. In opendialkg (Moon et al., 2019), annotators ground messages by path-finding through Freebase (Bast et al., 2014) while discussing and recommending movies, books, sports, and music. Qin et al. (2019) use Reddit discussion threads as conversations and ground to web pages. Similarly, Ghazvininejad et al. (2018) collect Twitter three-turn threads and ground to Foursquare restaurant reviews. Our work adds to this dataset compendium.

  Dataset                                            Info     Dialog        Free       Annotated   Message    Dialog
                                                     Seeking  w/ Assistant  Response   Grounding   Feedback   Acts
  Curiosity (ours)                                   ✓        ✓             ✓          ✓           ✓          ✓
  Topical Chat (Gopalakrishnan et al., 2019)         ✓        ~             ✓          ✓           ✓          ~
  Search as a Conversation (Ren et al., 2020)        ✓        ✓             ✓          ✓           ✗          ✗
  Wizard of Wikipedia (Dinan et al., 2019b)          ✓        ✓             ✓          ✓           ✗          ✗
  quac (Choi et al., 2018)                           ✓        ✓             ✗          ✓           ✗          ~
  cmu dog (Zhou et al., 2018b)                       ✓        ✓             ✓          ~           ✗          ✗
  ms marco Conversational (Nguyen et al., 2016)      ✓        ✗             n/a        n/a         n/a        n/a
  opendialkg (Moon et al., 2019)                     ✗        ✓             ✓          ✓           ✗          ✗
  coqa (Reddy et al., 2019)                          ✗        ✓             ~          ✓           ✗          ✗
  Holl-E (Moghe et al., 2018)                        ✗        ~             ✓          ✓           ✗          ✗
  Commonsense (Zhou et al., 2018a)                   ✗        ✗             ✓          ✗           ✗          ✗
  Reddit+Wiki (Qin et al., 2019)                     ✗        ✗             ✓          ✗           ✗          ✗

Table 3.6: ✓ indicates a dataset has the feature, ~ that it does with a caveat, and ✗ that it does not. Conversational ms marco is a search dataset but has inquiry chains we want assistants to induce (exemplar in Appendix A.7). Topical Chat and Search as a Conversation are motivationally similar. While our dataset's combination of (human) annotation is unique, all three datasets are steps forward in resources for conversational information-seeking.

External Knowledge in Models Our model is related to those that incorporate external information like facts in question answering (Weston et al., 2015; Sukhbaatar et al., 2015; Miller et al., 2016a), knowledge base triples in dialog models (Han et al., 2015; He et al., 2017; Parthasarathi and Pineau, 2018), common sense (Young et al., 2017; Zhou et al., 2018a), or task-specific knowledge (Eric and Manning, 2017). Similarly to Kalchbrenner and Blunsom (2013) and Khanpour et al. (2016), charm predicts the act of the current message, but also the next message's act, like Tanaka et al. (2019) do.

3.7 Future Work and Conclusion

This chapter primarily introduces Curiosity: a large-scale conversational information-seeking dataset. Like much work in ir and the Cranfield paradigm, the primary goal of this dataset is to help build models that better satisfy users, in this case in the context of curiosity-driven dialog. Aside from building the Curiosity dataset, this work also investigates an alternative, dialog-act-based method for measuring how often users meaningfully engage with assistant messages, which we take as a proxy for their satisfaction with how interesting the assistant is. With Curiosity's unique set of annotations, we design charm, which jointly learns to choose facts, predict a policy for the next message, classify dialog acts of messages, and predict if a message will be liked. We hope that the Curiosity dataset and contemporaries will encourage progress and interest in curiosity-driven dialog.

Since we later discuss avenues for more ambitious future work (Chapter 7), here we discuss three immediate directions for work using the Curiosity dataset. The first is to augment our charm model with a text generation module to make a digital version of our human assistants.
This involves contextualizing and paraphrasing facts, which our dataset supports. Second, dialog act sequences could identify additional data-driven policies that could be used to define rewards or losses. By conditioning on dialog acts or sequences of dialog acts, textual outputs could be better controlled (Sankar and Ravi, 2019; See et al., 2019) and combined with knowledge grounding (Hedayatnia et al., 2020).

However, text is not the native modality of digital assistants. We envision digital assistants participating in information-seeking, which means also handling speech input. Consequently, automatic speech recognition (asr) introduces transcription errors which are especially prevalent in knowledge-oriented text like question answering (Peskov et al., 2019). Gopalakrishnan et al. (2020) show this is also problematic in information-seeking dialog by comparing models on textual and asr versions of Topical Chat. To close the loop in conversational information-seeking, models need to account for the speech-based environment of digital assistants.

Lastly, for the focus we give to dialog act mining, there is significant room for improvement in how we evaluate models that predict these acts. There are two drawbacks to the current evaluation approach. First, although the dialog act of interest to us is relatively rare ("request followup"), the dialog act evaluations use micro-averaged F1 scores, which discount the importance of these rare classes.[7] Second, these rare classes are more challenging to detect than some of the most common classes like "request topic." That is, a priori, we expect that the most frequent classes are not only the easiest for a model to predict,[8] but are the ones of least interest to us. Ideally, we would like to have an evaluation methodology that infers which types of examples are difficult and rewards models appropriately. Having seen the Curiosity dataset as an example of the Cranfield paradigm, the next chapter introduces a solution to this particular challenge, which is one shared across many qa tasks.

[7] At best, macro-averaging F1 would make the class weights equal, but also scale by the number of distinct classes.
[8] Even ignoring that this class has more training data, we still maintain it is intrinsically more difficult to detect.

Chapter 4: Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards?

A successful evaluation requires a task that is neither too easy nor too difficult for the current technology. If the task is too simple, all systems do very well and nothing is learned. Similarly, if the task is too difficult, all systems do very poorly and again nothing is learned.
Ellen M. Voorhees, The trec-8 qa Track Report

Before shifting focus to a Manchester paradigm task (Chapter 5), we first turn our attention to the challenge of evaluating tasks for which the test examples are not equally difficult.[1] In Chapter 3, one example of this problem is evaluating how well a dialog act classifier works: the most difficult class is also a rare class and therefore factors least into aggregate metrics. Taking a wider view, the main problem is that standard evaluation schemes weigh examples equally, but a cursory examination shows this assumption to be flawed. Intuitively, any student could tell you that exam questions are not equally difficult; this is even trivially verifiable in nlp datasets: poorly annotated examples, which are present to some degree in many datasets, tend to be either too difficult or too easy.

[1] This chapter is based on the acl publication Rodriguez et al. (2021).
This applies equally well to most qa evaluations since they use metrics like mean accuracy or mean lexical overlap that weigh each question equally. Rather than re-invent the wheel, to address this problem we take inspiration from work in a field that concerns itself with faithfully and reliably measuring skills using exams: educational testing. Ultimately, this chapter (1) introduces a method for weighting examples appropriately and (2) argues we should maximize the ability of qa evaluations to discriminate between better and worse models. Chapters 5 and 6 subsequently introduce methods for improving the discriminative power of qa evaluations through better formats and data. To test our method, this chapter takes a critical look at the dominant evaluation methodology in qa: the leaderboard.

4.1 Leaderboards are Shiny

Leaderboard evaluations are, for better or worse, the de facto standard for measuring progress in question answering (Rajpurkar et al., 2016) and in many nlp tasks (Wang et al., 2019a).

Figure 4.1: [Scatter plot of evaluation items by difficulty (x-axis) and discriminability (y-axis), with marginal histograms and a feasibility panel; regions are labeled Easy, Discriminative, Discriminative and Hard, and Annotation Error.] Difficulty and Ability Discriminating (dad) leaderboards infer the difficulty, discriminativeness, and feasibility of examples. Negative discriminability suggests an annotation error; for example, the question with most negative discriminability asks "Why did demand for rentals decrease?" when the answer is "demand for higher quality housing increased."

An unfortunate side effect of leaderboard popularity is sota-chasing, often at the expense of carefully inspecting data and models (Linzen, 2020). For example, the same models that achieve "super-human" question answering (Najberg, 2018) often fail spectacularly (Feng et al., 2018; Wallace et al., 2019a) by learning non-generalizable statistical patterns (McCoy et al., 2019; Niven and Kao, 2019); i.e., these are the models that Levesque (2014) refers to as "idiot-savants" (§1.1). Finally, focusing solely on metrics conflates progress on a specific task with progress on real-world nlp problems that the task stands in for (Bender and Koller, 2020). Plainly, focusing on headline sota numbers "provide(s) limited value for scientific progress absent insight into what drives them" and where they fail (Lipton and Steinhardt, 2019).

In this chapter we take leaderboards "as they are," and imagine how they might better support research. Leaderboards establish differences between models on a fixed task. Hence, leaderboards should enable and encourage the comparison of models and inspection of examples. And leaderboards should also reveal when they have outlived their usefulness (Boyd-Graber and Börschinger, 2020).

Figure 4.2: [Plate diagram linking items, subjects, and responses.] A dad leaderboard uses irt to jointly infer item difficulty β_i, discriminability γ_i, feasibility λ_i, and subject skill θ_j. These predict the likelihood p_ij(r_ij = 1) of a correct response r_ij.

4.1.1 How to Direct Leaderboards' Light

To help focus attention on examples and models of interest, we propose Difficulty and Ability Discriminating (dad) leaderboards that explicitly model both task and submissions jointly, rather than either in isolation.
dad's underlying model is based on Item Response Theory (Lord et al., 1968; Baker, 2001, irt, reviewed in §4.2), a widely used (van Rijn et al., 2016) alternative in educational testing to simple summary statistics (Edgeworth, 1888). dad can explicitly identify the difficulty and discriminability of items (Figure 4.1),[2] which in turn can lead to a more nuanced ranking of models, identifying poor items, and better understanding of a dataset and task. Throughout this chapter, we use the question answering (qa) benchmark squad 2.0 (Rajpurkar et al., 2018). For example, dad can identify questions that are challenging to models and questions that are wrong (incorrectly annotated). In addition to better understanding datasets, it is also helpful for efficiently selecting evaluation data to annotate. We conclude the chapter with recommendations for future leaderboards (§4.7) and discuss where irt in nlp can go next (§4.9).

[2] Example and feasibility distribution in Appendix B.1.

4.2 A Generative Story for Leaderboards

Much as questions are the product of their inputs (§2.2.3), leaderboards are a product of the metrics, evaluation data, and subjects (machine or human) who answer items (Figure 4.2). For concreteness, let's assume that we have a question answering task and two subjects: Ken, who is good at trivia, and Burt, who is not. In the simplest irt models, each subject j has a random variable θ_j corresponding to their skill: Ken's is big, Burt's is small.

Figure 4.3: [Item characteristic curves for discriminability γ = 0.5, 1, and 2 with feasibility λ = 0.95 and difficulty β = 0, plotting P(response = correct | θ) against skill θ.] In irt, the Item Characteristic Curve describes the probability that a specific item with difficulty β will be answered correctly as a function of skill θ. Visualizing the parameters is helpful in a few ways. First, it shows that high discriminability leads to larger differences in the maximal and minimal correctness probability (for a given skill range), and the tangent line visually links discriminability to the slope. Second, it clearly shows the maximal probability when feasibility is greater than zero. Lastly, it demonstrates that the inflection point is at the point where difficulty and skill equal out (in this case, zero).

But you cannot know that until you start asking them questions of varying difficulty β_i. Harder questions have a higher difficulty ("what is the airspeed of an unladen swallow") than easy ones ("who is buried in Grant's tomb"). The bigger the margin between a subject's skill θ_j and an item's difficulty β_i, θ_j - β_i, the more likely that subject j responds correctly, p_{i,j}(r_{i,j} = 1). This is the simplest irt model, which we call irt-base. Generally, given n test items X = (X_1, ..., X_n) and m subjects S = (S_1, ..., S_m), where each subject responds to every item, we want to estimate subject skills and item difficulties. To discover the random values that best explain the data, we turn to probabilistic inference (Pearl, 1988).

Two additional random variables further improve dad: discriminability γ_i and feasibility λ_i. Let's return to the margin between a question's difficulty β_i and a subject's skill θ_j. A discriminative question is challenging but can still be answered correctly by a strong subject. If Ken's ability is higher than most items' difficulty (θ_j - β_i is large), item discriminability multiplies this gap by γ_i in a model called irt-disc.
Questions with low γ_i (discriminability) are low quality: they have annotation error or do not make sense. Another way of capturing poor quality questions is the feasibility λ_i. For example, if the question "who was the first president" has the answer Rajendra Prasad, the question has an unstated implicit assumption that subjects must guess what country or company the question is about. In the irt-feas model (Figure 4.3), if a large fraction of subjects all get an item wrong, everyone's probability of getting the item right is capped at feasibility λ_i. In nlp terms, 1 - λ_i corresponds to the prevalence of annotation errors that lead to unsolvable items.

Having introduced the constituent elements of the model, we now present the full generative model:

1. For each subject j:
   (a) Draw skill θ_j ~ N(μ_θ, τ_θ^{-1})
2. For each item i:
   (a) Draw difficulty β_i ~ N(μ_β, τ_β^{-1})
   (b) Draw discriminability γ_i ~ N(μ_γ, τ_γ^{-1})
   (c) Draw feasibility λ_i ~ U[0, 1]
3. Draw subject j's response to item i,

    r_ij ~ p_ij(r_ij | θ_j, β_i, γ_i),  where  p_ij(r_ij = 1 | θ_j) = λ_i / (1 + e^{-γ_i(θ_j - β_i)}).    (4.1)

For irt-base, γ_i and λ_i are fixed to 1.0, while for irt-disc, only λ_i is fixed.[3] Means μ_β, μ_γ, μ_θ are drawn from N(0, 10^6) and τ_β, τ_γ, τ_θ from a Γ(1, 1) prior, as in Lalor et al. (2019) and recommended by Natesan et al. (2016).[4]

[3] In psychometrics, irt-base is called a Rasch (Rasch, 1960) or 1 parameter logistic (1PL) model, irt-disc is a 2PL model, and irt-feas is a 4PL model with guessing set to zero.
[4] We differ by allowing γ < 0 to help identify bad items.

Because it is difficult to completely codify skill and difficulty into a single number, we can rewrite the exponent in Equation 4.1 as a sum over dimensions, -γ_i Σ_k (θ_{j,k} - β_{i,k}), where each dimension captures the interaction between an item's difficulty and a subject's skill. For example, perhaps Burt could better exploit artifacts in one dimension (their skill for θ_{j,k=5} is high but everywhere else is low) while Ken might not know much about a particular topic like potent potables (θ_{j,k=2} is low but everywhere else is high). We call this model irt-vec.[5] Multidimensional irt models (Reckase, 2009) could, in addition to better modeling difficulty, also cluster items for interpretation; we briefly experiment with this (Appendix B.6), but leave more to future work (§4.9).

[5] We do not incorporate feasibility into the irt-vec model since it already improves over 1D models without it.

4.2.1 Examples are Not Equally Useful

irt's fundamental assumption is that not all items and subjects are equal. This explains why leaderboards can fail while having "normal looking" accuracies. As a thought experiment, consider a dataset that is one third easy (β_i ∈ [0, 1]), one third medium difficulty (β_i ∈ [2, 3]), and one third hard (β_i ∈ [6, 7]). Suppose that Ken has skill θ_k = 4 while Burt has skill θ_b = 2. A standard leaderboard would say that Ken has higher accuracy than Burt. But suppose there's a new subject that wants to challenge Ken; they are not going to reliably dethrone Ken until their skill θ_c is greater than six. This is a more mathematical formulation of "easy" and "hard" dataset splits in question answering (Sugawara et al., 2018; Rondeau and Hazen, 2018; Sen and Saffari, 2020). In irt-feas, this recapitulates the observation of Boyd-Graber and Börschinger (2020) that annotation error can hinder effective leaderboards. irt helps systematize these observations and diagnose dataset issues.
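To ground the generative story, here is a small numpy sketch of the feasibility-capped response probability in Equation 4.1, applied to the thought experiment above. The uniform sampling within each difficulty band and the number of items per band are assumptions made only for illustration.

```python
import numpy as np

def p_correct(theta, beta, gamma=1.0, lam=1.0):
    """Equation 4.1: p_ij(r_ij = 1) = lambda_i / (1 + exp(-gamma_i * (theta_j - beta_i)))."""
    return lam / (1.0 + np.exp(-gamma * (theta - beta)))

# Thought experiment from Section 4.2.1: thirds of easy, medium, and hard items.
rng = np.random.default_rng(0)
difficulties = np.concatenate([rng.uniform(0, 1, 100),   # easy
                               rng.uniform(2, 3, 100),   # medium
                               rng.uniform(6, 7, 100)])  # hard
for name, skill in [("Ken", 4.0), ("Burt", 2.0)]:
    # Expected accuracy under the simplest model (gamma and lambda fixed to 1).
    print(name, round(float(p_correct(skill, difficulties).mean()), 3))
```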
4.2.2 Inference

To estimate the latent parameters of our model, we use mean-field variational inference (Jordan et al., 1999). In variational inference, we propose a distribution over the latent variables, q_φ(·), that approximates the true but intractable posterior p(·). We then minimize the kl-divergence between these distributions, equivalent to maximizing the evidence lower-bound (elbo) with respect to the variational parameters. In our case, q_φ(·) is a mean-field distribution, which means it factorizes over each of the latent variables (the product is over the n × m subject-item pairs)

    q_φ(θ, β, γ, μ, τ) = q(μ) q(τ) Π_{i,j} q(θ_j) q(β_i) q(γ_i).

Specifically, for our key latent variables z ∈ {θ, β, γ}, the associated variational distributions are of the form q(z) = N(u_z, t_z^{-1}). Recall that in the generative distribution, each latent z is drawn from a N(μ_z, τ_z^{-1}) whose parameters are also latent variables; for these variables, we use the variational distributions q(μ_z) = N(u_{μ_z}, t_{μ_z}^{-1}) and q(τ_z) = Γ(a_{τ_z}, b_{τ_z}). We optimize the elbo with respect to the variational parameters φ = {u_z, t_z, u_{μ_z}, t_{μ_z}, a_{τ_z}, b_{τ_z}, λ} for all z using adam (Kingma and Ba, 2015).

With dad's leaderboard irt model introduced, we next discuss how leaderboard subjects are statistically compared and alternative methods, such as using irt parameters, to evaluate whether two models are truly different.

4.3 Ranking and Comparing Subjects

Fundamentally, the objective of comparative evaluations like leaderboards is to decide whether model A is better than model B. A thread of nlp has rightfully advocated for adding rigour to these decisions using statistics (Traub, 1997, Classical Testing Theory) where the objective is to infer a true score T from the observed test score X = T + E given a measurement error E, uniform across subjects. However, in educational testing, a field measuring skill and knowledge in humans, irt is a primary measurement instrument (Hambleton, 1991, p. 2). A major motivation for irt is that subjects of different skill have different errors. irt explicitly accounts for the bandwidth-fidelity dilemma (McBride, 1976): items can either accurately measure a narrow ability range (fidelity) or inaccurately measure large ability ranges (bandwidth).[6] This section and the next contrast methods for identifying the best model and advocate for irt. Implicit in nearly all leaderboard evaluations is ranking of models based on a statistic such as the average accuracy. As we show in §4.4, naïve rankings are noisier than irt rankings.

[6] Estimation error of θ varies by position (§B.5).

4.4 IRT for Leaderboards

Leaderboards should do two things: (1) reliably and efficiently rank better models ahead of worse models (Tague-Sutcliffe, 1992; Voorhees, 2003a) and (2) guide inspection of items and subjects (§4.5). The first ameliorates the unavoidable randomness of finite evaluations while the second enables error analysis (Wu et al., 2019) and model probing (Belinkov and Glass, 2019; Zhang et al., 2019). First we verify that irt models accurately predict the responses of subjects (§4.4.2). Next, a ranking stability analysis shows that irt has modestly better reliability than classical rankings (§4.3). Lastly, using irt to actively sample items for annotation yields rankings with better correlation to the test data (§4.4.5).
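Before turning to those experiments, the following is a minimal sketch of how the mean-field inference from Section 4.2.2 could be implemented with Pyro. It is illustrative only: it fits the irt-disc variant with fixed Normal priors instead of the hierarchical means and precisions described above, and the dissertation's actual implementation may differ.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.infer.autoguide import AutoNormal
from pyro.optim import Adam

def irt_disc(subjects, items, obs=None):
    """2PL-style model: p(correct) = sigmoid(gamma_i * (theta_j - beta_i)).
    Fixed N(0, 1) / N(1, 1) priors stand in for the hierarchical priors above."""
    n_subjects = int(subjects.max()) + 1
    n_items = int(items.max()) + 1
    with pyro.plate("subjects", n_subjects):
        theta = pyro.sample("theta", dist.Normal(0.0, 1.0))
    with pyro.plate("items", n_items):
        beta = pyro.sample("beta", dist.Normal(0.0, 1.0))
        gamma = pyro.sample("gamma", dist.Normal(1.0, 1.0))
    logits = gamma[items] * (theta[subjects] - beta[items])
    with pyro.plate("responses", subjects.shape[0]):
        pyro.sample("r", dist.Bernoulli(logits=logits), obs=obs)

def fit(subjects, items, responses, steps=2000):
    # subjects, items: LongTensors of indices; responses: FloatTensor of 0/1 grades.
    guide = AutoNormal(irt_disc)          # mean-field normal guide
    svi = SVI(irt_disc, guide, Adam({"lr": 0.1}), loss=Trace_ELBO())
    for _ in range(steps):
        svi.step(subjects, items, responses)
    return guide.median()["theta"]        # per-subject skill estimates
```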
4.4.1 Why a Linear Model Baseline

At first blush, the differences between irt and logistic regression are minimal, but we include the comparison to address natural questions from the nlp community: (1) do the idiosyncrasies of the irt formulation hurt accuracy? (2) should we add features to better understand phenomena in the questions? (3) why not use deep models? The next section argues that both irt and logistic regression are accurate, even without laboriously engineered task-specific features. Adding obvious features such as item words (e.g., question) only minimally improves accuracy. We explicitly omit less interpretable deep models since we want to make leaderboards more interpretable.

4.4.2 Response Prediction is Accurate

Just as educational testing researchers validate irt fit by seeing if it predicts subject responses correctly (American Educational Research Association, 2014), we validate how well dad predicts whether squad models get questions right. We compare against a logistic regression linear model (lm) implemented with Vowpal Wabbit (Agarwal et al., 2014). Since integrating hand-crafted features is easy, we incorporate features derived from subject ids; item ids; functions of the squad question, answer, and title; and irt parameters (details in Appendix B.2).

Figure 4.4: [Bar charts of roc auc, macro F1, and accuracy for the irt models (irt-vec, irt-feas, irt-disc, irt-base) and linear model ablations (lm All, lm +IRT, lm +Item ID, lm +Subject ID, lm +Question, lm +Context, lm +Stats, lm +Subj & Item ID, lm +Topics 1K, lm +Title, lm +Baseline).] We compare each irt and linear model (lm) by how well they predict subject responses. We focus on roc auc since predicting responses is an imbalanced classification problem (most subjects are correct). Under that metric, all irt models improve over the best lm, and the strongest lm ablation only uses irt features. That textual features are predictive in the lm suggests they could improve future models.

As in irt, logistic regression predicts whether a subject correctly responds to an item. Later, we discuss ways to integrate more features into irt (§4.9).

4.4.2.1 SQuAD Leaderboard Data

Experiments use data from the squad 2.0 leaderboard. Development data is publicly available and test set responses were obtained from the organizers. There are 161 development subjects, 115 test subjects, and 11,873 items (1.9 million total pairs). Experiments that do not need the test responses use all the development subjects; those that do use the smaller test subset.

4.4.2.2 Evaluation Scheme

Following prior work (Wu et al., 2020), we evaluate irt and linear models by holding out 10% of responses and computing classification metrics.[7] In squad, predicting whether a response is correct is an imbalanced classification problem (80.4% of responses in the development set are correct). Thus, we use roc auc, macro F1, and accuracy.

[7] Everywhere else in the chapter, we train on all responses.

4.4.2.3 IRT Response Prediction is Accurate

irt models that incorporate more priors into the generative story should be better, but are they? We compare four irt models: irt-base, irt-disc, irt-feas, and irt-vec (§4.2). The more sophisticated models are better and all improve over the lm (Figure 4.4) and correlate well with each other (Appendix B.3). To be clear, while outperforming the lm is good, our goal was to validate that irt models are accurate; later, we inspect model errors and identify annotation errors (§4.5).
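The held-out evaluation in Section 4.4.2.2 reduces to standard binary classification metrics. A minimal sketch with scikit-learn follows; the tooling choice and 0.5 threshold are assumptions, since the dissertation does not specify how the metrics were computed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, accuracy_score

def evaluate_response_predictions(y_true, p_correct, threshold=0.5):
    """Score held-out responses: y_true are 0/1 graded responses and
    p_correct are model probabilities that each response is correct."""
    y_pred = (np.asarray(p_correct) >= threshold).astype(int)
    return {
        "roc_auc": roc_auc_score(y_true, p_correct),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "accuracy": accuracy_score(y_true, y_pred),
    }
```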
Figure 4.5: [Two panels of Kendall rank correlation versus development set sample size (16 to 4,096): development sample to development sample (left) and development sample to test set (right), comparing irt-to-irt with accuracy-to-accuracy rankings.] Compared to the final ranking over a large test set, how well does a small test set correlate? The left shows correlation between mutually exclusive development set samples and the right between development samples and the full test set. In both experiments (panes), ranking systems by irt ability is more stable across all sample sizes than mean accuracy and thus more reliable (Kendall's rank correlation is higher). Bands show 95% confidence intervals of rank correlations across ten trials per sample size.

4.4.2.4 What Model Features are Predictive?

Integrating additional features into Bayesian models is not trivial, so we instead use the flexibility of linear models to identify useful features. Our leave-one-in ablation compares features (Figure 4.4): the top ablations both use irt features, further validating irt parameters. The subject and item identifier features are also strongly predictive, but item is the stronger of the two. Text-based features are weaker, but this suggests room for future work in better integrating them into irt models (§4.9).

4.4.3 Ranking with IRT

Leaderboards should produce reliable subject rankings: can dad rank systems even with a tiny test set? Thus, we compare the reliability of traditional average accuracy (§4.3) to irt ability rankings. Our first experiment (§4.4.3.1) examines the stability of existing items and subjects while the second (§4.4.5) investigates stability of "new" evaluation data using various sampling strategies.

4.4.3.1 IRT Rankings Have Better Reliability

Rankings should be reliable within the same dataset (e.g., on the dev set) and generalize to similar datasets (e.g., with a test dataset). To test the first, we measure the ranking stability of mutually exclusive samples of the development data (Buckley and Voorhees, 2000). To test the second, we measure the correlation between development set sample rankings and test set rankings (Voorhees, 1998).
Crucially, this is better than the corresponding classical metrics like accuracy, supporting our original motivation for using irt.10 4.4.4 Statistical Significance of Difference in Kendall Tau Coefficients While Figure 4.5 shows a consistent difference in correlation between ranking methods, it is unclear whether this difference is statistically significant. We estimate the statistical significance of the difference with bootstrap sampling (Efron, 1994). Since the null case is no difference in correlation coefficients, we seek a symmet- ric sampling distribution centered at zero that represents a realistic density function. Each ranking stability experiment11 trial results in two lists of number pairs. The lists correspond to subject scores on two datasets;12 each number pair is the subject?s accuracy and irt score. To create the bootstrap distribution, we (1) sample with replacement pairs from one list, (2) compute the correlation between the resampled ranking and unused ranking when using accuracy versus irt score, and (3) compute and store the irt correlation score minus the accuracy correlation score. We repeat this process 1000 times for each of the 10 trials in the original experiment and ag- gregate all the differences to build the bootstrap distribution. For each sample size we compute the empirical P-Value on each trial which we show in box and whisker plots (Figure 4.6). While individual P-values tend to be high (particularly for dev to dev sampling), there is still a consistent trend across sample sizes. 8The sample size must be less than half the size of the development data so that we can obtain two samples. 9For squad, ordering by mean exact match score. 10Since the maximum trial size was limited, we train one final model with the full data, see Table B.3 in Appendix B.4. 11One experiment for development sample to development sample and one for development sample to test set. 12In the first experiment, development set samples; in the second, a development set sample and the full test set. 63 Dev Sample to Dev Sample Dev Sample to Test 1.0 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 0 1,000 2,000 3,000 4,000 5,000 0 1,000 2,000 3,000 4,000 5,000 Sample Size Sample Size Figure 4.6: P-values of the rank correlation difference for each sample size and trial in Figure 4.5. The inherent noise in dev set sampling makes inferring significance difficult (left); test set driven results (right) are more significant. 4.4.5 IRT Improves Cold Start Reliability irt can also guides the construction of tests. Just as irt practitioners prepare tests for humans, we too construct tests for machines. In educational testing, col- lecting responses from humans is expensive; likewise, although questions are cheap in search-based qa tasks (Nguyen et al., 2016; Kwiatkowski et al., 2019), annotating answers is expensive. Likewise, ?grading? machine dialog responses is expensive and irt helps (Sedoc and Ungar, 2020). To emulate this setting, we use computerized adaptive testing (Weiss and Kingsbury, 1984) to iteratively select squad items to ?annotate.? As in human test preparation, we use existing annotations to infer item pa- rameters and iteratively infer the ability of new subjects. This experiment splits m subjects into a training group (80%) and a testing group (20%). The training group represents subjects for which we have full item predictions and annotations; the test- ing group represents a new group of subjects that we need to rank. 
To efficiently rank, we should iteratively choose items to annotate that yield the most information about the ranking if all the data were annotated. This experiment compares how well several item selection strategies work. For each selection method, we (1) choose a sample size, (2), sample from the development set, (3) compute the ranking of subjects, and (4) compute Kendall?s rank correlation (Figure 4.7).13 Which item selection strategies should we compare? As a baseline, we use na?ve random sampling. Like prior work, we compare selecting items with the highest difficulty and the highest discriminability (Lalor et al., 2019) as well as the sum of the two.14 We propose that items should be selected according to their Fisher information content (Weiss, 1982) (p?ij) 2 Ii(? 2 j) = ? = ?i pij(1? pij) (4.2)pij(1 pij) 13We compute correlations with the complete development set on ten trials to build 95% confi- dence intervals. 14We train an irt-disc model to simplify sampling (e.g., avoiding a tradeoff between feasibility and discriminability). 64 P-Value 1.00 0.95 0.90 0.85 0.80 0.75 0.70 Sampling Method 0.65 High Information High Discrimination 0.60 High Disc + Diff High Difficulty 0.55 Random 0.50 10 20 30 100 200 300 1,000 2,000 10,000 Development Set Sample Size Figure 4.7: Suppose we need to cold start, collect annotations for a new subject: what order would most rapidly increase correlation to the test data? As we expect, the correlations eventually converge, but with little data, irt has better correlation than other methods. We suspect that the irt information underperforms early on when the subject ability estimate is unstable. as derived by Lord et al. (1968, p. 70). Intuitively, if we do not yet know the true skill ?j, we should pick items whose expected response we are most uncertain about. Our uncertainty (entropy) is max- imized when the likelihood of a correct response pij is the same as the likelihood of an incorrect response 1 ? pij, which corresponds to the maximal value of Ii(?j); it is also sensible this value increases as discriminability ?i increases. To infer the maximally informative items, we estimate the ability ?j of each subject using the currently selected items, use the ability to compute the infor- mation of each yet-to-be-annotated item for each subject, and then aggregate the informativeness ? Info(i) = Ii(?j) (4.3) j by item i summed over subjects j. This approach is similar to uncertainty sampling and reduces to it for the irt-base model (Lewis and Gale, 1994). We initially seed with the twenty-five most discriminative items (details in Appendix B.4). Like computerized adaptive testing (Moreno et al., 1984), Figure 4.7 shows that at lower sample sizes three of the irt sampling methods are better than random sampling?difficulty does worse. The other irt methods have comparable correla- tion. Thus, by using irt, dad can both improve rankings and guide annotation. 65 Correlation to Test Rank Discriminability: 3.24 Difficulty: 3.86 Feasibility: 0 Mean Exact Match: 0 Wikipedia Page: Computational Complexity Theory Question: In the determination of complexity classes, what are two examples of types of Turing machines? Official Answer: probabilistic Turing machines, non-deterministic Turing machines Context: Many types of Turing machines are used to define complexity classes, such as deterministic Turing machines, probabilistic Turing machines, non-deterministic Turing machines, quantum Turing machines, symmetric Turing machines and alter- nating Turing machines. 
They are all equally powerful in principle, but when resources (such as time or space) are bounded, some of these may be more powerful than others.

Figure 4.8: This question is regarded as infeasible by the irt model. Upon further inspection, the answer omits five acceptable answers, but more importantly does not permit all combinations of Turing machines.

4.5 Qualitative Insights on Leaderboards

dad also helps qualitative analysis of items and subjects. First, irt identifies overfitting and generalizes partitioning datasets by difficulty. Then we show that, like in educational testing, irt identifies good and bad items.

4.5.1 Guiding Analysis with IRT

Several works curate easy and hard qa subsets based on how many models answer correctly (Rondeau and Hazen, 2018) or heuristics (Sugawara et al., 2018). irt can create similar subsets using irt-feas, the best 1D model. Difficulty finds where subjects improve while discriminability and feasibility can surface items that may be invalid. For example, one low feasibility question (Figure 4.8)[15] asks "what are two examples of types of Turing machines?", which has two problems: (1) the answer omits five types and (2) span-based evaluation precludes selecting non-contiguous types.

After excluding items with negative discriminability (they are likely erroneous), we sort items into bins. We break both difficulty and discriminability into four bins, taking the 25th, 50th, and 75th percentiles, creating eight total bins. Then we select representative squad subjects with their exact match (Figure 4.9). Let's examine a feasible item with positive difficulty and discriminability like "what reform was attempted following the Nice treaty?"[16] In this case, the annotator's span is too long, resulting in almost no correct answers and a low fuzzy match score.[17] In contrast, one highly discriminative question succeeds because there are multiple plausible guesses to "who did the Normans team up with in Anatolia?"[18] While both the Armenian state and Turkish forces are superficially plausible answers, only Turkish forces is correct; nonetheless, some models are fooled. Using irt to guide subject analysis is helpful; next, we test how efficient it is in identifying annotation error.

[15] Question ID 56e1b00ce3433e14004230a1
[16] A: "there was an attempt to reform the constitutional law of the eu and make it more transparent." (Appendix Figure B.2)
[17] For simplicity, we use squad's dichotomous exact match; squad also has F1, a token-level fuzzy match metric.
[18] Example with statistics in Appendix Figure B.3.

Figure 4.9: [Heatmap of per-subject accuracy (dev and test) across difficulty and discriminability quartiles (low, med-low, med-high, high) for selected squad leaderboard subjects: SA-Net on Albert, RoBERTa, BERT + Synthetic Self-Training, ARSG-BERT, and Tree-LSTM + BiDAF + ELMo.] We partition evaluation data by irt difficulty and discriminability with accuracy in each quartile. Most improvements in high-accuracy systems come from getting high-difficulty questions right. Items with low discriminability (and thus prone to annotation errors) are difficult for all subjects except the overfit arsg-bert model. We include top-performing squad subjects, several notable subjects (systems), and a pair from the bottom of the leaderboard.

4.5.2 Identifying Annotation Error

To test if irt can identify annotation error, we inspect sixty squad development set items.
We select ten items from each of these groups: the most negative discriminability, discriminability nearest to zero, the highest discriminability, the least difficult, the most difficult, and irt model errors. For each, we annotate whether the item was correct, was "correct" yet flawed in some way, or simply wrong (Figure 4.10). Inter-annotator agreement between three authors on this three-way annotation with Krippendorff's α (Krippendorff, 2004; Artstein and Poesio, 2008) is 0.344. Despite only modest agreement, just as in the development of education tests, negative discriminability is predictive of bad items. When discriminability is negative, then the probability of getting the answer right is higher when ability is lower, which is undesirable: Ken consistently loses to Burt on those items. This could identify bad items in evaluation sets for removal.

Figure 4.10: [Bar charts of annotation outcomes (Correct, Flawed, Wrong) for items grouped by Diff: High, Diff: Low, Disc: High, Disc: Neg, Disc: ≈0, and IRT Val, with explanations such as "Is Answerable," "Wrong Answer," "Partially Correct," "Ambiguous," "Incomplete Answer," "Misleading," "Bad Question," "No Answer High/Low Probability," and "Answer Set Incomplete."] We annotate squad items by discriminability, difficulty, and irt prediction errors. For example, one question with negative discriminability was classified as "Wrong" with the explanation that the annotated answer indicates it is not answerable, but the question actually is answerable. Items with negative discriminability or where irt's prediction is wrong have a much higher rate of annotation error ("Flawed" or "Wrong"). Using similar methodology, errors in datasets could be more rapidly identified.

4.6 Related Work

This chapter draws together two primary threads: we use irt to understand datasets, which has been applied to other nlp tasks, and apply it to improving leaderboards. Finally, we explore how the insights of irt can improve not just the analysis of test sets but also the construction of test sets.

IRT in NLP irt is gaining traction in machine learning research (Martínez-Plumed et al., 2016, 2019) where automated metrics can be misleading (Sedoc et al., 2019): machine translation (Hopkins and May, 2013) and chatbot evaluation (Sedoc and Ungar, 2020). Concurrent with our work, Vania et al. (2021) compare nlp test sets with irt. Closest to our work in nlp is Otani et al. (2016), who rank machine translation subjects and compute correlations with gold scores. Similarly, Martínez-Plumed and Hernández-Orallo (2020) use irt on non-language ai video game benchmarks. Just as we use irt to identify difficult or easy items, Lalor et al. (2016) create challenge sets for textual entailment. We test irt as a way to guide annotation, but it can also be used to train nlp models; for example, deep models learn "easy" examples faster (Lalor et al., 2018) and maintain test accuracy when training data is down-sampled with irt (Lalor et al., 2019).

Improving Leaderboards The rise of leaderboards in nlp has encouraged critical thought into improving them (Linzen, 2020), improving evaluation more broadly (Eger et al., 2020), and thoughtful consideration of their influence on the direction of research (Sculley et al., 2018; Dotan and Milli, 2020).
dad aims to make leaderboard yardsticks (Hernandez-Orallo, 2020) more reliable, interpretable, and part of curating the benchmark itself. In line with our reliability goal, just as statistical tests should appear in publications (Dror et al., 2018; Dodge et al., 2019), they should be "freebies" for leaderboard participants (Ethayarajh and Jurafsky, 2020). Alternatively, Hou et al. (2019) posit that leaderboards be automatically extracted from publications. Aggregating multi-task (Wang et al., 2019b,a; Fisch et al., 2019) and multi-metric benchmarks (Ma et al., 2021) is an open question which, although we do not address it, is one use for irt.

This chapter implicitly argues that leaderboards should be continually updated. As a (static) leaderboard ages, the task(s) become overfit (Recht et al., 2019), which, although mitigable (Blum and Hardt, 2015; Anderson-Cook et al., 2019), is best solved by continually collecting new data (Kiela et al., 2021). Ideally, these new data should challenge models through adversarial collection (Wallace et al., 2019b; Nie et al., 2020) and other methods (Gardner et al., 2020a). However, if making a too-easy leaderboard harder is not possible, we should recognize that the leaderboard has outlived its helpfulness and should be retired (Voorhees, 2000b).

Part of this chapter centers on alternate task efficacy rankings, but this naïvely assumes that task efficacy is the sole use case of leaderboards. Indeed, focusing solely on these factors can mislead the public (Paullada et al., 2020), may not reflect human language capabilities (Schlangen, 2020), or even be ecologically valid (de Vries et al., 2020). Leaderboards are also well positioned to provide incentive structures for participants that prioritize fairness (Bender and Friedman, 2018) and efficiency (Strubell et al., 2019; Schwartz et al., 2020; Min et al., 2021) or incorporate testing of specific capabilities (Ribeiro et al., 2020; Dunietz et al., 2020). To enable these more nuanced analyses, leaderboards should accept runnable models rather than static predictions (Ma et al., 2021).

Active Learning Beyond irt, the analysis of training dynamics and active learning (Settles, 2009) is helpful for actively sampling specific items or identifying low-quality items (Brodley and Friedl, 1999). For example, Swayamdipta et al. (2020) and Pleiss et al. (2020) propose alternative training dynamics-based methods for identifying difficult items as well as annotation errors. Even closer to our goals, Rahman et al. (2020) use active learning to build a test collection. Explicitly measuring how effectively examples separate the best subject from the rest allows test set curators to "focus on the bubble" (Boyd-Graber and Börschinger, 2020), prioritizing examples most likely to reveal interesting distinctions between submitted systems.

Alternate Formulations irt is an example of convergent evolution of models that predict subject action given an item. Ideal point models (Poole and Rosenthal, 2017) consider how a legislator (subject) will vote on a bill (item) and use a similar mathematical formulation. The venerable elo model (Glickman and Jones, 1999) and modern extensions (Herbrich et al., 2007) predict whether a player (subject) will defeat an opponent (item) with, again, a similar mathematical model. Certain irt models can also be formulated as nonlinear mixed models (Rijmen et al., 2003), where the item parameters are fixed effects and the latent subject parameters are random effects.
This allows for comparisons between irt models and other mixed-effects models under a consistent framework. irt-base and irt-disc can be formulated as nonlinear mixed models, and irt-feas can be formulated as a discrete mixture model over items. As we discuss further in the next section, dad's application of irt can be further improved by adopting interpretable extensions of these models.

4.7 Conclusion

This chapter advocates incorporating decades of research in crafting education tests to improve how we evaluate the capabilities of nlp models. We propose and validate an alternate irt ranking method for leaderboard evaluations and show that it can guide annotation, detect annotation error, and naturally partition evaluation data. Just as educators moved from classical testing to irt, the nlp community should consider building future evaluations with irt.

4.8 Limitations

Although there is much to gain through irt evaluation, there are limitations that make it hard to implement. First, it requires access to item-level responses for all examples and all subjects, which are often only available to organizers. Second, Urbano (2016) notes that sampling mutually exclusive subsets has drawbacks: samples are not entirely independent. Lastly, our work is a proof of concept using squad 2.0 as a test bed, and our results may not generalize.

4.9 Future Work

We see a few directions for future work, which we outline here and describe further in Chapter 7. First, this work validates irt's usefulness for leaderboards, but to fully realize its potential, irt needs to be implemented in an existing or new benchmark. Second, our irt models do not incorporate the item content (e.g., question text) when predicting responses, but in principle they could; Bayesian models with metadata (Card et al., 2018) and ideal point models from political science (Poole and Rosenthal, 1985) that incorporate bills and speeches do exactly this (Gerrish and Blei, 2011; Nguyen et al., 2015; Kraft et al., 2016). Analogously, irt for leaderboards can and should also incorporate text from passages, questions, and answers to better model what makes questions difficult. Such a model could also predict which characteristics would create discriminating or difficult items. Third, statistical testing is strongly recommended in nlp (Dror et al., 2018; Dodge et al., 2019), and we should investigate how irt-based statistical tests differ (§B.5). Lastly, multidimensional irt models that evaluate multiple skills could aid multi-task leaderboards like mrqa (Fisch et al., 2019) and Dynaboard (Ma et al., 2021).

While this chapter leaves the dataset and format unchanged, the broader message is that we should understand how question difficulty and discriminability affect qa evaluations. In the next two chapters, we introduce a Manchester paradigm qa format that, by construction, has more discriminative power (Chapter 5) and use human-machine cooperative writing to create more challenging questions (Chapter 6).

Chapter 5: The Case for Incremental Question Answering Evaluation

You'll rarely hear it acknowledged on Jeopardy! or Who Wants to Be a Millionaire, but many of the elite winners on these shows got their trivia training in high school and college quiz bowl. . . . When I took the written test at my Jeopardy! audition . . . I felt like a runner who'd been training in high-altitude Mexico City, just to get his lungs in such superpowered shape that events at lower elevations seemed like a piece of cake.
Brainiac: Adventures in the Curious, Competitive, Compulsive World of Trivia Buffs
Ken Jennings (2006, p. 81)

In contrast with asking questions to learn (Chapter 3), scholastic trivia competitions use questions to test knowledge and intelligence.1 In this way, and in line with the Manchester paradigm, trivia games are a variant of the Turing test, since we expect that answering a set of questions as well as a human is a minimum bar towards demonstrating human-like intelligence (Yampolskiy, 2013). In the previous chapter, we considered a task (squad) where participants are scored dichotomously:2 they are either correct or wrong. A byproduct of this choice is that when subjects are both correct or both wrong, we learn substantially less about their latent skill (§4.4.5). Unfortunately, as Ray Hamel3 notes, "the problem of gauging difficulty, of making a question easy enough to be accessible, but tough enough so that listeners still have to scratch their heads" (Jennings, 2006, p. 228) is challenging.

This chapter makes the case that the format used by the trivia game Quizbowl (qb) better distinguishes between players (machine or human) of disparate skill while incorporating dynamic elements that also make it fertile space for sequential decision-making in qa.4 Beyond the format of qb, we demonstrate how collaboration with a thriving and enthusiastic community provides natural ways to engage with and educate the public about nlp research; Chapter 6 takes this a step farther by combining the efforts of humans and machines to author better qa datasets.

1 This chapter is based on Rodriguez et al. (2019).
2 squad's fuzzy matching metrics are arguably a concession to annotation noise rather than genuine partial credit.
3 Ray Hamel is a member of the Trivia Bowl Hall of Fame, estimates he has written over 40,000 trivia questions, and has written over 750 (trivia) crossword puzzles for Games, The New York Times, The Los Angeles Times, Newsday, Dell Champion Crosswords, and CrosSynergy (Hamel, 2000).
4 This work substantially expands the number of questions compared to Boyd-Graber et al. (2012) and the number of player-question records compared to He et al. (2016b). We also make the setting significantly harder by not restricting questions to only the most frequently asked-about answers (1K versus 24K) as in Iyyer et al. (2015). This work also incorporates the exhibition match from Boyd-Graber et al. (2017). Finally, we create a new evaluation procedure that better estimates how models fare in the real world versus human opponents.

At its premiere, the librettist of this opera portrayed a character who asks for a glass of wine with his dying wish. That character in this opera is instructed to ring some bells to summon his love. At its beginning, a man who claims to have killed a serpent has a padlock put on his mouth because of his lying. The plot of this opera concerns a series of tests that Tamino must undergo to rescue Tamina from Sorastro. For 10 points, name this Wolfgang Mozart opera titled for an enchanted woodwind instrument.
Answer: The Magic Flute

Figure 5.1: qb is a trivia game where questions begin with clues that are initially difficult but become progressively easier until a giveaway at the end of the question. Players answer as soon as they know the answer, so the earlier they answer, the more knowledgeable they are. For example, answering after the first sentence indicates that the player recognizes the librettist (Emanuel Schikaneder) and knows that he played Papageno in The Magic Flute (die Zauberflöte). Answering at the end of the question only requires surface knowledge of Mozart's operas.
5.1 An Introduction to Quizbowl

Answering questions is an important skill for both humans and computers. Exams form the foundation of educational systems and, for many societies, of the civil system (Fukuyama, 1995). Computers answering questions in the Turing test is the standard definition of artificial intelligence (Turing, 1950). But another, more trivial form of question answering is more pervasive in popular culture.

Trivia games are pervasive and popular: from quizzing in India (Roy, 2016) to "What? Where? When?" in Russia (Korin, 2002) to "Who Wants to Be a Millionaire" (Clarke et al., 2001; Lam et al., 2003), trivia encourages people to acquire, recall, and reason over facts. For computers, Yampolskiy (2013) argues that these skills are ai-complete: solve question answering and you have solved ai generally.

A central idea in this chapter is that the intense research in question answering would benefit from adopting the innovations and lessons learned from human trivia competition, as embodied in a trivia format called Quizbowl (qb). In qb, questions are posed incrementally, word by word, and players must interrupt the question when they know the answer (Figure 5.1). "Speed is therefore crucial" since it rewards players who can answer with less information than their opponents (Jennings, 2006). This is not just a gimmick to separate it from other question answering formats: players must simultaneously think about what is the most likely answer and, after every word, decide whether it is better to answer or wait for more information. To succeed, players and machines alike must answer questions, maintain accurate estimates of their confidence, and factor in their opponents' abilities. The combination of these skills makes qb challenging for machine learning algorithms.

A dedicated and skilled community forged qb over decades (§5.2), creating a diverse and large dataset (§5.3). We refer to this dataset as the qanta dataset because (in our opinion) Question Answering is Not a Trivial Activity.5 Playing qb requires deciding what to answer (§5.5) and when to answer (§5.6). Our final contribution is a framework that combines independent systems for each of these sub-tasks. Despite its simplicity, our implementation of this framework is competitive with the best players. Section 5.8 showcases qb as a platform for simultaneously advancing research and educating the public about the limits of machine learning through live human-computer competitions. Finally, we discuss ongoing and future research using trivia questions to build machines that are as capable of reasoning and connecting facts as humans.

5.2 Why Quizbowl?

When discussing machine learning and trivia, the elephant in the room is always ibm's tour-de-force match (Ferrucci et al., 2010) against Ken Jennings and Brad Rutter on Jeopardy! Rather than ignore the obvious comparisons, we take this on directly and use the well-known Jeopardy! context, which we gratefully acknowledge as making our own work possible, as a point of comparison, since qb is a better differentiator of skill between participants, be they human or machine (§5.2.1 and §5.2.2). Throughout, we discuss the history of trivia to show that the hard-won lessons humans learned about question answering transfer to machine question answering.

The qa format categorization of Gardner et al.
(2020b) names three tasks where framing the problem as qa is useful: (1) filling human information needs, (2) qa as annotation or probe, and (3) qa as a transfer mechanism.6 Like searchqa (Dunn et al., 2017, Jeopardy!), qb does not explicitly probe specific linguistic phenomena; it uses language to ask what humans know. In contrast to questions posed to search engines or digital assistants (Nguyen et al., 2016; Kwiatkowski et al., 2019), qb is less ambiguous: question writers ensure that the descriptions uniquely identify one and only one answer, a non-trivial goal (Voorhees and Tice, 2000). Thus, although the goals in qb are superficially similar to open-domain information-seeking, they are distinct and align with the Manchester paradigm (§2.1.2).

The qb format is compelling and consistent because of its evolution (Figure 5.2) over its fifty-year history.7 Many of the challenges the nlp community faces in collecting good question answering datasets at scale (Hermann et al., 2015) were first encountered by trivia aficionados. For example, trivia writers avoid predictive yet useless patterns in data (Jia and Liang, 2017; Kaushik et al., 2020): players do not like re-used clues making questions trivially easy. Similarly, players prize the novelty of questions, which is often created by

tak[ing] two elements you wouldn't have thought had a commonality and put[ting] them together and then you go, "Oh, yea, that's pretty cool, there's a connection there that nobody's seen." (Ray Hamel (Jennings, 2006))

Creating multi-hop reasoning questions is a shared goal with HotPotQA (Yang et al., 2018b), but it has been challenging to write questions that actually require it (Min et al., 2019). We distill these lessons, describe the craft of question writing that makes qb a compelling question answering task (§5.2.3), and enumerate some nlp challenges required to truly solve qb (§5.2.4). We conclude by framing qb as a hybrid task between question answering and sequential decision-making (§5.2.5).

5 Dataset available at http://datasets.qanta.org.
6 The Cranfield and Manchester paradigms (§2.1.1 and §2.1.2) are related to (1) and (2).
7 After returning from World War II and inspired by uso morale-building activities, Canadian Don Reid sketched out the format with the first host Allen Ludden. After a radio premiere in 1953, College Bowl moved to television in 1959 and became the first television show to win a Peabody (Baber, 2015). The format established many careers: the future president of the National Broadcasting Corporation (nbc), Grant Tinker, served as the game's first scorekeeper (the newly designed game and its scoring were so confusing that Allen Ludden often had to ad lib to let Tinker catch up). The format was intriguing enough that Granada studios copied it, initially without permission, into what became the uk cultural touchstone University Challenge (Taylor et al., 2012), establishing the career of Bamber Gascoigne.

[Figure 5.2 timeline: 1953 College Bowl on radio; 1958 Van Doren scandal; 1962 University Challenge; 1965 pub quizzes begin; 1977 College Bowl in acui; 1979 Trivial Pursuit; 1991 acf founded; 1996 naqt and pace founded; 2009 naqt takes over acui; spanning the birth of modern trivia, its popularization, and its professionalization.]
Figure 5.2: Trivia has gone from a laid-back pastime to an organized, semi-professional competition format. The qb framework in particular, which arose from College Bowl (us) and University Challenge (uk), emphasizes fairness and the ability to discover the better question answerer. As organizations such as the Academic Competition Federation and National Academic Quiz Tournaments emerged, the format has focused on academic, well-run tournaments.

5.2.1 What is a Buzzer Race?

The scapegoat for every Jeopardy! loser and the foundation of every Jeopardy! winner is the buzzer (Harris, 2006). A buzzer is a small handheld device that players
press to signal that they can correctly respond to a clue. The fundamental difference between Jeopardy! and qb, and what makes qb more suitable for research, is how clues are revealed and how players use the buzzer. Jeopardy! is a television show and uses the buzzer to introduce uncertainty, randomness, and thus excitement for the viewer at home.8 In Jeopardy!, players can only use the buzzer when the moderator has finished reading the question.9 If players use the buzzer before the question is finished, they are locked out and prevented from answering the question for a fraction of a second (an eternity in the fast-paced game of Jeopardy!). This advantaged Watson in its match against two opponents with feeble human thumbs and reflexes, as Jeopardy! uses the buzzer to determine who among those who know the answer has the fastest reflexes.10 While Watson gets an electronic signal when it is allowed to buzz, the two humans watch for a light next to the Jeopardy! game board to know when to buzz. Thus, Watson, an electronic buzzing machine, snags the first choice of questions, while the two humans fight over the scraps. In Jeopardy!, reflexes are almost as important as knowledge. Next we show how the structure of qb questions and its use of a buzzer rewards depth of knowledge rather than reflexes.

5.2.2 Pyramidality and Buzzers

In contrast, qb is a game honed by trivia enthusiasts which uses buzzers as a tool to determine who knows the most about a subject. This is possible because the questions are interruptable. Unlike Jeopardy!, players can interrupt the questions when they know the answer (recall that questions are multi-sentence in qb). This would make for bad television (people like to play along at home and cannot when they cannot hear the whole question), but it makes for a better trivia game that also requires decision-making under uncertainty.

This alone is insufficient, however; if an easy clue appears early in the question, then knowing hard clues later in the question is irrelevant. Questions that can be answered with only a fraction of their input are a bad foundation for research (Sugawara et al., 2018; Feng et al., 2019). qb addresses this problem by structuring questions pyramidally. In pyramidal questions, clues are incorporated so that harder, more obscure information comes first in the question, and easier, more obvious information comes at the end of the question. Thus, when a player answers before their opponents, they are more knowledgeable than their opponents.

This also makes qb an attractive machine learning research domain.

8 As Jennings (2006) notes about qb, "certainly it's hard to play along at home with a game where the questions can be interrupted after the moderator has only read a few syllables."
9 In Jeopardy!, terminology is reversed so that a moderator reads clues termed answers to which players must supply the correct question. To avoid confusion, we follow standard terminology.
10 In a Ken Jennings interview with npr (Malone, 2019), the host Kenny Malone summarized it well as "To some degree, Jeopardy! is kind of a video game, and a crappy video game where it's, like, light goes on, press button, that's it." Ken Jennings agreed, but characterized it as "beautiful art and not a really crappy video game".
The giveaways are often easy for computers too: they are prominent on Wikipedia pages and have appeared in many questions. Thus, it is easy for computers to answer most questions at some point: qb is not an impossibly difficult problem. The challenge then becomes to answer the questions earlier, using more obscure information and higher-order reasoning. Humans who play qb have the same yearning; they can answer most of the questions, but they want to deepen their knowledge to buzz in just a little earlier. They keep practicing, playing questions and going to tournaments to slowly build skill and knowledge. qb is engineered for this to be a rewarding experience. The same striving can motivate researchers: it does not take much to buzz in a word earlier. As small incremental improvements accumulate, we can build more robust, comprehensive question answering systems. And because qb has a consistent evaluation framework, it is easy to see whose hard work has paid off.

Thus, the form of qb questions, the product of decades of refining how to measure how humans process and retrieve information, can also compare machines' question answering ability. We next describe the cultural norms of question writing in the qb community that contribute to making it a challenging task for humans and machines alike.

5.2.3 The Craft of Question Writing

The goal of qb is to reward "real" knowledge. This goal is the product of a long history that has resulted in community norms that have evolved the competition into a thriving, carefully designed trivia ecosystem. By adopting these conventions, machine learning can benefit from the best practices for question answering evaluation without repeating the same mistakes.

Question writers in the community focus on creating high-quality questions that are novel and pyramidal; experts write thousands of questions each year.11 To maintain the quality and integrity of competition, the community enforces rules consistent with machine learning's quest for generalization as described by Boyd-Graber and Börschinger (2020): avoiding ambiguity, ensuring correctness, eschewing previously used clues, and allowing for fair comparisons between teams (Lujan and Teitler, 2003; Vinokurov; Maddipoti). Each year, tournaments draw roughly 10,000 middle school students, 40,000 high school students, and 3,200 college students (National Academic Quiz Tournaments, 2020). At the same time, in preparation for tournaments, students study questions from previous years.

These dueling groups, players and writers, create a factual arms race that is the foundation for the quality of qb questions. Aligning annotators' motivations (von Ahn, 2006), such as playing a game, with the goals of the data collection improves the quality and quantity of data.

11 Regional competition questions are written by participants; championship competition questions are written by professionals hired by either the Academic Competition Federation (acf), National Academic Quiz Tournaments (naqt), or the Partnership for Academic Competition Excellence (pace). While the exact organizational structure varies, initial draft questions are vetted and edited by domain experts.

A similar arms race between dataset
exploiters (attackers) and those seeking to make datasets more robust (defenders) exists in other machine learning domains like computer vision (Carlini and Wagner, 2017; Hendrycks et al., 2021) and Build It, Break It (Fix It) style tasks (Ettinger et al., 2017; Thorne et al., 2019; Dinan et al., 2019a; Nie et al., 2020).

In qb, answers are uniquely identifiable named entities such as (but not limited to) people, places, events, and literary works. These answers are "typified by a noun phrase" as in Kupiec (1993) and later in the trec qa track (Voorhees, 2003a). Similar answer types are also used by other factoid question answering datasets such as SimpleQuestions (Bordes et al., 2015), SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017), and NaturalQuestions' short answers (Kwiatkowski et al., 2019). In its full generality, qb is an open-domain qa task (Chen et al., 2017; Chen and Yih, 2020). However, since the vast majority of answers correspond to one of the six million entities in Wikipedia (§5.3.4),12 we approximate the open-domain setting by defining Wikipedia as our source of answers (Section 5.9.1 reframes this in reading comprehension's span selection format). Like the ontology of ImageNet (Deng et al., 2009), no formalism is perfect, but it enables automatic answer evaluation and linking to a knowledge base. In qb, though, the challenge is not in framing an answer; it is in answering at the earliest possible moment.

The pyramidal construction of questions, combined with incrementality, makes qb a fairer and more granular comparison. For example, the first sentence of Figure 5.1, also known as the lead-in, while obscure, uniquely identifies a single opera. Questions that begin misleadingly are scorned and derided in online discussions as "neg bait";13 thus, writers ensure that all clues are uniquely identifying even at the start.

Questions are carefully crafted in their entirety, not just the lead-in. Middle clues reward knowledge but cannot be too easy: clues that appear frequently in questions or prominently on the subject's Wikipedia page are considered "stock" and should be reserved for the end. These same insights have been embraced by machine learning in the guise of adversarial methods (Jia and Liang, 2017) that eschew superficial pattern matching. In contrast, the final giveaway clue should be direct and well known enough that someone with even a passing knowledge of The Magic Flute would be able to answer.

This is the product of a complicated and nuanced social dynamic in the qb community. Top teams and novice teams often play on the same questions; questions are, in part, meant to teach (Gall, 1970), so they are best when they are fun and fair for all.

12 A minority of answers cannot be mapped. Some answers do not have a page because Wikipedia is incomplete (e.g., not all book characters have Wikipedia pages). Other entities are excluded by Wikipedia editorial decisions: they lack notability or are combined with other entities (e.g., Gargantua and Pantagruel and Romulus and Remus). Other abstract answers will likely never have Wikipedia pages (women with one leg, ways Sean Bean has died in films).
13 "Negging" refers to interrupting a question with a wrong answer; while wrong answers do happen, a response with a valid chain of reasoning should be accepted. Only poorly written questions admit multiple viable answers.

The pyramidal structure ensures that top teams use their deep knowledge and quick thinking to buzz on the very first clues, but novice teams are entertained
and learning until they get to an accessible clue. Just about everyone answers all questions (it is considered a failure of the question writer if the question "goes dead" without an answer). qb is not just used to test knowledge; it also helps players discover new information and as a result diversifies questions ("oh, I did not know the connection between the band the Monkees and correction fluid!").14 While most players will not recognize the first clue (otherwise the question would not be pyramidal), it should be interesting and connect to things the player would care about. For example, in our Magic Flute question, we learn that the librettist appeared in the premiere, a neat bit of trivia that we can tuck away once we learn the answer.

These norms have established qb questions as a framework to both test and educate human players. Our thesis is that these same properties can also train and evaluate machine question answering systems. Next, we highlight the nlp and ml challenges in qb.

5.2.4 Quizbowl for Natural Language Processing Research

We return to Figure 5.1, which exemplifies nlp challenges common to many qb questions. We already discussed pyramidality: each sentence uniquely identifies the answer, but each is easier than the last. The most knowledgeable player answers earliest and "wins" the question. But what makes the question difficult apart from obscurity (Boyce-Jacino and DeDeo, 2018)? Answering questions early is significantly easier if machines can resolve coreference (Ng, 2010) and entity linking (Shen et al., 2015).

First, the computer should recognize "the librettist" as Schikaneder, whose name never appears in the question. This special case of entity linking to knowledge bases is sometimes called Wikification (Cheng and Roth, 2013; Roth et al., 2014). The computer must recognize that "the librettist" refers to a specific person (mention detection), recognize that it is relevant to the question, and then connect it to a knowledge base (entity linking).

In addition to linking to entities outside the question, another challenge is connecting coreferences within a question. The interplay between coreference and question answering is well known (Stuckardt, 2003), but Guha et al. (2015) argue that qb coreference is particularly challenging: referring expressions are longer and oblique, world knowledge is needed, and entities are named after other referring expressions. Take the character Tamino (Figure 5.1): while he is eventually mentioned by name, it is not until after he has been referred to obliquely ("a man who claims to have killed a serpent"). The character Papageno (portrayed by Schikaneder) is even worse; while referred to twice ("character who asks for a glass of wine", "That character"), Papageno is never mentioned by name. To fully solve the question, a model may have to solve a difficult coreference problem and link the reference to Papageno and Schikaneder.

14 Bette Nesmith Graham, the mother of Monkees band member Michael Nesmith, invented correction fluid in 1956.

These inferences, like in the clue about "the librettist", are often called higher-order reasoning since they require creating and combining inference rules to derive conclusions about multiple pieces of information (Lin and Pantel, 2001).
Questions that require only a single lookup in a knowledge base or a single ir query are uninteresting for both humans and computers; thus, they are shunned for qb lead-in clues. Indeed, the first sentences in qb questions are the most difficult clues for humans and computers because they often incorporate surprising, quirky relationships that require skill and reasoning to recognize and disentangle. Interest in multi-hop question answering led to the creation of WikiHop through templates (Welbl et al., 2018) and HotPotQA through crowdsourcing (Yang et al., 2018b). In contrast to these artificially created or crowdsourced datasets, qb questions focus on links that experts view as relevant and important.

Finally, even the final clue (called a "giveaway" because it is so easy for humans) could pose issues for a computer. Connecting "enchanted woodwind instrument" to The Magic Flute requires solving wordplay. While not all questions have all of these features, these features are typical of qb questions and showcase their richness.

Crowdsourced datasets like OpenBookQA (Mihaylov et al., 2018) and CommonSenseQA (Talmor et al., 2019) have artifacts that algorithms can game (Geva et al., 2019): they find the right answer for silly reasons. For example, models can answer correctly with just a handful of words from a squad question (Feng et al., 2018), with none of the words of a bAbI question (Kaushik and Lipton, 2018), or without the image in a question about an image (Goyal et al., 2017). Although the qanta dataset and other "naturally occurring" data likely do contain machine-exploitable patterns, they do not face the same quality issues, since the author's motivation is intrinsic: to write an entertaining and educational question, as in qb.

5.2.5 Quizbowl for Machine Learning Research

While answering questions showcases the nlp challenges, deciding when to answer showcases the ml challenges related to decision theory (Raiffa, 1968). As in games like poker (Brown and Sandholm, 2019), qb players have incomplete information: they do not know when their opponent will answer, do not know what clues will be revealed next, or whether they will know the next clues. In our buzzer model, the qa model output is but one piece of information used to make the decision, under uncertainty, of when to buzz in. Since a decision must be made at every time step (word), we call this an incremental classification task.

We formalize the incremental classification task as a Markov Decision Process (Zubek and Dietterich, 2002, mdp). The actions in this mdp correspond to what a player can do in a real game: click the buzzer and provide their current best answer, or wait (one more word) for more information. The non-terminal states in the state space are parameterized by the text of the question revealed up to the current time step, the player's current best guess, and which player (if any) has already buzzed incorrectly. Rewards are only given at terminal states, and transitions to those states are determined by which player correctly answered first. Additionally, we treat the opponent as a component of the environment as opposed to another agent in the game.15 This task (the buzzing task) has connections to work in model confidence calibration offline (Yu et al., 2011; Nguyen and O'Connor, 2015) as well as online (Kuleshov and Ermon, 2017), cost-sensitive learning (Elkan, 2001), acquisition of features with a budget (Lizotte et al., 2003), and incremental classification (Melville et al., 2005).
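To make this formalization concrete, the sketch below steps through a question word by word and, at each step, chooses between the wait and buzz actions given the guesser's current best guess. It is a minimal illustration rather than the system described later in this chapter; the guesser callable, the BuzzState fields, and the fixed-threshold policy are illustrative placeholders (the learned buzzer of Section 5.6 replaces the threshold with a classifier over confidence, positional, and gameplay features).

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class BuzzState:
    """Non-terminal state of the buzzing MDP: words revealed so far, the
    guesser's current best guess and its score, and whether the opponent
    has already buzzed incorrectly."""
    revealed: List[str]
    best_guess: Optional[str]
    guess_score: float
    opponent_buzzed_wrong: bool

def threshold_policy(state: BuzzState, threshold: float = 0.95) -> bool:
    """Simplest baseline buzzer: take the buzz action once the guesser's
    confidence clears a fixed threshold."""
    return state.guess_score >= threshold

def play_question(words: List[str],
                  guesser: Callable[[str], Tuple[str, float]],
                  buzz_policy: Callable[[BuzzState], bool]) -> Tuple[Optional[str], int]:
    """Reveal the question word by word; return the answer given when the
    policy buzzes (or the final guess if it never does) and the position."""
    revealed: List[str] = []
    guess: Optional[str] = None
    for position, word in enumerate(words):
        revealed.append(word)
        guess, score = guesser(" ".join(revealed))   # top guess and its confidence
        state = BuzzState(revealed, guess, score, opponent_buzzed_wrong=False)
        if buzz_policy(state):                        # buzz action: answer now
            return guess, position
    return guess, len(words) - 1                      # forced to answer at the end

Replaying a human gameplay record against such a policy then amounts to comparing the position returned here with the word at which the human buzzed.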
For humans, effective qb play involves maintaining a correctness estimate of their best answer, weighing the costs and benefits of answering now versus waiting, and making buzzing decisions from this information. Naively, one might assume that model calibration is as simple as examining the probability output by the (neural) qa system, but neural models are often especially poorly calibrated (Guo et al., 2017), and calibrations often fail to generalize to out-of-domain test data (Kamath et al., 2020). Since qb training data spans many years, models must also contend with domain shift (Ovadia et al., 2019).

Model calibration is naturally related to deciding when to buzz, also known as answer triggering in qa and information retrieval (Voorhees, 2001; Yang et al., 2015). Unlike standard answer triggering, though, in qb the expected costs and benefits are continually changing. Specifically, there are costs for obtaining new information (seeing more words) and costs for misclassifications (guessing incorrectly or waiting too long). This parallels the setting where doctors iteratively conduct medical tests until they are confident in a patient's diagnosis (Zubek and Dietterich, 2002; Chai et al., 2004). Although this can be framed as reinforcement learning, we instead frame buzzing in Section 5.6 as incremental classification as in Trapeznikov and Saligrama (2013). In this framing, a binary classifier at each time step determines when to stop obtaining new information and render the decision of the underlying (qa) model. As Trapeznikov and Saligrama (2013) note, evaluation in this scenario is conceptually simple: compare the costs incurred to the benefits gained.

Evaluation   We evaluate the performance of our systems through a combination of standalone comparisons (Section 5.7.1) and simulated qb matches (Section 5.7.3). For standalone evaluation, we incrementally feed systems new words and record their responses. We then calculate accuracy for each position in the question (e.g., after the first sentence, halfway through the question, and at the end). While standalone evaluations are useful for developing systems, the best way to compare systems and humans is with evaluations that mimic qb tournaments.

A recurring theme is our mutually beneficial collaboration with the qb community: hosting outreach exhibitions (Section 5.8), annotating data, playing with and against our systems (Section 5.10.2), and collecting the qanta dataset. This community created this rigorous format for question answering over decades and continues to help us understand and measure the question answering abilities of machines.

15 This is not precisely true in our live exhibition matches; although we treat the opponent as part of the environment, our human opponents do not, and they usually adapt to how our system plays. For instance, it initially had difficulty with pop culture questions.

5.3 The QANTA Dataset

This section describes the qanta dataset from the qb community (§5.3.1). The over 100,000 human-authored, English questions from qb trivia tournaments (§5.3.2) allow systems to learn what to answer. More uniquely, 3.9 million filtered records of humans playing qb online (§5.3.3) allow systems to learn when to "buzz in" against opponents (§5.4).

5.3.1 Sources of Quizbowl Questions

The qb community maintains and curates several public databases of questions spanning 1997 to today.16 On average, 10,000 questions are written every year. Our dataset has 119,247 questions with over 650 thousand sentences and 11.4 million tokens.
To help players practice and to build a dataset showing how humans play, we built the first website for playing qb online (Figure 5.3a). After initial popularity, we shut down the site; however, enterprising members of the qb community resurrected and improved the application. 89,477 players used the successor (Figure 5.3b) and have practiced 5.1 million times on 131,075 unique questions. A filtered17 subset of 3.9 million player records forms the second component of our dataset, which we call gameplay data.

5.3.2 QANTA Questions

Table 5.1 compares qa datasets written by humans. Because each qb sentence often has enough information for players to answer, each qanta instance can be broken into four to six pseudo sentence-answer pairs. Although our dataset does not have the most questions, it is significantly larger in the number of sentences and tokens. In addition to qanta having more sentences, questions are longer (Figure 5.4), especially compared to crowdsourced datasets. As a side effect of both being longer and not crowdsourced, qb sentences are syntactically complex and topically diverse (Figure 5.5).

Topical Diversity of Questions   Creating diverse datasets is a shared goal between researchers developing nlp resources and organizers of qb tournaments. qb questions are syntactically diverse with dense coreference (Guha et al., 2015) and cover a wide range of topics. Diversity takes the form of questions that reflect the topical, temporal, and geographical breadth of a classical liberal education. For example, the Academic Competition Federation mandates that literature cover American, British, European, and world literature (Vinokurov et al., 2014). Moreover, authors must "vary questions across time periods", with no more than one post-1990 literature question, and questions must "span a variety of answers such as authors, novels, poems, criticism, essays, etc." There are similarly detailed proscriptions for the rest of the distribution. Figure 5.5 shows the category and sub-category distribution over areas such as history, literature, science, and fine arts.

16 Questions were obtained (with permission) from http://quizdb.org and http://protobowl.com.
17 We include only a player's first play on a question and exclude players with fewer than twenty questions.

[Figure 5.3: screenshots of two web interfaces for playing qb online, each revealing a question word by word with controls for answering.]
(a) Our 2012 interface was the first way to play qb online.
(b) The qb interface for collecting most of our gameplay records. It improved over our own through features like real-time competitive play and chatrooms.
Figure 5.3: Our interface and a popular modern interface for playing qb online. Both interfaces reveal questions word-by-word until a player interrupts the system and makes a guess.
Taken together, qb is a topically diverse dataset across broad categories and finer-grained sub-categories. This diversity contrasts with a sample of 150 questions from NaturalQuestions (Kwiatkowski et al., 2019),18 which indicates that questions are predominantly about Pop Culture (40%), History (19%), and Science (15%); see Appendix C.2 for complete results. This emphasizes that to do well, players and systems need to have both breadth and depth of knowledge.

18 The authors annotated 150 questions from the development set using the same categories as qb.

Dataset                                     QA Pairs            Tokens
SimpleQuestions (Bordes et al., 2015)       100K                .614M
triviaqa (Joshi et al., 2017)               95K                 1.21M
squad 1.0 (Rajpurkar et al., 2016)          100K                .988M
searchqa (Jeopardy!) (Dunn et al., 2017)    216K                4.08M
qanta 2012 (Boyd-Graber et al., 2012)       47,610 / 7,949      1,073,085
qanta 2014 (Iyyer et al., 2014)             163,667 / 30,658    4,009,059
qanta 2018 (This Work)                      650K / 120K         11.4M

Table 5.1: The qanta dataset is larger than most question answering datasets in qa pairs (120K). However, for most qb instances each sentence in a question can be considered a qa pair, so the true size of the dataset is closer to 650K qa pairs. In Section 5.5, using sentence-level qa pairs for training greatly improves model accuracy. The qanta dataset has more tokens than all other qa datasets. Statistics for qanta 2012 and 2014 only include publicly available data.

Answer Entity Type Diversity   qb questions are also diverse in the kinds of entities that appear as answers (25K entities in the training data). A dataset which is topically diverse but only asks about people is not ideal. Using the Wikidata knowledge graph, we obtain the type of each answer and plot frequencies in Figure 5.6. Most questions ask about people (human), but with a broad diversity among other types. These two breakdowns show that qb is topically and answer-wise diverse. To qb aficionados this is unsurprising; the primary educational goal of qb is to encourage students to improve their mastery over wide ranges of knowledge. We now turn to details about the gameplay dataset.

5.3.3 Gameplay Records: Recording Humans Playing Quizbowl Online

Like the 2002 trec qa track (Voorhees, 2004), squad 2.0 (Rajpurkar et al., 2018), and nq (Kwiatkowski et al., 2019), deciding when not to answer is crucial to playing qb. Unlike these tasks, though, deciding when to answer is not just model calibration or triggering but should also reflect the opponent's behavior (Billings et al., 1998). To address this, we use gameplay data (Table 5.2), which contain records of quizbowlers playing questions from prior tournaments: words in each question were revealed one by one until the player guessed the question's answer. We use these records as (1) training data so that models can learn to imitate an oracle buzzing policy (Coates et al., 2008; Ross and Bagnell, 2010; Ross et al., 2011) and (2) human baselines for offline evaluations (§5.7).

Like Mandel et al. (2014), gameplay records simulate humans both for training and for evaluating policies. To simulate play against a human, we see which agent (human or machine) first switches from the wait action to the buzz action. For example, in Table 5.2 the user correctly guessed "Atlanta" at word forty-seven. If an agent played against this player, they would need to answer correctly before word
forty-seven to win. In all but one outcome, replaying the human record exactly recreates a live face-off. When a machine incorrectly buzzes first, we lack what the human would ultimately guess, so we assume their guess would have been correct, since skilled players almost always answer correctly by the end of the question.

[Figure 5.4: histograms of question length in words for qb (sentence), Jeopardy!, TriviaQA, SQuAD, and SimpleQuestions; the y-axis is the number of questions (thousands).]
Figure 5.4: Size of question answering datasets. Questions in the qanta dataset have longer sentences than any other dataset. The instances from SimpleQuestions, SQuAD, and triviaqa are comparatively short, which makes it less likely that they are as diverse as qb or Jeopardy!. For each dataset, we compare the lengths of questions rather than paired context paragraphs; to avoid the histogram being overly skewed, we remove the top 5% of examples by length from each dataset.

During training, these data help agents learn optimal buzzing policies based on their own uncertainty, the questions, and their opponents' history (He et al., 2016b).19 With this data, we compute how models would fare against human players individually, against players partitioned by skill, and in expectation (§5.7.1). In contrast to this strategy, crowdsourced tasks (e.g., squad) often use the accuracy of a single annotator to represent human performance, but this is problematic as it collapses the distribution of human ability to a single crowd-worker and does not accurately reflect a task's upper bound compared to multiple annotations (Nangia and Bowman, 2019; Kwiatkowski et al., 2019). In the gameplay data, we have ample data with which to robustly estimate average and sub-group human skill; for example, 90,611 of the 131,075 questions have been played at least five times. This wealth of gameplay data is one aspect of qb's strength for comparing humans and machines.

19 This work significantly expands the number of player-question records. We also make the setting significantly harder by not restricting questions to only the most frequently asked-about answers (1K versus 24K). Finally, we create a new evaluation procedure (§5.7.1) that better estimates how models fare in the real world versus human players. The first version of the gameplay dataset and models was introduced in: He He, Jordan Boyd-Graber, and Hal Daumé III. Opponent Modeling in Deep Reinforcement Learning. International Conference on Machine Learning, 2016.

[Figure 5.5: distribution of qb questions over categories (e.g., history, literature, science, fine arts, religion, mythology, philosophy, social science, geography, current events, pop culture) and their sub-categories.]
Figure 5.5: Questions in qb cover most if not all academic topics taught in school, such as history, literature, science, the fine arts, and social sciences. Even within a single category, questions cover a range of topics. Topically, the dataset is biased towards American and European topics in literature and history.

An additional aspect unique to trivia games is that participants are intrinsically motivated experts. Compensation (i.e., extrinsic motivation) in crowdsourcing is notoriously difficult. If they feel underpaid, workers do not give their best effort (Gneezy and Rustichini, 2000), and increasing pay does not always translate to quality (Mason and Watts, 2009).
In light of this, Mason and Watts (2009) recommend intrinsic motivation, a proven motivator for annotating images (von Ahn and Dabbish, 2004) and protein folding (Cooper et al., 2010). Second, although multiple non-expert annotations can approach gold standard annotation, experts are better participants when available (Snow et al., 2008). Thus, other tasks may understate human performance with crowdworkers lacking proper incentives or skills.

Good quizbowlers are both accurate and quick. To measure skill, we compute and plot in Figure 5.7 the joint distribution of average player accuracy and buzzing position (percent of the question revealed). The ideal player would have a low average buzzing position (early guesser) and high accuracy; thus, the best players reside in the upper left region. On average, players buzz with 65% of the question shown and achieve about 60% accuracy (Figure 5.7). Although there are other factoid qa and, more specifically, trivia datasets, qb is the first and only dataset with a large collection of gameplay records, which allows us to train models and run offline benchmarks.

5.3.4 Preprocessing

Before moving to model development, we describe the preprocessing necessary to eliminate answer ambiguity, pair questions to gameplay data, and create dataset folds that enable independent yet coordinated training of distinct guessing and buzzing models. Preprocessing is covered in significantly more detail in Appendix C.1.

[Figure 5.6: distribution of Wikidata answer types (e.g., human, literary work, film, river, dynasty, treaty) broken down by qb category, with a NOMATCH bucket for unmatched answers.]
Figure 5.6: Distribution of wikidata.org answer types ("instance of" relation) further broken down by category. Most answers have matching types and reference a person, literary work, or geographic entity. Among these types, there is a good balance of answers spread across literature, history, fine arts, and science. Answer types with only one category are largely self-explanatory (e.g., mythological answer types map to the mythology category). The special category "NOMATCH" holds answers without a matched type, and similar types are merged into larger categories.

Matching QB Answers to Wikipedia Pages   Throughout this work, we frame qb as a classification task over the set of Wikipedia page entities (§5.2.3), which necessarily requires pairing answers to distinct pages where they exist. We pair questions and their answers to Wikipedia pages in two steps: parsing potential answers from moderator instructions and matching to Wikipedia entities.20 In qb, "answers"
are in actuality instructions to the moderator that provide additional detail on what answers are acceptable. For example, answer strings like "Second Vatican Council [or Vatican II]" indicate to accept either surface form of the same concept. Fortunately, the vast majority of these "answer instructions" are automatically parsable due to their semi-regular structure. The second step, described further in Appendix C.1.4, matches parsed answers to pages through a combination of strict textual matching, expert-curated matching rules (e.g., only match "camp" to Camp_(style) if "style" or "kitsch" are mentioned), and expert-annotated pairings between questions and pages;21 a toy sketch of this matching process appears at the end of this subsection. In total, we paired 119,093 out of 132,849 questions with Wikipedia titles (examples in Appendix C.1.4).

20 We preprocess the English Wikipedia 4/18/2018 dump with https://github.com/attardi/wikiextractor.
21 Primarily, the authors of this article annotated the answer-to-page pairings.

Date            Thu Oct 29 2015 08:55:37 GMT-0400 (EDT)
UID             9e7f7dde8fdac32b18ed3a09d058fe85d1798fe7
QID             5476992dea23cca90550b622
Position        47
Guess           atlanta
Result          True
Question text   This Arcadian wounded a creature sent to punish Oeneus for improperly worshipping Artemis and killed the centaurs Rhaecus and Hylaeus. . .

Table 5.2: An entry from the gameplay dataset where the player correctly guesses "Atlanta" at word 47. The entry qid matches the proto_id field in the question dataset, where additional information is stored such as the source tournament and year.

Dataset Folds   The goal of the folds in the qanta dataset is to standardize the training and evaluation of models for the guessing and buzzing sub-tasks. Towards this goal, we sub-divide the qanta dataset by sub-task and standard machine learning folds (e.g., training, development, and test). We create the standard machine learning folds by partitioning the data according to tournament type and year. To increase the quality of evaluation questions, we only include questions from championship-level tournaments in the development and test folds.22 To derive the final folds, we temporally divide the data (Arlot and Celisse, 2010) so that only (championship) questions from 2015 and onward are used in evaluation folds.

The subdivision by task simultaneously addresses the issue that some questions lack gameplay data (and thus are not helpful for buzzer training) and partitions the data so that the buzzer calibrates against questions unseen during training (details in Appendix C.1.3). Table 5.3 shows the size of each sub-fold; unassigned questions correspond to those where the answer-to-page matching process failed. Finally, hundreds of new qb questions are created every year, which provides an opportunity for continually adding new training questions and replacing outdated test questions. Ultimately, this may help temper overconfidence in the generalization of models (Patel et al., 2008b), since we expect covariate shift, prior probability shift, and domain shift in the data (Quiñonero-Candela et al., 2009) as questions evolve to reflect modern events.

The qanta datasets, a copy of the Wikipedia data used, intermediate artifacts, and other related datasets are available at http://datasets.qanta.org.
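As a concrete illustration of the two-step answer matching described above, the toy sketch below parses a moderator answer line into candidate surface forms and then matches them against curated rules and normalized Wikipedia titles. It is a simplified sketch rather than our actual pipeline (Appendix C.1.4): the helper names, the bracket-splitting heuristic, and the lowercase normalization are illustrative assumptions.

import re
from typing import Dict, List, Optional

def parse_answer_line(answer_line: str) -> List[str]:
    """Split a moderator answer line such as
    'Second Vatican Council [or Vatican II]' into candidate surface forms."""
    main, *bracketed = re.split(r"[\[\]]", answer_line)
    candidates = [main]
    for part in bracketed:
        # alternate answers are usually introduced by 'or' / 'accept'
        for alt in re.split(r"\bor\b|\baccept\b|;", part):
            if alt.strip():
                candidates.append(alt)
    return [c.strip() for c in candidates if c.strip()]

def match_to_wikipedia(candidates: List[str],
                       titles: Dict[str, str],
                       curated: Dict[str, str]) -> Optional[str]:
    """Match parsed surface forms to a Wikipedia page: curated rules first,
    then strict matches on a normalized title."""
    for surface in candidates:
        key = surface.lower()
        if key in curated:                 # expert-curated overrides
            return curated[key]
        if key in titles:                  # strict (normalized) title match
            return titles[key]
    return None                            # left for manual annotation

# toy usage with a hypothetical title index
titles = {"second vatican council": "Second_Vatican_Council"}
print(match_to_wikipedia(parse_answer_line(
    "Second Vatican Council [or Vatican II]"), titles, curated={}))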
[Figure 5.7: left, a scatter plot of protobowl users by average accuracy and average buzzing position, sized and colored by number of records; right, density plots of the number of records, accuracy, and buzzing position.]
Figure 5.7: Left: each protobowl user is represented by a dot, positioned by average accuracy and buzzing position; size and color indicate the number of questions answered by each user. Right: distribution of number of questions answered, accuracy, and buzzing position of all users. An average player buzzes with 65% of the question shown and achieves about 60% accuracy.

5.4 Deciding When and What to Answer

One could imagine many machine learning models for playing qb: an end-to-end reinforcement learning model or a heavily pipelined model that determines category, answer type, and answer, and decides when to buzz. Without making any value judgment on the right answer, our approach divides the task into two subsystems: guessing and buzzing (Figure 5.8). This approach mirrors ibm Watson's23 two-model design (Ferrucci et al., 2010; Tesauro et al., 2013). The first model answers questions, and the second decides when to buzz.

Dividing a larger task into sub-tasks is common throughout machine learning, particularly when the second model makes a prediction based on the first model's prediction. For example, this design pattern is used in object detection (Girshick et al., 2014, generate bounding box candidates then classify them), entity linking (Ling et al., 2015, generate candidate mentions and then disambiguate them to knowledge base entries), and confidence estimation for automatic speech recognition systems (Kalgaonkar et al., 2015). In our factorization, guessing is based solely on question text. At each time step (word), the guessing model outputs its best guess, and the buzzing model determines whether to buzz or wait based on the guesser's confidence and features derived from the game state. This factorization cleanly reduces the guesser to question answering while framing the buzzer as a cost-sensitive confidence calibrator.

22 We use questions from acf Regionals, acf Nationals, acf Fall, pace nsc, and nasat from 2015 onward for development and test sets.
23 In Watson, the second system also determines wagers on Daily Doubles, wagers on Final Jeopardy, and chooses the next question (e.g., history for $500).

Fold            Number of Questions
train + guess   96,221
train + buzz    16,706
dev + guess     1,055
dev + buzz      1,161
test + guess    2,151
test + buzz     1,953
unassigned      13,602
All             132,849

Table 5.3: We assign each question in our dataset to either the train, development, or test fold. Questions in the development and test folds come from national championship tournaments, which typically have the highest quality questions. The development and test folds are temporally separated from the train fold to avoid leakage. Questions in each fold are assigned a "guess" or "buzz" association depending on whether they have gameplay data. Unassigned refers to questions for which we could not map their answer strings to Wikipedia titles or for which no appropriate page exists.

This division of modeling labor makes it significantly easier to train the buzzer as a learned calibrator of the guesser's softmax classifier predictions. This is crucial since the probabilities in neural softmax classifiers are unreliable (Guo et al., 2017). Like how we train a calibration model (buzzer) over a classifier (guesser), Corbière et al.
(2019) train a calibration model on top of an image classification model, which is a more effective approach in high-dimensional spaces compared to nearest-neighbor-based confidence measures (Jiang et al., 2018). However, not all buzzing errors are equal in severity; thus, part of the buzzer's challenge is incorporating cost-sensitive classification. By partitioning model responsibilities into separate guessing and buzzing models, we can mitigate the calibration-based drawbacks of neural softmax classifiers while naturally using gameplay data for cost-sensitive decision-making.

Machines playing qb by guessing and buzzing semi-independently is also convenient from an engineering perspective: it simplifies model training and is easier to debug. More importantly, it allows us and subsequent researchers to focus on a sub-task of their choosing or on the task as a whole. If you are interested only in question answering, focus on the guesser. If you are interested in multi-agent cooperation or confidence estimation, focus on the buzzer. Following the discussion of our guessing (§5.5) and buzzing (§5.6) systems, we describe our evaluations and results in Section 5.7.1. Section 5.8 summarizes the outcomes of our live, in-person exhibition matches against some of the best trivia players in the world.

[Figure 5.8: example input/output of the framework. After an early clue ("At its premiere, the librettist of this opera portrayed a character who asks for a glass of wine with his dying wish"), the guesser outputs Cavalleria Rusticana with score .0287 and the buzzer waits; after the giveaway ("For 10 points, name this Wolfgang Mozart opera titled for an enchanted woodwind instrument"), the guesser outputs The Magic Flute with score .997 and the buzzer buzzes.]
Figure 5.8: The qanta framework for playing Quizbowl with semi-independent guesser and buzzer models. After each word in the input is revealed, the guesser model outputs its best guesses. The buzzer uses these in combination with positional and gameplay features to decide whether to take the buzz or wait action. The guesser is trained as a question answering system that provides guesses given the input text seen so far. Buzzers take on dual roles as calibrators of the guesser confidence scores and cost-sensitive decision classifiers by using the guesser's score, positional features, and human gameplay data.

5.5 Guessing QB Answers

Guessing answers to questions is a factoid question answering task and the first step towards our models playing qb (Figure 5.8). We frame the question answering sub-task in qb as high-dimensional multi-class classification over Wikipedia entities (i.e., answers are entities defined by distinct Wikipedia pages). This section describes three families of question answering models: information retrieval models (§5.5.1), linear models (§5.5.2), and neural models (§5.5.3). Despite distinct differences, these approaches share a common structure: create a vector representation x of the input question, create a vector representation a_i for each candidate answer A_i, and then return the answer A_i corresponding to argmax_i f(x, a_i), where f is some similarity function.24

24 For brevity and clarity, we omit bias terms.

5.5.1 Explicit Pattern Matching with Information Retrieval

The first model family we discuss are traditional information retrieval (ir) models based on the vector space model (Salton et al., 1975). Vector space models are particularly effective when term overlap is a useful signal, as in factoid qb (Lewis et al., 2020a). For example, although early clues avoid keyword usage, giveaways often include terms like "Wolfgang Mozart" and "Tamino"
Consequently, our vector space ir model proves to be a strong baseline (§5.7.1).

To frame this as an ir search problem, we treat guessing as document retrieval. Input questions are search queries embedded into a tf-idf (Jones, 1972; Rajaraman and Ullman, 2011) vector x. For each answer Ai in Atrain in the qb training data, we concatenate all training questions with that answer into a document Di embedded as ai into the same vector space.[25] The textual similarity function f is Okapi bm25 (Robertson and Walker, 1994) and scores answers ai against x. During inference, we return the answer Ai of the highest scoring document Di. We implement our model using Apache Lucene and Elasticsearch (Gormley and Tong, 2015).

However, the ir model's reliance on pattern matching often fails early in the question. For example, in the first sentence from Figure 5.1 the author intentionally avoids keywords ("a character who asks for a glass of wine with his dying wish"). Purely traditional ir methods, while effective, are limited since they rely on keywords and cannot "soft match" terms semantically. Thus, we move on to machine learning methods that address some of these shortcomings.

[25] We also tested one document per training example, different values for the bm25 coefficients, and the default Lucene practical scoring function.
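Our ir guesser is implemented with Lucene/Elasticsearch; as a rough illustration of the same idea, the sketch below builds one bm25 "document" per answer and returns the highest-scoring answers. It uses the rank_bm25 package and whitespace tokenization as stand-ins, and the `train_questions` pairs are assumed example data rather than our corpus.

```python
from collections import defaultdict
from rank_bm25 import BM25Okapi  # stand-in for Lucene/Elasticsearch BM25

# Assumed input: (question_text, answer) training pairs.
train_questions = [
    ("This opera's librettist played Papageno at its premiere ...", "The_Magic_Flute"),
    ("This composer wrote Eine kleine Nachtmusik ...", "Wolfgang_Amadeus_Mozart"),
]

# Concatenate all training questions that share an answer into one document.
docs = defaultdict(list)
for text, answer in train_questions:
    docs[answer].extend(text.lower().split())

answers = list(docs)
bm25 = BM25Okapi([docs[a] for a in answers])


def ir_guess(question_text: str, top_k: int = 5):
    """Return the top_k (answer, BM25 score) pairs for the text seen so far."""
    scores = bm25.get_scores(question_text.lower().split())
    ranked = sorted(zip(answers, scores), key=lambda pair: -pair[1])
    return ranked[:top_k]
```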
5.5.2 Trainable Pattern Matching with Linear Models

In addition to the ir model, we also test a linear model baseline that reduces multi-class classification to one-versus-all binary classification. While an ir model derives term weights from corpus statistics and a hand-crafted weighting scheme, a one-versus-all linear model with one-hot term features x finds term weights that maximize the probability of the correct binary prediction for each answer. The input features x are derived from a combination of sparse n-grams and skip-grams.[26] Since the number of classes is too high for standard one-versus-all multi-class classification,[27] we instead use a logarithmic time one-versus-all model (Agarwal et al., 2014; Daumé et al., 2017). However, this model is limited: it only considers linear relationships between n-gram terms, it uses (at best) local word order, and its sparse representation does not take advantage of the distributional hypothesis (Harris, 1954). Next we describe neural models that use more sophisticated forms of representation and composition to address these shortcomings.

[26] The order of n-grams and skip-grams was determined by hyperparameter search.
[27] There are approximately 25,000 distinct answers.

5.5.3 Neural Network Models

The final family of methods we consider for qb question answering are neural methods. We describe the shared components of the neural models (e.g., general architectures and training details) and compare their composition functions.

In our model (Figure 5.9), we follow a widely used architecture in nlp: embed words independently in a vector space, contextualize their representations, temporally reduce the representations, and then classify with a softmax layer (Collobert and Weston, 2008). The first component of the model embeds question q with k tokens into m-dimensional representations w = [w_1, ..., w_k]. Next, a function c(·) : R^{k×m} → R^{k×l} contextualizes words as l-dimensional embeddings v = [v_1, ..., v_k] = c(w). Since this is still a variable-length sequence of representations and the classifier requires a fixed-size representation, we use a reducer r(·) : R^{k×l} → R^n to derive an n-dimensional dense feature vector x = r(v). We call specific pairs of contextualizers and reducers composition functions (§2.4.3).

[Figure 5.9 diagram: input text → word embeddings (w_1, ..., w_k) → composition function (dan, rnn, cnn, ...) → fixed-size representation h → classifier (linear + softmax) → guess.]
Figure 5.9: All our neural models feed their input to an embedding function, then a composition function, and finally a classification function. The primary variation across our models is the choice of composition function used to compute a fixed, example-level representation from its variable-length input.

The final model component, the classifier, computes logit scores s_i = x^T a_i as the dot product between the features x and trainable answer embeddings a_i. From this, we use the softmax to compute a probability distribution over answers,
    p = softmax(s) = exp(s) / Σ_{i=1}^{k} exp(s_i),    (5.1)
and train the model with the cross-entropy loss
    L = −Σ_{i=1}^{k} y_i log(p̂_i),    (5.2)
where y_i = 1 for the true answer and y_i = 0 otherwise. In our experiments, we evaluate three classes of composition functions (i.e., contextualizer-reducer pairs): unordered composition with deep averaging networks (Iyyer et al., 2015), recurrent network-based composition (Elman, 1990; Hochreiter and Schmidhuber, 1997; Palangi et al., 2016; Cho et al., 2014a), and transformer-based composition (Vaswani et al., 2017; Devlin et al., 2018).

Unordered Composition with Deep Averaging Networks   Our first (unordered) neural composition function is the deep averaging network (dan). We introduced dans as a simple, effective, and efficient method for qb question answering.[28] Despite their disregard of word order, dans are competitive with more sophisticated models on classification tasks such as sentiment analysis (Iyyer et al., 2015). Although there are cases where word order and syntax matter, many questions are answerable using only key phrases. For example, predicting the most likely answer to the bag of words "inventor, relativity, special, general" is easy; they are strongly associated with Albert Einstein.

All composition functions, dans included, are fully described by the choice of contextualizer and reducer. In dans, the contextualizer c is the identity function, and the reducer is broken into two components. First, the dan averages the word embeddings v to create an initial hidden state
    h_0 = (1/k) Σ_{i=1}^{k} v_i.    (5.3)
The final fixed-size representation x = h_z is computed with z feed-forward layers through the recurrence
    h_i = gelu(W_i h_{i−1} + b_i),    (5.4)
where W_i and b_i are parameters of the model and gelu is the Gaussian Error Linear Unit (Hendrycks and Gimpel, 2017b). Although dans are not the most accurate model, they are an attractive trade-off between accuracy and computation cost.

[28] This work has new experiments comparing new composition functions and focuses on incorporating additional data. The dan was first introduced in: Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. Deep Unordered Composition Rivals Syntactic Methods for Text Classification. Association for Computational Linguistics, 2015.
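A minimal PyTorch sketch of the dan composition described above (identity contextualizer, average-then-feed-forward reducer, linear softmax classifier). The dimensions, layer count, and omission of dropout and batch normalization are simplifications for illustration; this is not our training code.

```python
import torch
import torch.nn as nn


class DanGuesser(nn.Module):
    """Deep averaging network: average embeddings, then z feed-forward layers."""

    def __init__(self, vocab_size: int, n_answers: int,
                 emb_dim: int = 300, hidden_dim: int = 300, z: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        layers, in_dim = [], emb_dim
        for _ in range(z):
            layers += [nn.Linear(in_dim, hidden_dim), nn.GELU()]  # Eq. 5.4
            in_dim = hidden_dim
        self.reducer = nn.Sequential(*layers)
        self.classifier = nn.Linear(hidden_dim, n_answers)  # logits s_i = x . a_i

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, k) padded word indices
        v = self.embed(token_ids)                              # (batch, k, emb_dim)
        mask = (token_ids != 0).float().unsqueeze(-1)
        h0 = (v * mask).sum(1) / mask.sum(1).clamp(min=1)      # Eq. 5.3 average
        x = self.reducer(h0)                                   # fixed-size representation
        return self.classifier(x)                              # cross-entropy training (Eq. 5.2)


# Example forward/backward pass with random data (shapes only).
model = DanGuesser(vocab_size=10_000, n_answers=25_000)
tokens = torch.randint(1, 10_000, (8, 40))
labels = torch.randint(0, 25_000, (8,))
loss = nn.functional.cross_entropy(model(tokens), labels)
```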
Ordered Composition   In contrast to dans, order-aware models like rnns, lstms, and grus can model long-range dependencies in supervised tasks (Linzen et al., 2016). Since all these models belong to the family of recurrent models, we choose one variant to describe in terms of its associated contextualizer and reducer.[29] In our model, the composition function
    c(v) = gru(v)    (5.5)
is a multi-layer, bi-directional gru (Cho et al., 2014a). The reducer
    r(v) = [v_k^(forward) ; v_0^(backward)]    (5.6)
concatenates the final layer's forward and backward hidden states. Combined, this forms the first ordered composition we test.

Transformer models, however, better represent context at the cost of complexity (Vaswani et al., 2017; Devlin et al., 2018). Specifically, we input the cls token, question, and sep token to uncased bert-base. Thus, the contextualizer
    c(v) = bert(v)    (5.7)
is simply bert, and the reducer
    r(v) = (1/k) Σ_{i=1}^{k} v_i    (5.8)
is the average of the output states from the final layer associated with the question's wordpiece tokens.[30] Next, we move on from model descriptions to training specifics.

[29] Hyperparameter optimization indicated that gru networks were the most accurate recurrent model.
[30] We also tested using the cls token, with worse results.

Training Details   In standard qa tasks, training over full questions is standard, but with qb's incremental setup this results in less accurate predictions. If the example is the complete text, the model ignores difficult clues and focuses on the "easy" part of the question, preventing learning from "hard" clues. Instead of each training example being one question, we use each of a question's sentences as a single training example. While this training scheme carries the downside that models may not learn long-range dependencies across sentences, the empirical accuracy improvement outweighs the disadvantages. In addition to these two approaches, we also tested variable-length training, but did not observe an improvement over the sentence-based training scheme.[31]

In non-transformer models we use 300-dimensional word embeddings initialized with glove for words in the vocabulary and randomly initialized embeddings otherwise.[32] We regularize these models with dropout (Srivastava et al., 2014) and batch normalization (Ioffe and Szegedy, 2015). Loss functions were optimized with adam (Kingma and Ba, 2015), and models were trained with early stopping and learning rate annealing. All neural models were implemented in PyTorch (Paszke et al., 2019) and AllenNLP (Gardner et al., 2018). We optimize hyperparameters by running each setting once and recording the parameter settings corresponding to the top development set accuracy. The models with the best parameters are run an additional five times to create estimates of variance for each tracked metric (§5.7.1).

[31] Variable-length training creates k training examples from a question comprised of k sentences. Each example includes the text from the start position up to and including sentence k.
[32] Randomly initialized embeddings use a normal distribution with mean zero and standard deviation one.

Although not exhaustive, these models are strong baselines for the question answering component of qb. Section 5.10 identifies areas for future modeling work; throughout the rest of this work, however, we focus on completing a description of our approach to playing qb by combining guessers and buzzers (§5.6). Following this, we describe how we evaluate these systems independently (§5.7.1), jointly (§5.7.3), offline, and in live events (§5.8).

5.6 Buzzing

Winning in qb requires answering accurately with as little information as possible. It is crucial, for humans and computers alike, to accurately measure
94 their confidence and buzz as early as possible without being overly aggressive. The first part of our system, the guesser, optimizes for guessing accuracy; the second part, the buzzer, focuses on deciding when to buzz. Since questions are revealed word-by-word the buzzer makes a binary decision at each word: buzz and answer with the current best guess, or wait for more clues. The outcome of this action depends on the answers from both our guesser and the opponent.33 To make this clear, we review the game?s mechanics. If we buzz with the correct answer before the opponent can do so, we win 10 points; but if we buzz with an incorrect answer, we lose 5 points immediately, and since we cannot buzz again, the opponent can wait till the end of the question to answer, which might cost us 10 extra points in the competition. Thus, it should be clear that ?a wrong answer in Quizbowl is extraordinarily costly? (Jennings, 2006, p. 93). Before we discuss our strategy to buzzing, consider a buzzer with perfect knowledge of whether the guesser is correct or not, but does not know anything about the opponent: a locally optimal buzzer. This buzzer would buzz as soon as the guesser gets the answer correct. A stronger buzzer exists: an omnipotent buzzer with perfect knowledge of what the opponent will do; it would exploit the opponent?s weaknesses: delay buzzing whenever an opponent might err. The agent would then get a higher relative reward: once from the opponent?s mistake and then for getting it correct. The buzzer we develop in this chapter targets a locally optimal strategy: we focus on predicting the correctness of the guesser and do not model the opponent. This buzzer is effective: it both defeats players in our gameplay dataset (?5.3.3) and playing against real human players (?5.8). The opponent modeling extension has been explored by previous work, and we discuss it in Section 5.9. 5.6.1 A Classification Approach to Buzzing Given the initial formulation of buzzing as a mdp (?5.2.5), it would be natural to learn the task with reinforcement learning using the final score; however, we instead use a convenient reduction to binary classification. Since we can compute the optimal buzzing position easily as opposed to with expensive rollouts, we can reduce the problem to classification (Lagoudakis and Parr, 2003). At each time step, the model looks at the sequence of guesses that the guesser has generated so far, and makes a binary decision of whether to buzz or to wait. Under the locally optimal assumption, the ground truth action at each time step equals the correctness of the top guess: it should buzz if and only if the current top guess is correct. Another view of this process is that the buzzer is learning to imitate the oracle buzzing policy from the ground truth actions (Coates et al., 2008; Ross and Bagnell, 2010; Ross et al., 2011). Alternatively, the buzzer can also be seen as an uncertainty estimator (Hendrycks and Gimpel, 2017a) of the guesser. The guesses create a distribution over all possible answers. If this distribution 33We use point values from the typical American format of the game. The exact values are unimportant, as they change the particulars of strategy but not the approach. 95 faithfully reflects the uncertainty of guesses, the buzzer could be a simple ?if?then? rule: buzz as soon as the guesser probability for any guess gets over a certain threshold. This threshold system is our first baseline, and we tune the threshold value on a held-out dataset. 
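The threshold rule just described is simple to implement; the sketch below shows one way to do so and to tune the threshold on held-out data. The data layout (per-position top-guess probabilities with correctness labels) and the tuning objective (maximize correct buzzes) are assumptions for illustration, not our exact setup.

```python
from typing import List, Tuple

# Each held-out question: per-position (top_guess_probability, top_guess_is_correct).
Question = List[Tuple[float, bool]]


def threshold_buzz_position(question: Question, tau: float) -> int:
    """Return the first position whose top-guess probability exceeds tau, or -1."""
    for position, (prob, _correct) in enumerate(question):
        if prob > tau:
            return position
    return -1  # never buzz


def tune_threshold(dev_questions: List[Question], candidates=None) -> float:
    """Pick the threshold that most often buzzes on a correct top guess."""
    candidates = candidates or [i / 100 for i in range(50, 100)]

    def correct_buzzes(tau: float) -> int:
        wins = 0
        for q in dev_questions:
            pos = threshold_buzz_position(q, tau)
            if pos >= 0 and q[pos][1]:  # buzzed, and the guess was correct
                wins += 1
        return wins

    return max(candidates, key=correct_buzzes)
```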
However, this does not work because the confidence of neural models is ill- calibrated (Guo et al., 2017; Feng et al., 2018) and worsens with domain shift (Ka- math et al., 2020). Our neural network guesser often outputs a long tail distribution over answers concentrated on the top few guesses, and the confidence score of the top guess is often higher than the actual uncertainty (the chance of being correct). To counter these issues, we extract features from the top ten guesser scores and train a classifier on top of them. Some important features include a normalized version of the top ten scores and the gap between them (Appendix C.1.5 lists all features). There is also important temporal information; for example, if the guesser?s top prediction?s score steadily increases, this signals the guesser is certain about the top guess. Conversely, a fluctuating top prediction (the answer is Hope Diamond. . . no, I mean Parasaurolophus. . . no, I mean Tennis Court Oath) is a sign that perhaps the guesser is not that confident (regardless of the ostensible score). To capture this, we compare the current guesser scores with the previous time steps and extract features such as the change in the score associated with the current best guess, and whether the ranking of the current top guess changed in this time step. To summarize, at each time step, we extract a feature vector, including current and temporal features, from the sequence of guesses generated by the guesser so far. We implement the classifier with both fully connected Multi-layer Perceptron (mlp) and with Recurrent Neural Network (rnn). The classifier outputs a score between zero and one indicating the estimated probability of buzzing. Following the locally optimal assumption, we use the correctness of the top guess as ground truth action: buzz if correct and wait if otherwise We train the classifier with logistic regression; during testing, we buzz as soon as the buzzer outputs a score greater than 0.5. Both models are implemented in Chainer (Tokui et al., 2015); we use hidden size of 100, and lstm as the recurrent architecture. We train the buzzer on the ?buzzertrain? fold of the dataset, which does not overlap with the training set of the guesser, for twenty epochs with the Adam optimizer (Kingma and Ba, 2015). Both buzzers have test accuracy of above 80%, however, the classification accuracy does not directly translate into the buzzer?s performance as part of the pipeline, which we look at next. 5.7 Offline Evaluation A core idea of this chapter is that the construction of qb questions lends itself to a fairer evaluation of both humans and machine qa models: to see who is better at answering questions, see who can answer the question first. However, this is often impractical during model development, especially if the questions are ?new? (they have not been played by humans or computers). Moreover, a researcher might be uninterested in solving the buzzing problem. Offline evaluations where the 96 guesser and buzzer are evaluated independently with static data strikes a balance between ease of model development and faithfulness to qb?s format. To address this, Section 5.7.1 describes the metrics to compare offline model accuracy. Following an error analysis (?5.7.2), Section 5.7.3 evaluates buzzing models by replacing this oracle buzzer with trained buzzing models. 
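Before turning to offline evaluation, the sketch below illustrates the kind of features the trained buzzer consumes (§5.6.1): normalized top-ten scores, gaps between adjacent guesses, and temporal changes across time steps. The feature names and data layout here are illustrative; the full feature set is listed in Appendix C.1.5.

```python
import numpy as np
from typing import List, Tuple

# Each time step: the guesser's top-10 (answer, score) pairs, best first.
Step = List[Tuple[str, float]]


def buzzer_features(history: List[Step]) -> np.ndarray:
    """Illustrative features from the current and previous guesser outputs."""
    answers, scores = zip(*history[-1])
    scores = np.array(scores, dtype=float)
    feats = [
        scores / (np.abs(scores).sum() + 1e-8),  # normalized top-10 scores
        scores[:-1] - scores[1:],                # gaps between adjacent guesses
    ]
    if len(history) > 1:
        prev_answers, prev_scores = zip(*history[-2])
        feats.append([scores[0] - prev_scores[0]])            # change in the top score
        feats.append([float(answers[0] != prev_answers[0])])  # did the top guess change?
    else:
        feats.append([0.0])
        feats.append([0.0])
    return np.concatenate([np.asarray(f, dtype=float).ravel() for f in feats])
```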
5.7.1 Evaluating the Guesser

Ideally, we would compare systems in a head-to-head competition where the model (or human) who correctly buzzed and answered the most questions would win (§5.8). However, this involves live play, necessitates a buzzing strategy, and complicates evaluation of the guesser in isolation. Intuitively though, a model that consistently buzzes correctly earlier in the question is better than a model that buzzes late in the question. In our evaluations, we use three metrics that reflect this intuition: accuracy early in the question, accuracy late in the question, and the expected probability of beating a human assuming an optimal buzzing strategy.

Accuracy-Based Evaluation   The easiest and most common method for evaluating closed-domain question answering methods is accuracy over all questions in the test set. We report two variants: (1) accuracy using the first sentence and (2) accuracy using the full question. While it is possible to answer some questions during the first sentence, it is the first and hardest position at which we can guarantee a question could be answered. Although we report accuracy on full questions, this metric is a minimum bar: the last clues are intentionally easy (§5.2.3). However, while start-of-question and end-of-question accuracy help development and comparison with other qa tasks, they are silent on human-computer comparison. We address this shortcoming next.

Expected Probability of Defeating Human Players   While comparing when systems buzz is the gold standard, we lack gameplay records for all test set questions, and it is unreasonable to assume they are easy to obtain. Instead, we marginalize over empirical human gameplay to estimate the probability that a human would have correctly answered a question by position t. Then, we combine this with model predictions and marginalize over t to obtain the expected probability of winning against an average player on an average gameplay question. A similar idea, computing the expected probability of winning a heads-up match, has also been used in machine translation (Bojar et al., 2013).

We compute the expected probability of winning (ew) in two steps. First, we compute the proportion of players
    φ(t) = N_t / N    (5.9)
that have answered a question correctly by position t, where N is the total number of question-player records and N_t is the number of question-player records where the player answered correctly by position t.

[Figure 5.10 plot: empirical probability of winning versus position in the question (%); the legend lists the expected wins curve score, the average of the most-played questions, and the ten most-played questions in the buzztest fold (166 to 215 plays each).]
Figure 5.10: We plot the expected wins score with respect to buzzing position (solid dark blue). For the ten most played questions in the buzztest fold we show the empirical distribution for each individual question (dotted lines) and when aggregated together (solid light blue). Among the most played questions, expected wins over-rewards early buzzes, but appropriately rewards end-of-question buzzes.

We empirically estimate the expected probability of winning from the gameplay data as a cubic polynomial (Figure 5.10),
    ω(t) = 1 − 0.0775t − 1.278t² + 0.588t³.    (5.10)
At t = 0, the potential payoff is at its highest since no one has answered the question. At the end of the question, the potential payoff is at its lowest: all the players who would have correctly answered the question already have. If the computer gets the question right at the end, it would only score points against opponents who did not know the answer at all or answered incorrectly earlier in the question.

ew marginalizes over all questions q and all positions j, and counts how many times model m produced a guess g(m, q, j) that matched the answer a(q) of the question. Specifically, we compute
    EW(m) = E_m[p_win] = (1/|Q|) Σ_{q∈Q} Σ_j 1[g(m, q, j) = a(q)] ω(j),    (5.11)
where |Q| is the count of question-position records. The indicator function is exactly an oracle buzzer: it gives credit if and only if the answer is correct. However, this rewards models with unstable predictions; for example, a model would be rewarded twice for a sequence of predictions that were correct, wrong, and correct. We discourage this behavior with a stable variant of ew which only awards points if the current answer and all subsequent answers are correct. With this formalism, it is also straightforward to compute the expected winning probability for any guesser-buzzer combination by replacing the oracle buzzer (the indicator function) with a function that equals one if the guess is correct and the buzzer yielded a "buzz" decision. We compare buzzers in Section 5.6, but now move to experimental results for the guesser.
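A sketch of how ew and its stable variant can be computed from per-position model guesses. The win-probability curve is passed in as a callable (for instance, the cubic fit in (5.10)), and the data layout is an assumption for illustration rather than our evaluation code.

```python
from typing import Callable, Dict, List

# Assumed layout per question: the model's guess at every word position plus
# the gold answer. Positions are normalized to (0, 1] for the win curve.
Question = Dict[str, object]  # {"answer": str, "guesses": List[str]}


def expected_wins(questions: List[Question],
                  win_prob: Callable[[float], float],
                  stable: bool = False) -> float:
    """Average win probability credited whenever the model's guess is correct."""
    credit, n_records = 0.0, 0
    for q in questions:
        guesses, answer = q["guesses"], q["answer"]
        k = len(guesses)
        correct = [g == answer for g in guesses]
        if stable:
            # Only credit positions where the guess stays correct to the end.
            suffix_ok = list(correct)
            for j in range(k - 2, -1, -1):
                suffix_ok[j] = correct[j] and suffix_ok[j + 1]
            correct = suffix_ok
        for j, ok in enumerate(correct):
            n_records += 1
            if ok:
                credit += win_prob((j + 1) / k)
    return credit / max(n_records, 1)
```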
              Accuracy (%), Start                 Accuracy (%), End                 E[p_win]
           Dev             Test               Dev             Test
Model    Top   Mean      Top   Mean        Top   Mean      Top   Mean          Dev    Test
Lin.     2.56  2.56±0.0  1.58  1.58±0.0    11.9  11.9±0.0  9.25  9.25±0.0      6.62   4.96
ir       9.48  9.48      6.23  6.23        62.2  62.2      54.5  54.5          45.8   38.8
dan      10.7  10.4±0.3  8.28  7.88±0.3    60.0  59.1±0.9  51.0  51.4±1        42.6   35.5
rnn      10.5  9.46±0.7  7.86  7.78±0.4    52.3  51.8±1    46.4  45.9±0.9      27.6   23.3
bert     12.5  11.1±0.8  9.34  9.49±0.3    53.4  55.0±0.9  47.0  48.8±0.9      36.6   31.6

Table 5.4: We compare several models by accuracy at start-of-question, accuracy at end-of-question, and ew. In the table, models are sorted by start-of-question development set accuracy. Standard deviations for non-ir models are derived from five trials; standard deviation is not reported for the ir model since it is deterministic.

Guesser Comparison Experiments   We evaluate our guessers using start accuracy, end accuracy, and expected wins (Table 5.4). All models struggle at the start of the question, with the best accuracy at only 9.34%. This is unsurprising: the first sentence contains the most difficult clues and is difficult for even the best human players. Models fare significantly better near the end of the question with giveaway clues. However, even the best model's 54.5% accuracy leaves much room for future work. While the bert model has the best early-question accuracy, it lags behind the ir and dan models in end-of-question accuracy. We suspect that order-aware models over-emphasize less important parts of the question; additionally, the gap between sentence-level training and full-question inference advantages models that did not need to learn an aggregation over longer sequences. This pattern is also reflected in the ew scores; bert, as expected, outperforms the rnn model. Finally, across accuracy and ew we see substantial drops between the development and test sets, which suggests overfitting. Next, we investigate the errors models make.

5.7.2 Identifying Sources of Error

This section identifies and characterizes several failure modes of our models.
First, we compare the predictions of blackbox neural models with those of an ir model, an explicit pattern matcher. Following this, we identify data skew towards popular answers as a major source of error for less popular answers. Lastly, we manually break down the test errors of one model.

[Figure 5.11 plots: (a) start-of-question and (b) end-of-question comparison of which questions the bert and ir models answer correctly or wrongly.]
Figure 5.11: The bert and ir models are mostly wrong or correct on the same subset of questions. At the end of the question, most of the questions the bert model is correct on, the ir model is also correct on.

Behavioral Comparison of Neural and IR Models   One way to analyze blackbox models like neural networks is to compare their predictions to those of better understood models like the ir model. If their predictions, and thus their exterior behavior, are similar to a better understood model, it suggests that they may operate similarly. Figure 5.11 shows that the bert and ir models are correct and wrong on many of the same examples at end-of-question. Since one model, the ir model, is an explicit pattern matcher, this hints that neural qb models learn to be pattern matchers, as suggested by other work (Jia and Liang, 2017; Rajpurkar et al., 2018; Feng et al., 2018). Next we investigate this pattern matching hypothesis at the instance level.

For our instance-level analysis we sample examples of correct and incorrect predictions. First we randomly sample a test question that all models answer correctly after the first sentence (Figure 5.12). This particular example has similar phrasing to a training example ("A holder of this title commissioned . . . miniatures"), so it is unsurprising that all models get it right. In our second analysis, we focus on a specific answer (Turbulence) and its twenty-seven training questions. Figure 5.13 shows a sample question for this answer that the rnn model answered correctly but the ir model did not. The most frequent words in the training data for this answer are "phenomenon" (twenty-three times), "model" (seventeen times), "equation" (thirteen times), "numerically" (once), and "tensor" (once). In this analysis we removed these words or substituted them with synonyms and then checked whether the model's prediction stayed the same.

Substituting words in this question shows that the model is over-reliant on specific terms. After removing the term "phenomenon," the model changed its answer to Ising model (a mathematical model of ferromagnetism). If we instead substitute the term with synonyms such as "occurrence", "event", and "observable event", the answers are still incorrect. Similarly, if "model" is replaced by "representation" the rnn also makes incorrect predictions. At least for this question, the model is not robust to these semantics-preserving modifications (Ribeiro et al., 2018). Next we move to aggregate error analysis.

Test Question (first sentence): A holder of this title commissioned a set of miniatures to accompany the story collection Tales of a Parrot.
Training Question (matched fragment): A holder of this title commissioned Abd al-Samad to work on miniatures for books such as the Tutinama and the Hamzanama.
Answer: Mughal Emperors
Figure 5.12: A test question that was answered correctly by all models after the first sentence, normally a very difficult task for both humans and machines. A very similar training example allows all models to answer the question through trivial pattern matching.

Test Question (first sentence): This phenomenon is resolved without the help of a theoretical model in costly DNS methods, which numerically solve for the rank-2 tensor appearing in the RANS equations.
Answer: Turbulence
Score (rnn): .0113
Synonym Attacks: phenomenon → event, model → representation
Figure 5.13: Only the rnn model answers this question correctly. To test the robustness of the model to semantically equivalent input modifications, we use sears-based (Ribeiro et al., 2018) synonym attacks and cause the model prediction to become incorrect. Although this exposes a flaw of the model, it is also likely that the low confidence score would lead a buzzer model to abstain; this highlights one benefit of implicitly incorporating confidence estimation into the evaluation.
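The synonym analysis above is easy to script. The sketch below applies word-level substitutions and reports whether the top prediction changes; the `predict` callable and substitution list are placeholders, and this is a manual stand-in for sears-style rule generation rather than the original method.

```python
from typing import Callable, Dict, List, Tuple

Predict = Callable[[str], Tuple[str, float]]  # question text -> (top answer, score)


def synonym_checks(question: str,
                   substitutions: Dict[str, List[str]],
                   predict: Predict) -> List[dict]:
    """Report how the top prediction changes under word substitutions."""
    original_answer, _ = predict(question)
    results = []
    for word, replacements in substitutions.items():
        for replacement in replacements + [""]:  # "" removes the word entirely
            perturbed = question.replace(word, replacement).replace("  ", " ")
            answer, score = predict(perturbed)
            results.append({
                "word": word,
                "replacement": replacement or "<removed>",
                "answer": answer,
                "score": score,
                "changed": answer != original_answer,
            })
    return results


# Example substitutions from the Turbulence analysis above.
substitutions = {"phenomenon": ["occurrence", "event", "observable event"],
                 "model": ["representation"]}
```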
Errors Caused by Data Sparsity   For many test set answers, scarcity of training data is a significant source of error. Most egregiously, 17.9% of test questions have zero corresponding training examples. Beyond these questions, many more answers have few training examples. While some topics are frequently asked about, one goal of question writers is to introduce new topics for students to learn from. For example, although physics is a common general topic, Electromagnetism has only been the answer to one qb question. The distribution of training examples per unique answer is skewed (Figure 5.14), and countries, like Japan, are asked about much more frequently. Unsurprisingly, plotting the number of training examples per test question answer versus model accuracy shows significant drops in accuracy for about half of the test questions (Figure 5.15).

[Figure 5.14 plot: number of answers versus number of training examples per answer (log scale), with example answers labeled from Electromagnetism (one example) to Japan (about 100 examples).]
Figure 5.14: The distribution of training examples per unique answer is heavily skewed. The most frequent answer (Japan) occurs about 100 times. Nearly half of the questions have one training example, and just over sixty percent have either one or two training examples.

Error Breakdown   We conclude our error analysis by inspecting and breaking down the errors made by the rnn model at the start and end of questions. Of the 2,151 questions in the test set, 386 have zero training examples, leaving 1,765 questions that are answerable by our models. Of the remaining questions, the rnn answers 1,540 incorrectly after the first sentence and 481 at the end of the question. To avoid errors likely due to data scarcity, we only look at questions with at least 25 training examples; the number of errors on this subset at the start and the end of the question is 289 and 36. Table 5.5 lists reasons for model errors on a random sample of 50 errors from the start of the question and all 36 errors from the end of the question. The predominant source of error is the model predicting the correct answer type (e.g., person, country, place) but choosing the incorrect member of that type. This accounts for errors such as choosing the wrong person, country, place, or event. The rnn especially confuses countries; for example, in Figure 5.16, it confuses Spain and the United States, the parties to the Adams-Onís Treaty. The relative absence of incorrect answer type errors at the end of questions may be attributable to the tendency of late clues to include the answer type (such as "name this country . . . "). In the process of the manual error breakdown we also found five annotation errors where the assigned Wikipedia answer did not match the true answer. This low number of errors further validates the robustness of our answer mapping process.

[Figure 5.15 plots: test accuracy for the bert, dan, ir, rnn, and vw models at the start and end of questions versus N or fewer training examples; a marker indicates the 50% of test questions with seven or fewer training examples.]
Figure 5.15: The more an answer is asked about in the training set, the easier it is for all models, both at the start and end of the question. This is a significant source of errors since accuracy on at least 50% of test questions, those with seven or fewer training examples, is significantly lower for all models.

5.7.3 Evaluating the Buzzer

We first evaluate our buzzer against the locally optimal buzzer, which buzzes as soon as the guesser gets the answer correct. However, this can be overly ambitious and unrealistic since the guesser is not perfectly stable: it can get the answer correct by chance, then vacillate between several candidates before settling down to the correct answer. To account for this instability, we find the first position at which the guesser stabilizes to the correct answer and set it as the optimal buzzing position. In other words, we compare against the optimal buzzer used in the definition of the stable ew score. To be exact, we start at the last position where the guess is correct and go backwards until the guess is incorrect; we consider this the locally optimal buzzing position, set the ground truth at all positions before it to zero, and set all positions after it to one.

Error Reason     Start Count   End Count
Wrong Country         11           17
Wrong Person          16            2
Wrong Place            1            5
Wrong Type            15            5
Wrong Event            0            1
Nonsense               7            2
Annotation             1            4

Table 5.5: An error breakdown for questions with at least twenty-five training examples. To analyze errors at the start of questions, we randomly sampled fifty errors; for the end of questions we took all thirty-six errors. End-of-question errors are primarily wrong-country errors, as in Figure 5.16, where the model answers United States instead of Spain. Errors at the start of the question, though, are more diverse. The most common error is guessing the correct answer type, but not the specific member of that type; examples of this error class include answering Albert Einstein instead of Alan Turing, or Iowa instead of Idaho.

Model        acc     ew       Score
Threshold     -      0.013    -9.98
mlp          0.840   0.272    -2.31
rnn          0.849   0.302    -1.01
Optimal      1.0     0.502     2.19

Table 5.6: The accuracy (acc), expected wins (ew), and qb score (Score) of each buzzer on the validation set. Both mlp and rnn outperform the static threshold baseline by a large margin, but there is still a considerable gap from the optimal buzzer.

We use the same guesser (rnn) in combination with different buzzers (Table 5.6) and quantitatively compare their expected wins (§5.7.1). Both mlp and rnn buzzers win against the static threshold baseline, but there is a considerable gap between the rnn and the optimal buzzer. Low expected wins means the buzzer is either too aggressive or not aggressive enough. To characterize their weaknesses, we compare the buzzers' behavior over time (Figure 5.18). The static threshold buzzer is too aggressive, especially early in the question, as is also seen in Figure 5.17.
This behavior to some extent resonates with the observation that the confidence of neural models needs calibration (Guo et al., 2017). The difference between mlp and rnn is small but rnn is less likely to be overly aggressive early in the question. For a more fine-grained analysis, we simulate games where our system plays against individual human players using the gameplay dataset (?5.3.3). Based on the guesser, we classify questions as ?possible? or not. If the guesser gets the answer correct before the opponent answers, it is possible for the buzzer to win the question. Otherwise, it is impossible for the buzzer to do anything to beat the opponent. Based on this categorization, Figure 5.19 further breaks down the outcomes: the rnn is 104 Test Question: This country seized four vessels owned by Captain John Meares, which were registered in Macau and disguised with Portuguese flags, starting a dispute over fishing rights. To further negotiations with this country, Thomas Jefferson signed the so-called ?Two Million Dollar Act.? This country agreed not to police a disputed spot of land, which was subsequently settled by outlaws and ?Redbones?, and which was called the ?Neutral Ground.? This country was humiliated by England in the Nootka Crisis. Harman Blennerhassett?s farm on an island in the Ohio River was intended as the launching point of an expedition against this European country?s possessions in a plan exposed by James Wilkinson. This country settled navigation rights with the United States in Pinckney?s Treaty, which dealt with the disputed ?West? section of a colony it owned. For 10 points, name this European country which agreed to the Adams-Onis Treaty hand- ing over Florida. Guess: United States Answer: Spain Figure 5.16: Although the answer to this question is Spain, many of the terms and phrases mentioned are correlated with the United States. Thus, the rnn model answers United States instead of the correct answer Spain. This is one of many examples where the model answers with the correct answer type (country), but incorrect member of that type. less likely to be overly aggressive in both possible and impossible cases. 5.8 Live Exhibition Events No amount of thorough experimentation and analysis of machine learning sys- tems can match the public interest and rubber-meets-the-road practicalities of live matches between humans and machines. ibm?s Watson in Jeopardy! (Ferrucci et al., 2010), Deep Blue in chess (hsiung Hsu et al., 1995), and Google?s AlphaGo in Go (Silver et al., 2016) were both tremendous scientific achievements and cultural watersheds. In the case of chess and Go, they transformed how the games are played through insight gained from the collaboration of humans and machines. Lastly, al- though our offline evaluation is reasonable, a live evaluation verifies that the two correspond since that is not always clear (Hersh et al., 2000). In a similar spirit, we have hosted eight live events since 2015 where we show- case our research to the public by having humans and machines compete against each other.34 Except for our nips 2015 Best Demonstration against ml researchers, our system?s opponents have been strong trivia players. Their achievements in- clude victories in numerous national qb championships (high school and college), Jeopardy!, and similar trivia competitions. 
Our inaugural event in 2015 at the qb High School National Competition Tournament (hsnct) pitted an early and vastly different version of our system against a team of tournament organizers in a match that ended in a tie.[35] Later that year, a similar system defeated Ken Jennings of Jeopardy! fame at the University of Washington, but lost convincingly (145-345) at hsnct 2016. The subsequent year, at hsnct 2017, our redesigned system narrowly defeated its opponents (260-215). This system used an ir guesser combined with a question type classifier (Li and Roth, 2002) and an rnn buzzer.[36] Although this impressive result appears to follow the trend of machines improving until they defeat skilled humans, it is far from the whole story, as we will see in Chapter 6.

In parallel with these events, we hosted events where teams, humans and machines alike, were selected from open competition. The first of these events was hosted as part of a naacl 2016 workshop on question answering. Before the event, local high school teams competed against each other, and researchers submitted machine systems which also played simulated matches against each other. At the event the best human and machine teams played against each other, with the high school team defeating an early version of Studio ousia's system (Yamada et al., 2017, 2018b).[37] In 2017, we hosted a similar workshop at nips where an improved version of ousia's system yet again defeated its machine competition, but this time also defeated the invited human team.

[34] Videos of our events are available at http://events.qanta.org.
[35] Our software did not handle ties correctly and terminated instead of playing tiebreaker questions.
[36] In absolute terms, the type classifier did not improve accuracy; however, in our matches we display the top five scoring guesses, and the type classifier improved the "plausibility" of that list.
[37] The ousia system embeds words and entities separately, and uses a dan-based architecture over these.

[Figure 5.17 example question, with buzz positions marked [1]-[3]:] "This instrument plays the only extended solo in the overture to Verdi's [1] Luisa Miller. This is the solo instrument in a piece that opens with the movement 'The Perilous Shore'. This instrument introduces the main theme to 'The Pines of Janiculum' from Respighi's The Pines of Rome. This instrument has a long solo at the beginning of the Adagio from Rachmaninoff's Second Symphony, and it first states the Shaker theme in Copland's Appalachian Spring. John Adams' Gnarly Buttons is for this instrument. Heinrich Baermann, a virtuoso on this instrument, was the dedicatee of Carl Maria von Weber's two concertos for it. The basset horn is a variant of, for 10 points, what [2] single-reed woodwind instrument, which plays a notable glissando at the opening of Gershwin's [3] Rhapsody in Blue?" Answer: Clarinet. The optimal buzz position and the correct and wrong buzzes are marked in the figure. Threshold Buzz: Bassoon. MLP Buzz: Bassoon. RNN Buzz: Clarinet. Human Buzzes: [1] Violin, [2] Oboe, [3] Flute, Clarinet.
Figure 5.17: In this question, the Threshold and mlp buzzers are too aggressive and buzz before the guesser's answer is correct. In contrast, the rnn is more conservative and buzzes shortly after the optimal point, which is, by a wide margin, still earlier than the earliest (correct) human buzz.

[Figure 5.18 panels (Threshold, mlp, rnn): fraction of questions at each position falling into the categories both, neither, only buzzer, and only optimal.]
Figure 5.18: Comparing buzzers' behavior over time against the optimal buzzer. The red crossed area and the dotted blue area combined indicate when the buzzer thinks the guesser is correct; the other two combined indicate when the buzzer thinks the guesser is wrong. The red (crossed) and orange (unhatched) areas combined indicate when the buzzer matches the optimal buzzer. Our goal is to maximize the red areas and minimize the blue areas. The static threshold baseline is overly aggressive, especially at earlier positions in the question (large dotted blue area); mlp and rnn both behave reasonably well, and the aggressiveness of the rnn is slightly more balanced early in the question.
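The four categories in Figure 5.18 are easy to recompute from logged decisions. Below is a sketch that, for one question, labels each position by comparing the buzzer's decision with the locally optimal one (the first position from which the guesser stays correct). The data layout is assumed for illustration.

```python
from typing import List


def optimal_buzz_position(correct: List[bool]) -> int:
    """First position from which the guesser stays correct to the end (else len)."""
    position = len(correct)
    for j in range(len(correct) - 1, -1, -1):
        if correct[j]:
            position = j
        else:
            break
    return position


def categorize_positions(buzz: List[bool], correct: List[bool]) -> List[str]:
    """Label each position as in Figure 5.18: both / neither / only buzzer / only optimal."""
    opt = optimal_buzz_position(correct)
    labels = []
    for j, buzzer_would_buzz in enumerate(buzz):
        optimal_would_buzz = j >= opt
        if buzzer_would_buzz and optimal_would_buzz:
            labels.append("both")
        elif buzzer_would_buzz:
            labels.append("only buzzer")
        elif optimal_would_buzz:
            labels.append("only optimal")
        else:
            labels.append("neither")
    return labels
```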
Events and collaborations like these show that qb is more than just another question answering task. By engaging with the qb community to digitize qb questions in a machine-readable form, we not only made our research possible but started the ecosystem of tools that students now rely on to practice with before competitions. In the next step towards deeper collaboration with this community, we are building ways for humans and machines to cooperate in competition (Feng and Boyd-Graber, 2019) and in writing questions (Chapter 6). We accomplish all this while simultaneously providing ways for students of all ages to engage with and benefit from research through our live exhibition events.

5.9 Related Work

Quizbowl is a question answering and sequential decision-making task. Where Chapter 2 provides a general review of qa tasks, this section compares and contrasts qb with some of these tasks and datasets. Following this, we connect the decision-making aspect of qb to model calibration and prediction under domain shift. Having discussed the uncertainty in model confidence scores, we finally discuss handling uncertainty caused by opponents through opponent modeling.

[Figure 5.19 plot: for the Threshold, mlp, and rnn buzzers, counts of questions broken down by point outcome (from +15 to -15) and by whether the question was possible for the buzzer to win.]
Figure 5.19: Breaking down the buzzer's performance at the individual question level. An impossible question means there is nothing the buzzer can do to beat the opponent. It is clearer that the rnn performs better than the mlp, making fewer mistakes of being overly aggressive.

5.9.1 Question Answering Datasets

In our comparison, we focus on factoid qa tasks in particular. The least complex types of questions are often called "simple questions" since they can be answered by a single simple fact. For example, SimpleQuestions (Bordes et al., 2015) is specifically designed so that questions can be answered using one knowledge-base triplet, and WikiMovies (Miller et al., 2016b) automatically generates questions from knowledge-base triplets. Similarly, WebQuestions (Berant et al., 2013) uses the Google Suggest api to collect questions containing specific entities, and crowdworkers answer the questions using only Freebase facts. Despite the relative ease of creating these datasets, they lack the complexity and linguistic diversity of "natural" data, which is partly responsible for tasks like SimpleQuestions being essentially solved (Petrochuk and Zettlemoyer, 2018). In qb, the capability to answer simple questions like these is a bare-minimum requirement and takes the form of the final "giveaway" clues at the end of questions, which even novice (human) players usually answer correctly.

Humans have played trivia games and tournaments for decades (Boyd-Graber and Börschinger, 2020), and as a result there are ample non-qb sources of trivia questions. The most famous example, Jeopardy!, was converted into the searchqa dataset (Dunn et al., 2017).
Other trivia-based qa datasets include TriviaQA, which is built from fourteen trivia sites (Joshi et al., 2017), and Quasar-T, which is built from questions collected by a Reddit user (Dhingra et al., 2017). While some of these are paired with potentially useful supporting evidence, a hallmark of these datasets and of qb is that the question alone unambiguously identifies an answer.

However, providing supporting documents to answer questions is another popular way to frame qa tasks. Although Trecqa (Voorhees and Tice, 2000) and NaturalQuestions (Kwiatkowski et al., 2019) differ in that answers are not known a priori by the asker (§2.2.3), they are good examples of questions that are paired with verified supporting documents. In these tasks, the questions, user queries from search engines, were written without knowledge of any particular supporting documents, and afterwards annotators attempted to find appropriate supporting documents. Although triviaqa provides potentially relevant documents, they are not human verified; similarly, ms marco (Nguyen et al., 2016), WikiReading (Hewlett et al., 2016), and Newsqa (Trischler et al., 2017) also provide unverified supporting documents. squad in particular falls outside this paradigm since, in general, its questions depend on the selected context paragraph (§2.2.2).

Another set of tasks focuses on creating questions that require multiple supporting documents and multi-step reasoning. In qb, early clues are often a composition of multiple facts about the answer. For example, to guess "Die Zauberflöte" from "At its premiere, the librettist of this opera portrayed a character who asks for a glass of wine with his dying wish", one would have to combine two pieces of text from Wikipedia: "Papageno enters. The priests grant his request for a glass of wine and he expresses his desire for a wife." and "Emanuel Schikaneder, librettist of Die Zauberflöte, shown performing in the role of Papageno". While this work focuses on tossup qb questions, a second type of qb question, bonuses, emphasizes this multi-step aspect through multi-part questions (Elgohary et al., 2018).

Multi-step reasoning datasets primarily differ by providing ground-truth annotations of sufficient supporting documents. In Wikihop (Welbl et al., 2018), multi-hop questions are automatically constructed from Wikipedia, Wikidata, and WikiReading. HotPotQA (Yang et al., 2018b) follows a similar structure, but question text is crowdsourced rather than automatically generated. However, Min et al. (2019) show that although these questions are meant to be solvable only by multi-step reasoning, many are answerable with single-hop reasoning. Subsequent datasets like qasc (Khot et al., 2020), drop (Dua et al., 2019), and break (Wolfson et al., 2020) focus on creating questions that are much more likely to require multi-step reasoning through adversarial annotation.

However, multi-step questions are not the only way to make questions more difficult. Adversarial question authoring and adversarial filtering (Zellers et al., 2018) are other ways to increase difficulty. The general framework of adversarial authoring filters questions either during or after annotation by whether a strong baseline answers them correctly. For example, Wallace et al.
(2019b) show that when qb question writers?while authoring a question?are shown what a model would answer and why, that they create questions that are significantly more difficult for machines while being no more difficult for human players. Along similar lines, Bartolo et al. (2020) let writers see what a model would answer, but iteratively create new ad- versarial questions, re-train the model, and collect new adversarial questions anew. Although contrast sets explicitly do not have a model in the loop (Gardner et al., 2020a), they are similarly intended to challenge models through example perturba- tions. Other effective example perturbations for automatically creating adversarial examples include adding sentences to context paragraphs (Jia and Liang, 2017) and in general semantic-preserving permutations (Ribeiro et al., 2018). The common 109 thread in all these works is to make questions more difficult for machines so that models continue to improve. 5.9.2 Answer Triggering and Model Calibration Just as in qb?s buzzer, in real-world applications of machine learning it is important to know when to trust a model?s confidence in its predictions (Jiang et al., 2012). In qa, knowing when to answer is known as answer triggering (Yang et al., 2015; Rajpurkar et al., 2018) and the core task?correctly estimating model confidence?is model calibration (Zadrozny and Elkan). Having machines that ac- curately convey their confidence is doubly important since humans have the unfor- tunate tendency to place too much trust in machines (Sundar, 2007). Despite this, the rise of deep learning seems to have only made this more challenging (Guo et al., 2017; Feng et al., 2018). Fortunately, recent work has made progress in this prob- lem by making connections to out-of-domain detection (Kamath et al., 2020). Along similar lines, the buzzing task in qb is partially dependent on having well-calibrated models. 5.9.3 Opponent Modeling qb is far from the only game where players benefit from modeling oppo- nent behavior. Opponent modeling is particularly important in games with hidden information?in qb this takes the form of the yet-to-be-revealed question and the per-question skill of the opponent. One classic example of opponent modeling under uncertainty is Poker (Billings et al., 1998) where identifying and adapting to oppo- nents is central to the game. In games like Scrabble, opponent behavior can even be used to infer hidden information?such as their remaining tiles?to make it easier to anticipate and counter their strategy (Richards and Amir, 2007). Similarly, in real-time strategy games the opponent?s strategy is often hidden and hierarchically structured (Schadd et al., 2007); they may play aggressive?similar to aggressive buzzing?making defensive play more advantageous. Traditionally, qb is played in teams, so a full model should account for your own team?s skills as well as the mix- ture of opponent skills (e.g., certain players may be better at history questions). Although this is unaddressed in qb, this general framing has been considered in games where players compete with each other for limited resources, but are advan- taged by doing so semi-cooperatively (Von Der Osten et al., 2017). One potential area of future work is in focusing on opponent modeling in qb with an emphasis on accounting for teams of players and strategies that evolve over the course of a match (e.g., play less conservatively when trailing). 
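None of this opponent modeling is implemented in our system; purely as an illustration of the kind of strategy adjustment discussed above, a buzz threshold could be conditioned on the game state, for instance lowered when trailing late in a match. The step sizes and caps below are arbitrary placeholder values.

```python
def adjusted_threshold(base_tau: float, score_diff: int, frac_match_left: float) -> float:
    """Hypothetical state-aware tweak: buzz more eagerly when trailing late.

    score_diff: our score minus the opponent's; frac_match_left in [0, 1].
    """
    tau = base_tau
    if score_diff < 0 and frac_match_left < 0.25:
        tau -= 0.05 * min(3, -score_diff // 10)   # trailing by more -> riskier buzzes
    elif score_diff > 0 and frac_match_left < 0.25:
        tau += 0.05                               # protect a late lead
    return min(0.95, max(0.5, tau))
```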
5.10 Future Work Since future work in Chapter 7 focuses on directions to improve dataset con- struction and qa evaluations, here we identify research directions that are particu- larly well suited to qa in particular. Improving qb models can incorporate many 110 commonplace tasks in nlp other than question answering. Reasoning about entities can be improved through better named entity recognition, entity linking, coreference and anaphora resolution. Some of the more difficult clues in qb however presume that the player has read and integrated the content of books such as important plot points. Further work in reading comprehension and summarization could help answer some of these questions. At a more general level, the extraction of infor- mation from external knowledge sources (such as books or Wikipedia) is important since the distribution of training examples per answer is heavily skewed, and some new questions ask about current events. Next, we take a closer look at future work using the perpetual influx of new and diverse questions from annual tournaments and working with the supportive qb community. 5.10.1 Generalization in Factoid Question Answering Although the precise format of qa is a bit idiosyncratic, it continues to grow in size and diversity year-over-year due to continually running tournaments and the factual arms race between players and writers (?5.2.3). With the popularization of qb, the number of questions per year quadrupled in the ten-year period between 2007 and 2017 (Figure 5.20). As the dataset continues to grow, it will demand that machines and humans broaden their knowledge of past events while also updating their knowledge with current events. Every year presents an opportunity to test both how well models generalize to novel questions and how well they generalize to questions about current events. For example, in a 2017 exhibition match our model missed a question about the company responsible for driving down rocket launch costs (SpaceX); a phenomenon which only manifested itself several years prior. These questions could become part of the effort to build dynamic leaderboards like in Chapter 7 which will hopefully improve the generalization of qa models. For qb specifically, a major unsolved challenge is few-shot and zero-shot qa. The scarcity of training examples per unique answer (Figure 5.14) causes a substan- tial drop in the accuracy of qb models (Figure 5.15). The central challenge is one that every student agonizes over while reviewing for exams: anticipating what topics will be on the exam! In qb, ?being able to predict and learn about the non-canonical answers that are on the cusp of being asked about is one way to become a great player?(can, 2020). In our data and throughout exhibition matches, we estimate that between 15-20% of questions have novel answers. A similar challenge is faced in computer vision where not all object categories are known beforehand and for many there is scarce training data (Xian et al., 2018). Although there are relatively simple domain adaptation methods based on feature augmentation (Daume III, 2007; Kim et al., 2016b) and adversarial learning (Chen and Cardie, 2018), these still do not address the core challenge of anticipating test-time topics. Recent work provides a potential solution to this by constructing a large set of ?probably-asked questions? (Lewis et al., 2021) to use for training. Beyond this, although there is a cultural understanding of the ?canon? 
of qb, this has not been rigorously studied. Doing so could inform how best to train zero-shot qa models for qb and, more interestingly, shed insight into how a dedicated community's shared memory has evolved over time.

[Figure 5.20 plot: counts of total questions and distinct answers in the qanta dataset, accumulated up to each year (inclusive) from 1997 to 2017.]
Figure 5.20: The growth of the qanta dataset in number of questions and number of distinct answers over the past twenty years starting in 1997. The dataset has grown by at least 5,000 questions every year since 2010. All questions with matched answers are included, and we construct the plot using the tournament year of each question. Independently, participation in qb (and thus the number of students writing questions) has roughly doubled every year since 2008.

5.10.2 Robust, Trustable, and Explainable Machine Learning

qb naturally encourages qa systems that are well calibrated, but it is also a useful setting for improving robustness (Goel et al., 2021), explainability (Belinkov and Glass, 2019), and trust (Mitchell et al., 2019) in machine learning systems. For example, qa is a good platform for evaluating interpretations through machine-human cooperative play (Feng and Boyd-Graber, 2019). Beyond this, as Chapter 6 explores, qb is a good platform for testing the explainability of qa systems through adversarial authoring. Our collaborative research in qb thus far is only a beginning.

5.11 Conclusion

This chapter introduces and makes the case for incremental, pyramidal question answering with qb. Solving qb questions requires sophisticated nlp, such as resolving complex coreference, multi-hop reasoning, and understanding the relationships between the plethora of entities that could be answers. Fundamental to answering qb questions is that the questions are incremental; this is both fun and good for research. It is fun because it allows for live, engaging competitions between humans and computers. To evaluate systems we use three methods: offline accuracy-based metrics adapted to the incremental nature of qb, simulated matches against machines and humans, and live exhibition matches. Although the best models have sixty percent accuracy at the end of questions, this is well below the best players, and early clues remain particularly challenging. This format, the product of decades of refining human question answering competitions, is also good for research because it allows fair, comprehensive comparison of systems and iterative improvement as systems answer questions earlier and earlier.

However, the benefits to research go beyond format or specific sub-tasks and extend to our symbiotic collaboration with the public. Exhibition matches double as outreach events and opportunities to put machine systems to the test on previously unseen questions. In the next chapter, we show that by collaborating with the qb community we can combine the strengths of machines and humans to improve question quality. Rather than compete against qb systems, writers collaborate with machine learning tools to discover bad clues, which helps create questions that are more interesting to humans and that better test the generalization of systems. By aligning the goals of the trivia and qa communities, we can create datasets that better discriminate between different levels of language and knowledge understanding.

Beyond the specific qb community, live exhibitions also serve the general public.
Exhibition games demonstrate what nlp and ml systems can do and what they cannot. When the tricks that computers use to answer questions are revealed to lay audiences, some of the mystique is lost, but it can also encourage enthusiasts to investigate our techniques to see if they can do better. Our open data and code facilitate this open competition, and the qb community helps make it fun and engaging. qb isn't just another dataset or task; it is a rich platform for nlp research that co-evolves with the qb community. One example of this is our digitization of qb questions through our online interface, which created and popularized online qb play. The next chapter explores how machines and humans can collaborate to write better questions while advancing nlp research. Beyond this, we hope that new, unforeseen research directions will continue emerging from this collaboration while simultaneously giving back to the qb community through new and exciting ways of engaging with state-of-the-art research in machine learning and natural language processing.

Chapter 6: Centaur Authoring of Adversarial Questions

"Similar to how the mythological centaur was half-human, half-horse, [these chess teams] were half-human, half-ai. Because humans & ais are strong on different dimensions, together, as a centaur, they can beat out solo humans and computers alike."
Nicky Case on Kasparov's Centaur Chess

This chapter continues building on the idea of creating more discriminative questions, but instead of focusing on format as in Chapter 5, we focus on more discriminative data.[1] As Chapter 4 points out (§4.4.5), questions are least discriminative when the likelihood of both subjects having the same response is highest because the question is either too easy or too hard. One thing that can make qb questions too easy for machines is if statistical patterns give away the answer (§5.7.2); the net effect is that an otherwise challenging question, one that could drive progress, is wasted. This chapter introduces an annotation framework where humans and machines cooperate to write qb questions that contain fewer of these statistical patterns. Although this may be undesirable under the Cranfield paradigm (§2.1.1),[2] it is important under the Manchester paradigm (§2.1.2), since under that paradigm the presence (or absence) of specific statistical patterns should not matter.

[1] This chapter is based on the tacl publication Wallace et al. (2019b).
[2] Some patterns may be useful; at the same time, ir systems should be robust, so the desirability of statistical patterns is contextual.

6.1 Introduction

Proponents of machine learning claim human parity on tasks like reading comprehension (Yu et al., 2018) and commonsense inference (Devlin et al., 2018). Despite these successes, many evaluations neglect that computers solve natural language processing (nlp) tasks in a fundamentally different way than humans. Models can succeed without developing "true" language understanding (Bender and Koller, 2020), instead learning superficial patterns from crawled (Chen et al., 2016) or manually annotated datasets (Kaushik and Lipton, 2018; Gururangan et al., 2018). In the context of testing for intelligent behavior, answering from patterns alone is undesirable. Thus, recent work stress tests models via adversarial evaluation: elucidating a system's capabilities by exploiting its weaknesses (Jia and Liang, 2017; Belinkov and Glass, 2019).
Unfortunately, while adversarial evaluation reveals simplistic model failures (Ribeiro et al., 2018; Mudrakarta et al., 2018), exploring more complex failure patterns requires human involvement (Figure 6.1): automatically modifying natural language examples without invalidating them is difficult. Hence, the diversity of adversarial examples is often severely restricted.

Figure 6.1: Adversarial evaluation in nlp typically focuses on a specific phenomenon (e.g., word replacements) and then generates the corresponding examples (top). Consequently, adversarial examples are limited to the diversity of what the underlying generative model or perturbation rule can produce and also require downstream human evaluation to ensure validity. Our setup (bottom) instead has human-authored examples, using human-computer collaboration to craft adversarial examples with greater diversity.

Instead, our half-human, half-computer (centaur) approach uses human creativity to generate adversarial examples. A user interface presents model interpretations and helps users craft model-breaking examples (§6.3). We apply this to qb (Chapter 5), where trivia enthusiasts, who already write questions for academic competitions, create diverse examples that stump existing qa models.

The adversarially authored test set is nonetheless as easy as regular questions for humans (§6.4), but the relative accuracy of strong qa models drops by as much as 40% (§6.5). We also host live human vs. computer matches; although models typically defeat top human teams, unlike in our prior exhibition matches (§5.8) we observe spectacular model failures on adversarial questions.

Analyzing the adversarial edits uncovers phenomena that humans can solve but computers cannot (§6.6), validating that our framework uncovers creative, targeted adversarial edits (§6.7). Our resulting adversarial dataset presents a fun, challenging, and diverse resource for future qa research: a system that masters it will demonstrate more robust language understanding.

6.2 Adversarial Evaluation for nlp

Adversarial examples (Szegedy et al., 2013) often reveal model failures better than traditional test sets. However, automatic adversarial generation is tricky for nlp: it is hard to modify examples (e.g., by replacing words) without changing their meaning or invalidating them.

Recent work side-steps this by focusing on simple transformations that preserve meaning. For instance, Ribeiro et al. (2018) generate adversarial perturbations such as replacing "What has" with "What's". Other minor perturbations such as typos (Belinkov and Bisk, 2018), adding distractor sentences (Jia and Liang, 2017; Mudrakarta et al., 2018), or character replacements (Ebrahimi et al., 2018) preserve meaning while degrading model performance.

Generative models can discover more adversarial perturbations but require post hoc human verification of the examples. For example, neural paraphrase or language models can generate syntax modifications (Iyyer et al., 2018), plausible captions (Zellers et al., 2018), or nli premises (Zhao et al., 2018). These methods improve example-level diversity but mainly target a specific phenomenon, e.g., rewriting question syntax.
Furthermore, existing adversarial perturbations are restricted to sentences, not the paragraph inputs of qb and other tasks, due to challenges in long-text generation. For instance, syntax paraphrase networks (Iyyer et al., 2018) applied to qb only yield valid paraphrases 3% of the time (Appendix D.1).

6.2.1 Putting a Human in the Loop

Instead, we task human authors with adversarial writing of questions: generating examples which break a specific qa system but are still answerable by humans. We expose model predictions and interpretations to question authors, who find question edits that confuse the model.

The user interface makes the adversarial writing process interactive and model-driven, in contrast to adversarial examples written independent of a model (Ettinger et al., 2017). The result is an adversarially authored dataset that explicitly exposes a model's limitations by design.

Human-in-the-loop generation can replace or aid model-based adversarial generation approaches. Creating interfaces and interpretations is often easier than designing and training generative models for specific domains. In domains where adversarial generation is feasible, human creativity can reveal which tactics automatic approaches can later emulate. Model-based and human-in-the-loop generation approaches can also be combined by training models to mimic human adversarial edit history, using the relative merits of both approaches.

6.3 Our QA Testbed: Quizbowl

Like the work in Chapter 5, we continue using qb, the "gold standard" of academic competitions between universities and high schools.

6.3.1 Known Exploits of Quizbowl Questions

Like most qa datasets, qb questions are written for humans. Unfortunately, the heuristics that question authors use to select clues do not always apply to computers. For example, humans are unlikely to memorize every song in every opera by a particular composer. This, however, is trivial for a computer. In particular, a simple qa system easily solves the example in Figure 6.2 from seeing the reference to "Un Bel Di". Other questions contain uniquely identifying "trigger words" (Harris, 2006). For example, "martensite" only appears in questions on steel. For these examples, a qa system need only use a simple and non-generalizable if-then rule.

"The protagonist of this opera describes the future day when her lover will arrive on a boat in the aria 'Un Bel Di' or 'One Beautiful Day'. The only baritone role in this opera is the consul Sharpless who reads letters for the protagonist, who has a maid named Suzuki. That protagonist blindfolds her child Sorrow before stabbing herself when her lover B. F. Pinkerton returns with a wife. For 10 points, name this Giacomo Puccini opera about an American lieutenant's affair with the Japanese woman Cio-Cio San." Answer: Madama Butterfly
Figure 6.2: This question is easily answered by a model after seeing the reference to "Un Bel Di." Our adversarial writing process highlights terms like these, which humans can then modify to make clues more challenging for computers.

One might wonder if this means that factoid qa is thus an uninteresting, nearly solved research problem. However, some qb questions are fiendishly difficult for computers. Many questions have intricate coreference patterns (Guha et al., 2015), require reasoning across multiple types of knowledge, or involve complex wordplay. If we can isolate and generate questions with these difficult phenomena, "simplistic" factoid qa quickly becomes non-trivial.
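The brittleness of such shortcuts is easy to illustrate. The toy rule below mimics the non-generalizable if-then behavior described above; the specific word-to-answer pairs are illustrative examples taken from this section, not an actual deployed system.

```python
# A deliberately brittle "QA system" built only from memorized trigger words.
# It answers stock clues like "martensite" or "Un Bel Di" correctly while
# learning nothing transferable about steel or Puccini.
TRIGGERS = {
    "martensite": "Steel",
    "un bel di": "Madama Butterfly",
}

def trigger_word_guess(question: str):
    question = question.lower()
    for clue, answer in TRIGGERS.items():
        if clue in question:
            return answer
    return None  # no memorized clue matched

print(trigger_word_guess("This alloy's martensite phase forms upon quenching."))
```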
6.3.2 Models and Datasets

We conduct two rounds of adversarial writing. In the first, authors attack the traditional information retrieval (ir) system from Section 5.5.1, which we also used in the nips 2017 qb shared task (Boyd-Graber et al., 2018).

In the second round, authors attack either the ir model or a neural qa model. The neural model is a bidirectional rnn using the gated recurrent unit architecture (Cho et al., 2014b) (§5.5.3). Both models in this round are trained using the full qanta dataset (§5.3). The second round dataset incorporates more diverse answers than the first (25,000 entities versus 11,000 in round one).[3]

[3] The qanta dataset was built contemporaneously with the initial data collection of this chapter, and we used the improved version in the second data collection round.

6.3.3 Interpreting Quizbowl Models

To help write adversarial questions, we expose what the model is thinking to the authors. We interpret models using saliency heat maps: each word of the question is highlighted based on its importance to the model's prediction (Ribeiro et al., 2016).

For the neural model, word importance is the decrease in prediction probability when a word is removed (Li et al., 2016; Wallace et al., 2018). We focus on gradient-based approximations (Simonyan et al., 2014; Montavon et al., 2018) for their computational efficiency. To interpret a model prediction on an input sequence of n words $w = \langle w_1, w_2, \ldots, w_n \rangle$, we approximate the classifier $f$ with a linear function of $w_i$ derived from the first-order Taylor expansion. The importance of $w_i$, with embedding $v_i$, is the derivative of $f$ with respect to the one-hot vector:

$$\frac{\partial f}{\partial w_i} = \frac{\partial f}{\partial v_i} \frac{\partial v_i}{\partial w_i} = \frac{\partial f}{\partial v_i} \cdot v_i. \qquad (6.1)$$

This simulates how model predictions change when a particular word's embedding is set to the zero vector; it approximates word removal (Ebrahimi et al., 2018; Wallace et al., 2018).

For the ir model, we use the ElasticSearch Highlight api (Gormley and Tong, 2015), which provides word importance scores based on query matches from the inverted index.

Figure 6.3: The author writes a question (top right), the qa system provides guesses (left), and explains why it makes those guesses (bottom right). The author can then adapt their question to "trick" the model.

6.3.4 Adversarial Writing Interface

The authors interact with either the ir or rnn model through a user interface[4] (Figure 6.3). An author writes their question in the upper right while the model's top five predictions (Machine Guesses) appear in the upper left. If the top prediction is the right answer, the interface indicates where in the question the model is first correct. The goal is to cause the model to be incorrect or to delay the correct answer position as much as possible.[5] The words of the current question are highlighted using the applicable interpretation method in the lower right (Evidence). We do not enforce time restrictions or require questions to be adversarial: if the author fails to break the system, they are free to "give up" and submit any question. The interface continually updates as the author writes.

[4] https://github.com/Eric-Wallace/trickme-interface/
[5] The authors want normal qb questions which humans can easily answer by the very end. For popular answers (e.g., Australia or Suez Canal), writing novel final give-away clues is difficult. We thus expect models to often answer correctly by the very end of the question.
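For concreteness, the snippet below is a minimal sketch of the gradient-times-embedding importance in Equation 6.1 for a toy bag-of-embeddings classifier; the vocabulary size, model, and tensor names are placeholders rather than the systems behind our interface.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embed_dim, num_answers = 100, 16, 5
embeddings = nn.Embedding(vocab_size, embed_dim)
classifier = nn.Linear(embed_dim, num_answers)

def word_importance(token_ids: torch.Tensor) -> torch.Tensor:
    """First-order Taylor approximation of word removal: the importance of
    word i is the dot product of the gradient of the predicted class score
    with that word's embedding (Equation 6.1)."""
    vecs = embeddings(token_ids)           # (seq_len, embed_dim)
    vecs.retain_grad()
    scores = classifier(vecs.mean(dim=0))  # toy stand-in for the real model
    scores[scores.argmax()].backward()
    return (vecs.grad * vecs).sum(dim=-1)  # one importance score per word

question = torch.tensor([7, 42, 3, 99])   # placeholder token ids
print(word_importance(question))
```

Words with large positive scores under this approximation are the ones the interface highlights most strongly for authors.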
We track the question edit history to identify recurring model failures (§6.6) and to understand how interpretations guide the authors (§6.7).

6.3.5 Question Authors

We focus on members of the qb community: they have deep trivia knowledge and craft questions for qb tournaments (§5.2.3). We award prizes for questions read at live human-computer matches (§6.5.3).

The question authors are familiar with the standard format of qb questions (Lujan and Teitler, 2003). The questions follow a common paragraph structure, are well edited for grammar, and finish with a simple "give-away" clue (§5.2.2). These constraints benefit the adversarial writing process as it is very clear what constitutes a difficult but valid question. Thus, our examples go beyond surface-level "breaks" such as character noise (Belinkov and Bisk, 2018) or syntax changes (Iyyer et al., 2018). Rather, questions are difficult because of their semantic content (examples in Section 6.6).

6.3.6 How an Author Writes a Question

To see how an author might write a question with the interface, we walk through an example of writing a question's first sentence. The author first selects the answer to their question from the training set, Johannes Brahms, and begins:

"Karl Ferdinand Pohl showed this composer some pieces on which this composer's Variations on a Theme by Haydn were based."

The qa system buzzes (i.e., it has enough information to interrupt and answer correctly) after "composer". The author sees that the name "Karl Ferdinand Pohl" appears in Brahms' Wikipedia page and avoids that specific phrase, describing Pohl's position instead of naming him directly:

"This composer was given a theme called 'Chorale St. Antoni' by the archivist of the Vienna Musikverein, which could have been written by Ignaz Pleyel."

This rewrite adds some additional information (there is a scholarly disagreement over who wrote the theme and its name), and the qa system now incorrectly thinks the answer is Frédéric Chopin. The user can continue to build on the theme, writing

"While summering in Tutzing, this composer turned that theme into 'Variations on a Theme by Haydn'."

Again, the author sees that the system buzzes on "Variations on a Theme" with the correct answer. However, the author can rewrite it in its original German, "Variationen über ein Thema von Haydn", to fool the system. The author continues to create entire questions the model cannot solve.

6.4 A New Adversarially-Authored Dataset

Our adversarial dataset consists of 1213 questions with 6,541 sentences across diverse topics (Table 6.1); the data are available at http://trickme.qanta.org. There are 807 questions written against the ir system and 406 against the neural model by 115 unique authors. We plan to hold twice-yearly competitions to continue data collection.

Science                                              17%
History                                              22%
Literature                                           18%
Fine Arts                                            15%
Religion, Mythology, Philosophy, and Social Science  13%
Current Events, Geography, and General Knowledge     15%
Total Questions                                      1213
Table 6.1: The topical diversity of the questions in the adversarially authored dataset based on a random sample of 100 questions.

6.4.1 Validating Questions with Quizbowlers

We validate that the adversarially authored questions are not of poor quality or too difficult for humans. We first automatically filter out questions based on length, the presence of vulgar statements, or repeated submissions (including re-submissions from the qb training or evaluation data). We next host a human-only qb event using intermediate and expert players (former and current collegiate qb players).
We select sixty adversarially authored questions and sixty standard high school national championship questions, both with the same number of questions per category (list of categories in Table 6.1). To answer a qb question, a player interrupts the question: the earlier the better. To capture this dynamic, we record both the average answer position (as a percentage of the question, lower is better) and answer accuracy. We shuffle the regular and adversarially authored questions, read them to players, and record these two metrics.

The adversarially authored questions are on average easier for humans than the regular test questions. For the adversarially authored set, humans buzz with 41.6% of the question remaining and an accuracy of 89.7%. On the standard questions, humans buzz with 28.3% of the question remaining and an accuracy of 84.2%. The difference in accuracy between the two types of questions is not significant (p = 0.16 using Fisher's exact test), but the buzzing position is earlier for adversarially authored questions (p = 0.0047 for a two-sided t-test). We expect the questions that were not played to be of comparable difficulty because they went through the same submission process and post-processing. We further explore the human-perceived difficulty of the adversarially authored questions in Section 6.5.3.

6.5 Computer Experiments

This section evaluates qa systems on the adversarially authored questions. We test three models: the ir and rnn models shown in the interface, as well as a Deep Averaging Network (§5.5.3) to evaluate the transferability of the adversarial questions. We break our study into two rounds. The first round consists of adversarially authored questions written against the ir system (§6.5.1); the second round questions target both the ir and rnn (§6.5.2). Finally, we also hold live competitions that pit the state-of-the-art Studio Ousia model (Yamada et al., 2018b) against human teams (§6.5.3).

6.5.1 First Round Attacks: IR Adversarial Questions Transfer To All Models

The first round of adversarially authored questions targets the ir model, and the questions are significantly harder for the ir, rnn, and dan models (Figure 6.4). For example, the dan's accuracy drops from 54.1% to 32.4% on the full question (60% of original performance).

For both adversarially authored and original test questions, early clues are difficult to answer (accuracy about 10% for the first quarter of the question). However, during the middle third of the questions, where buzzes in qb most frequently occur, accuracy on original test questions rises more quickly than on the adversarially authored ones. For both, the accuracy rises towards the end as the clues become "give-aways".

6.5.2 Second Round Attacks: RNN Adversarial Questions are Brittle

In the second round, the authors also attack an rnn model. All models tested in the second round are trained on a larger dataset (§6.3.2).

A similar trend holds for ir adversarial questions in the second round (Figure 6.5): a question that tricks the ir system also fools the two neural models (i.e., adversarial examples transfer). For example, the dan model was never targeted but had substantial accuracy decreases in both rounds. However, this does not hold for questions written adversarially against the rnn model. On these questions, the neural models struggle but the ir model is largely unaffected (Figure 6.5, right).
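A small sketch of how the accuracy-versus-revealed-text curves plotted in Figures 6.4 and 6.5 could be computed follows; the guess function, data format, and fraction grid are illustrative assumptions, not our actual evaluation code.

```python
def accuracy_curve(questions, guess, fractions=(0.25, 0.5, 0.75, 1.0)):
    """Accuracy after revealing each fraction of a question's words.

    `questions` is a list of (question_text, answer) pairs and `guess`
    is any callable that maps partial question text to a predicted answer.
    """
    curve = {}
    for frac in fractions:
        correct = 0
        for text, answer in questions:
            words = text.split()
            prefix = " ".join(words[: max(1, int(len(words) * frac))])
            correct += guess(prefix) == answer
        curve[frac] = correct / len(questions)
    return curve

# Toy usage with a trivial guesser that always answers "Japan".
data = [("This island nation's capital is Tokyo.", "Japan")]
print(accuracy_curve(data, guess=lambda text: "Japan"))
```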
Figure 6.4: The first round of adversarial writing attacks the ir model. Like regular test questions, adversarially authored questions begin with difficult clues that trick the model. However, the adversarial questions are significantly harder during the crucial middle third of the question.

Figure 6.5: The second round of adversarial writing attacks the ir and rnn models. The questions targeted against the ir system degrade the performance of all models. However, the reverse does not hold: the ir model is robust to the questions written to fool the rnn.

6.5.3 Humans vs. Computer, Live (again)!

In the offline setting (i.e., no pressure to "buzz" before an opponent), models demonstrably struggle on the adversarial questions. But what happens in standard qb: live, head-to-head games?

We run two live humans vs. computer matches. The first match uses ir adversarial questions in a forty question, tossup-only qb format. We pit a human team of national-level qb players against the Studio Ousia model (Yamada et al., 2018b), the current state-of-the-art qb system. The model combines neural, ir, and knowledge graph components (details in Appendix D.2), and won the 2017 nips shared task, defeating a team of expert humans 475-200 on regular qb test questions. Although the team at our live event was comparable to the nips 2017 team, the tables were turned: the human team won handily 300-30.

Our second live event is significantly larger: seven human teams play against models on over 400 questions written adversarially against the rnn model. The human teams range in ability from high school qb players to national-level teams (Jeopardy! champions, Academic Competition Federation national champions, and top scorers in the World Quizzing Championships). The models are based on either ir or neural methods. Despite a few close games between the weaker human teams and the models, humanity prevailed in every match (videos available at http://trickme.qanta.org).

Figure 6.6: Humans find adversarially authored questions about as difficult as normal questions, whether they are rusty weekend warriors (Intermediate), active players (Expert), or the best trivia players in the world (National).

Figure 6.7: The accuracy of the state-of-the-art Studio Ousia model degrades on the adversarially authored questions despite never being directly targeted. This verifies that our findings generalize beyond the rnn and ir models.

Figures 6.6 and 6.7 summarize the live match results for the humans and the Ousia model, respectively. Humans and models have considerably different trends in answer accuracy. Human accuracy on both regular and adversarial questions rises quickly in the last half of the question (curves in Figure 6.6). In essence, the "give-away" clues at the end of questions are easy for humans to answer. On the other hand, models on regular test questions do well in the first half, i.e., the "difficult" clues for humans are easier for models (Regular Test in Figure 6.7).
However, models, like humans, struggle on adversarial questions in the first half.

                        Adversarial   Regular
Unigram overlap         0.40          0.37
Bigram overlap          0.08          0.05
Longest n-gram overlap  6.73          6.87
Average ne overlap      0.38          0.46
  ir Adversarial        0.35
  rnn Adversarial       0.44
Total Words             107.1         133.5
Total ne                9.1           12.5
Table 6.2: The adversarially authored questions have similar n-gram overlap to the regular test questions. However, the overlap of the named entities (ne) decreases for ir Adversarial questions.

6.6 What Makes Adversarially-authored Questions Hard?

This section analyzes the adversarially authored questions to identify the source of their difficulty.

6.6.1 Quantitative Differences in Questions

One possible source of difficulty is data scarcity: the answers to adversarial questions rarely appear in the training set. However, this is not the case; the mean number of training examples per answer (e.g., George Washington) is 14.9 for the adversarial questions versus 16.9 for the regular test data.

Another explanation for question difficulty is limited "overlap" with the training data, i.e., models cannot match n-grams from the training clues. We measure the proportion of test n-grams that also appear in training questions with the same answer (Table 6.2). The overlap is roughly equal for unigrams but surprisingly higher for the adversarial questions' bigrams. The adversarial questions are also shorter and have fewer nes. However, the proportion of named entities is roughly equivalent.

One difference between the questions written against the ir system and the ones written against the rnn model is the drop in nes. The decrease in nes is larger for ir adversarial questions, which may explain their generalization: the rnn is more sensitive to changes in phrasing, while the ir system is more sensitive to specific words.

6.6.2 Categorizing Adversarial Phenomena

We next qualitatively analyze adversarially authored questions. We manually inspect the author edit logs, classifying questions into six different phenomena in two broad categories (Table 6.3) from a random sample of 100 questions, double counting questions into multiple phenomena when applicable.

Composing Seen Clues      15%
Logic & Calculations       5%
Multi-Step Reasoning      25%
Paraphrases               38%
Entity Type Distractors    7%
Novel Clues               26%
Total Questions           1213
Table 6.3: A breakdown of the phenomena in the adversarially authored dataset.

6.6.3 Adversarial Category 1: Reasoning

The first question category requires reasoning about known clues (Table 6.4).

Composing Seen Clues: These questions provide entities with a first-order relationship to the correct answer. The system must triangulate the correct answer by "filling in the blank". For example, the first question of Table 6.4 names the place of death of Tecumseh. The training data contains a question about his death reading "though stiff fighting came from their Native American allies under Tecumseh, who died at this battle" (The Battle of the Thames). The system must connect these two clues to answer.

Logic & Calculations: These questions require mathematical or logical operators. For example, the training data contains a clue about the Battle of Thermopylae: "King Leonidas and 300 Spartans died at the hands of the Persians". The second question in Table 6.4 requires adding 150 to the number of Spartans.

Multi-Step Reasoning: This question type requires multiple reasoning steps between entities. For example, the last question of Table 6.4 requires a reasoning step from the "I Have A Dream" speech to the Lincoln Memorial and then another reasoning step to reach Abraham Lincoln.
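Returning to the overlap statistic in §6.6.1 (Table 6.2), the snippet below is an illustrative sketch of how the proportion of test n-grams that also appear in same-answer training questions could be computed; the tokenization and example strings are simplifying assumptions, not the exact preprocessing behind Table 6.2.

```python
from collections import Counter

def ngrams(text: str, n: int) -> Counter:
    tokens = text.lower().split()
    return Counter(zip(*(tokens[i:] for i in range(n))))

def overlap(test_question: str, train_questions: list, n: int) -> float:
    """Fraction of the test question's n-grams that appear in any
    training question sharing the same answer."""
    test_grams = ngrams(test_question, n)
    train_grams = set()
    for question in train_questions:
        train_grams.update(ngrams(question, n))
    total = sum(test_grams.values())
    if total == 0:
        return 0.0
    matched = sum(c for gram, c in test_grams.items() if gram in train_grams)
    return matched / total

train = ["though stiff fighting came from their native american allies under tecumseh"]
test = "this man died fighting at the battle of the thames"
print(overlap(test, train, n=1))
```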
6.6.4 Adversarial Category 2: Distracting Clues

The second category consists of circumlocutory clues (Table 6.5).

Paraphrases: A common adversarial modification is to paraphrase clues to remove exact n-gram matches from the training data. This renders our ir system useless but also hurts the neural models. Many of the adversarial paraphrases go beyond syntax-only changes (e.g., the first row of Table 6.5).

- Question: "This man, who died at the Battle of the Thames, experienced a setback when his brother Tenskwatawa's influence over their tribe began to fade." Prediction: Battle of Tippecanoe. Answer: Tecumseh. Phenomenon: Composing Seen Clues.
- Question: "This number is one hundred fifty more than the number of Spartans at Thermopylae." Prediction: Battle of Thermopylae. Answer: 450. Phenomenon: Logic & Calculations.
- Question: "A building dedicated to this man was the site of the 'I Have A Dream' speech." Prediction: Martin Luther King Jr. Answer: Abraham Lincoln. Phenomenon: Multi-Step Reasoning.
Table 6.4: The first category of adversarially authored questions consists of examples that require reasoning. Answer displays the correct answer (all models were incorrect). For these examples, connecting the training and adversarially authored clues is simple for humans but difficult for models.

Entity Type Distractors: Whether explicit or implicit in a model, one key component for qa is determining the answer type of the question. Authors take advantage of this by providing clues that cause the model to select the wrong answer type. For example, in the second question of Table 6.5, the "lead-in" clue implies the answer may be an actor. The rnn model answers Don Cheadle in response despite previously seeing the Bill Clinton "playing a saxophone" clue in the training data.

Novel Clues: Some adversarially authored questions are hard not because of phrasing or logic but because our models have not seen these clues. These questions are easy to create: users can add Novel Clues that, because they are not uniquely associated with an answer, confuse the models. While not as linguistically interesting, novel clues are not captured by Wikipedia or qb data, thus improving the dataset's diversity. For example, an author added clues about literary criticism (Hardwick, 1967; Watson, 1996) to a question about Lillian Hellman's The Little Foxes: "Ritchie Watson commended this play's historical accuracy for getting the price for a dozen eggs right (ten cents) to defend against Elizabeth Hardwick's contention that it was a sentimental history." Novel clues create an incentive for models to use information beyond past questions and Wikipedia.

Novel clues have different effects on ir and neural models: while ir models largely ignore them, novel clues can lead neural models astray. For example, on a question about Tiananmen Square, the rnn model buzzes on the clue "World Economic Herald".
- Set: Train. Question: "Name this sociological phenomenon, the taking of one's own life." Prediction: Suicide. Phenomenon: Paraphrase.
- Set: Adversarial. Question: "Name this self-inflicted method of death." Prediction: Arthur Miller. Phenomenon: Paraphrase.
- Set: Train. Question: "Clinton played the saxophone on The Arsenio Hall Show." Prediction: Bill Clinton. Phenomenon: Entity Type Distractor.
- Set: Adversarial. Question: "He was edited to appear in the film 'Contact'. . . For ten points, name this American president who played the saxophone on an appearance on the Arsenio Hall Show." Prediction: Don Cheadle. Phenomenon: Entity Type Distractor.
Table 6.5: The second category of adversarial questions consists of clues that are present in the training data but are written in a distracting manner. Training shows relevant snippets from the training data. Prediction displays the rnn model's answer prediction (always correct on Training, always incorrect on Adversarial).

However, adding a novel clue about "the history of shaving" renders the brittle rnn unable to buzz on the "World Economic Herald" clue that it was able to recognize before.[8] This helps to explain why adversarially authored questions written against the rnn do not stump ir models.

[8] The "history of shaving" is a tongue-in-cheek name for a poster displaying the hirsute leaders of Communist thought. It goes from the bearded Marx and Engels, to the mustachioed Lenin and Stalin, and finally the clean-shaven Mao.

"One of these concepts . . . a Hyperbola is a type of, for ten points, what shapes made by passing a plane through a namesake solid, [removed: that also includes the ellipse, parabola?] [added: whose area is given by one-third Pi r squared times height?]" Prediction: Conic Section (correct) → Sphere (wrong)
Figure 6.8: The interpretation successfully aids an attack against the ir system. The author removes the phrase containing the words "ellipse" and "parabola", which are highlighted in the interface (shown in bold). In its place, they add a phrase which the model associates with the answer Sphere.

6.7 How Do Interpretations Help?

This section explores how model interpretations help to guide adversarial authors. We analyze the question edit log, which reflects how authors modify questions given a model interpretation.

A direct edit of the highlighted words often creates an adversarial example (e.g., Figure 6.8). Figure 6.9 shows a more intricate example. The left plot shows the Question Length, as well as the position where the model is first correct (Buzzing Position, lower is better). We show two adversarial edits. In the first (1), the author
removes the first sentence of the question, which makes the question easier for the ir model (the buzz position decreases). The author counteracts this in the second edit (2), where they use the interpretation to craft a targeted modification which breaks the ir model.

Figure 6.9: The Question Length and the position where the model is first correct (Buzzing Position, lower is better) are shown as a question is rewritten. In (1), the author makes a mistake by removing a sentence, which makes the question easier for the ir model. In (2), the author uses the interpretation, replacing the highlighted word (shown in bold) "molecules" with "species" to trick the model.
Figure 6.10: A failed attempt to trick the neural model. The author modifies the question multiple times, replacing words suggested by the interpretation, but is unable to break the system.

However, models are not always this brittle. In Figure 6.10, the interpretation fails to aid an adversarial attack against the rnn model. At each step, the author uses the highlighted words as a guide to edit targeted portions of the question yet fails to trick the model. The author gives up and submits their relatively non-adversarial question.

6.7.1 Interviews With Adversarial Authors

We also interview the adversarial authors who attended our live events (interviews at https://youtu.be/MzM1oNsm8MQ). Multiple authors agree that identifying oft-repeated "stock" clues was the interface's most useful feature. As one author explained, "There were clues which I did not think were stock clues but were later revealed to be". In particular, the author's question about the Congress of Vienna used a clue about "Kraków becoming a free city", which the model immediately recognized.

Another interviewee was Jordan Brownstein (https://www.qbwiki.com/wiki/Jordan_Brownstein), a national qb champion and one of the best active players, who felt that computer opponents were better at questions that contained direct references to battles or poetry. He also explained how the different writing styles used by each qb author increase the difficulty of questions for computers. The interface's evidence panel allows authors to read existing clues, which encourages these unique stylistic choices.

6.8 Related Work

New datasets often allow for a finer-grained analysis of a linguistic phenomenon, task, or genre. The lambada dataset (Paperno et al., 2016) tests a model's understanding of the broad contexts present in book passages, while the Natural Questions corpus (Kwiatkowski et al., 2019) combs Wikipedia for answers to questions that users trust search engines to answer (Oeldorf-Hirsch et al., 2014). Other work focuses on natural language inference, where challenge examples highlight model failures (Wang et al., 2019b; Glockner et al., 2018; Naik et al., 2018). Our work is unique in that we use human adversaries to expose model weaknesses, which provides a diverse set of phenomena (from paraphrases to multi-hop reasoning) that models cannot solve.

Other work puts an adversary in the data annotation or postprocessing loop. For instance, Dua et al. (2019) and Zhang et al. (2018b) filter out easy questions using a baseline qa model, while Zellers et al. (2018) use stylistic classifiers to filter language inference examples. Rather than filtering out easy questions, we instead use human adversaries to generate hard ones. Similar to our work, Ettinger et al. (2017) use human adversaries. We extend their setting by providing humans with model interpretations to facilitate adversarial writing.
Moreover, we have a ready-made audience of question writers to generate adversarial questions.

The collaborative adversarial writing process reflects the complementary abilities of humans and computers. For instance, "centaur" chess teams of both a human and a computer are often stronger than a human or computer alone (Case, 2018). In Starcraft, humans devise high-level "macro" strategies, while computers are superior at executing fast and precise "micro" actions (Vinyals et al., 2017). In nlp, computers aid simultaneous human interpreters (He et al., 2016a) at remembering forgotten information or translating unfamiliar words.

Other approaches to adversarial evaluation of nlp models (Section 6.2) typically target one phenomenon (e.g., syntactic modifications) and complement our human-in-the-loop approach. Since the original collection of this adversarial dataset, several other datasets have incorporated a similar adversarial human-in-the-loop mechanism (§2.2.3).

6.9 Conclusion

One of the challenges of machine learning is knowing why systems fail. This work brings together two threads that attempt to answer this question: visualizations and adversarial examples. Visualizations underscore the capabilities of existing models, while adversarial examples, crafted with the ingenuity of human experts, show that these models are still far from matching human prowess.

Our experiments with both neural and ir methodologies show that qa models still struggle with synthesizing clues, handling distracting information, and adapting to unfamiliar data. Our adversarially authored dataset is only the first of many iterations (Ruef et al., 2016). As models improve, future adversarially authored datasets can elucidate the limitations of next-generation qa systems.

While we focus on qa, our procedure is applicable to other nlp settings where there is (1) a pool of talented authors who (2) write text with specific goals. Future research can look to craft adversarially authored datasets for other nlp tasks that meet these criteria. For tasks without access to experts, subsequent work in a similar spirit has shown centaur authorship to still be helpful (§2.2.3). In the next chapter, we also discuss the role of adversarial authorship as one (very useful) tool in the dataset construction toolbox.

Throughout this thesis we improve qa evaluation: this chapter focuses on improving evaluation data so that it more robustly tests qa models for intelligent behavior. In the next chapter, we combine ideas from adversarial centaur authorship with the irt model developed in Chapter 4 and propose future work to further guide human authors towards writing more discriminative questions. While building towards this future work, we describe a larger view of dataset construction and position centaur authorship as one tool in a greater toolbox for creating better evaluation data.

Chapter 7: Conclusion

"The proliferation of competing articulations, the willingness to try anything, the expression of explicit discontent, the recourse to philosophy and to debate the fundamentals, all these are symptoms of a transition from normal to extraordinary research."
Thomas S. Kuhn on Paradigm Shifts, The Structure of Scientific Revolutions

We began this thesis by introducing a new conceptual framework for understanding two branches of qa that have similar yet distinct goals (Chapter 2).
The first centers on satisfying users of information-seeking systems (the Cranfield paradigm), while the second aims to build qa systems that exhibit intelligent behavior (the Manchester paradigm). At the core of this framework is the question: for what purpose are we building qa datasets, models, and evaluations?

Although these goals overlap, they motivate very different research agendas. For example, in curiosity-driven information-seeking (Chapter 3), user satisfaction goes beyond whether an answer is correct to whether the answer inspires user engagement through followup inquiry. This is quite different from Quizbowl, where qa systems (or humans) should yield the correct answer, for the correct reason (Chapter 5). Neither paradigm is superior to the other, but we should recognize that their goals are different. For example, adversarially authored questions (Chapter 6) are based on the (Manchester) idea that qa models should correctly answer conceptually similar questions, even if worded differently; otherwise, it would indicate a lack of understanding.[1]

[1] However, answering correctly does not necessarily imply understanding.

After identifying this paradigm difference, we explore ways to improve qa evaluations through better data and formats. In particular, one problem this thesis addresses is that, because current qa evaluations are static and unchanging, they become less discriminative. Our improvements are grounded in the idea that, as qa models continue to improve, evaluations should have built-in mechanisms for retaining the ability to distinguish between better and worse models. Quizbowl, for example, retains discriminative ability through its incremental format (Chapter 5). Rather than the binary correct/wrong distinction of other tasks, qb is unique in that questions, by construction, test for multiple ability levels and so take longer to become "stale." In cases where we do not control the "freshness" of examples, we advocate using testing methodologies like Item Response Theory to dynamically score models (Chapter 4). However, these ideas are only a beginning in rethinking how evaluations can best support qa research.

Just as trec shaped the direction of ir research (Voorhees et al., 2005), shared benchmark tasks have shaped the course of machine learning and nlp research (Dotan and Milli, 2020). To make an analogy, shared tasks are akin to the large and expensive shared experiments, instrumentation, and laboratories of the physical sciences, like the Human Genome Project, telescopes (e.g., Hubble and Kepler), particle colliders such as the lhc, and the International Space Station. These shared resources set the course of research for many years at a time and reflect the scientific priorities and values of their creators. As an example, the investment in qa within trec "generated a resurgence of research on question answering" (Voorhees et al., 2005, p. 12). Taking a wider view, arguably the Cranfield experiments (Cleverdon, 1967) began a Kuhnian paradigm shift (Kuhn, 2012)[2] in ir and eventually nlp that was subsequently accelerated by trec. A similar shift occurred in machine learning more broadly with "the unreasonable effectiveness of data" (Halevy et al., 2009), which was further accelerated by ever-growing datasets (Sun et al., 2017) and computational resources (Kaplan et al., 2020).

[2] Thomas Kuhn's concept of paradigm shifts posits that science proceeds in four phases: (1) normal, incremental science under an established paradigm (e.g., Newtonian Mechanics); (2) in an attempt to explain significant anomalies in the current paradigm, extraordinary research rapidly pushes the boundaries of science forward under a new paradigm (e.g., General Relativity); (3) the new and old paradigms are debated for which "should in the future guide research problems" (Kuhn, 2012, p. 157); and (4) the new paradigm becomes the dominant paradigm.
However, there is increasing awareness that these successes are not without their dangers, whether in nlp (Hovy and Spruit, 2016; Bender et al., 2021) or in adjacent fields like computer vision (Birhane and Prabhu, 2021). Across both of these fields, the dangers and harms include, but are certainly not limited to, propagating the racial (Blodgett et al., 2016; Buolamwini and Gebru, 2018; Merullo et al., 2019) and gender (Cao and Daumé, 2020) biases of society (Friedman and Nissenbaum, 1996). This is most prominently seen in the creation of the acm Conference on Fairness, Accountability and Transparency in 2018, along with the veritable wealth of work in the analysis, interpretation, and evaluation of neural models (Belinkov and Glass, 2019). Thus, we are arguably at the precipice of another Kuhnian paradigm shift: towards evaluation that looks beyond optimizing headline benchmark numbers in chase of state-of-the-art effectiveness (Linzen, 2020), and towards a paradigm that incorporates the use and impact of nlp models (Mitchell et al., 2019; Blodgett et al., 2020) into evaluations. Next, we propose how shared benchmark tasks (leaderboards) can fuse the ideas of this thesis with these values to improve qa evaluation.

In both industrial practice (Holstein et al., 2019) and academic benchmark tests, a potent place to intervene is in the construction of datasets and benchmark tasks, since this often sets the research agenda for many years thereafter, and how data is curated has substantial real-world impacts (Rogers, 2021). For example, the first iteration of the EfficientQA shared task (Min et al., 2021) explicitly incorporated efficiency as a core value (Schwartz et al., 2020), which yielded substantially smaller systems (Lewis et al., 2021).

The remainder of this thesis outlines two larger directions for future work and several extensions. The first two directions build on the ideas of dynamically updating evaluations (Chapters 4 and 6) and rethinking how various stakeholders should engage with leaderboards. The subsequent extensions focus on adapting additional irt methodology into nlp evaluations.

7.1 Future Work: Living Evaluations

A central theme of this thesis is improving the discriminative power of evaluations through additional annotation (Chapter 3), dynamically weighting examples (Chapter 4), incremental and pyramidal formats (Chapter 5), and adversarial evaluation data (Chapter 6). These are all in service of a greater vision: evaluations should, from the onset, be built to evolve as progress is made on the task(s) of interest. Conceptually, these and other methods are tools in a larger toolbox of ways to improve the efficacy of evaluations. Taking this analogy further, our toolbox is somewhat disorganized. Although each particular tool may bias the data in one way or another, they are all useful since they still expand the scope of natural language phenomena that we test.
We envision a data collection and evaluation framework that combines Item Response Theory and adversarial annotation in a format where examples test one or more levels of knowledge.

7.1.1 Incorporating Content Models into irt Models

The first step towards this framework builds on the idea of irt for qa evaluation by incorporating a content model. A glaring drawback of standard irt methods is that, to infer example properties such as difficulty or discriminability, they require scored predictions from at least a few models. Suppose we obtain a new example that is nearly identical to an existing example. To irt, these examples have no relationship until we infer it by seeing that models perform similarly on both examples. The missing ingredient is a content model to tell us that these examples are similar and thus should have similar irt properties.

For inspiration, we look towards political science, where Ideal Point Models (Poole and Rosenthal, 1985), which are mathematically equivalent to irt, are used to predict the political leanings of legislators ("subjects") based on their voting records ("responses") on bills ("items"). As in our case, political scientists also endeavour to incorporate the textual similarities of bills to infer voting patterns (Gerrish and Blei, 2011; Nguyen et al., 2015; Kraft et al., 2016).

To combine content models with irt models, we propose two extensions to the irt model in Chapter 4: (1) integrating example metadata (Card et al., 2018) like topical category and (2) incorporating a topic model (Blei et al., 2003). As with any Bayesian model, the primary changes will be to the generative story (§4.2). For metadata like topical category, we will assume an uninformative prior distribution and make previously entirely latent variables like difficulty and skill depend on the metadata prior. For example, each topical category could be associated with an independent mean difficulty, which would allow the model to associate particular topical categories with different difficulty distributions (e.g., perhaps literature questions are more difficult).

The second extension builds on this concept by integrating a topic model into the irt model. Specifically, we link the irt and topic models through the latent properties of items like difficulty and discriminability. Since it is reasonable to presume that specific concepts have intrinsic difficulties, the latent difficulty of a particular example should inform both which topic is selected and the words used in that example. Thus, as before, the generative story first draws the properties and then draws the values for dependent random variables, in this case the responses and topic model features. In both extensions, standard methods for inferring distributions of new "documents" (examples) could be applied to solve the problem that standard irt treats each example as an entirely new data point. However, they also provide the additional benefit of improving the interpretability of irt models, which we use next to guide example creation.
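Before moving on, one possible formalization of the metadata extension, written in the two-parameter logistic style, is shown below; the priors and symbols are an illustrative sketch rather than a committed model specification, with c_j denoting the topical category of item j:

```latex
\begin{align*}
\mu_c &\sim \mathcal{N}(0, \sigma_\mu^2)      && \text{mean difficulty of category } c\\
b_j &\sim \mathcal{N}(\mu_{c_j}, \sigma_b^2)  && \text{difficulty of item } j\\
a_j &\sim \mathcal{N}(1, \sigma_a^2)          && \text{discriminability of item } j\\
\theta_i &\sim \mathcal{N}(0, 1)              && \text{skill of subject } i\\
p(y_{ij} = 1 \mid \theta_i, a_j, b_j) &= \frac{1}{1 + \exp\left(-a_j(\theta_i - b_j)\right)}
\end{align*}
```

Under such a prior, categories with systematically harder questions would be captured by a larger category mean, and a brand-new example inherits a sensible difficulty estimate from its category before any model responses are observed.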
7.1.2 Guiding Example Creation

Perhaps the most ambitious version of an irt-based tool for creating examples is one that does it all: users specify the desired difficulty of a question, and the tool outputs the example. However, a major challenge with this approach is that generating coherent text is a hard problem in its own right. We avoid this challenge and instead propose a tool that uses content-based irt models to guide human annotators in creating new examples. During annotation, the content-based irt model could infer example characteristics (e.g., topic category) that increase the likelihood of matching a pre-specified difficulty. For example, perhaps questions on a particular topic, or questions that mention particular words or entities, tend to be easy. As with centaur authoring (Chapter 6), this combines the strengths of machines, like inferring distributional statistics, with the strengths of humans, like authoring questions.

For this task, the metadata-based and topic-based views have complementary strengths. The topic-based irt models provide finer-grained word-level guidance but have the drawback that they can only suggest words seen during training. Perhaps history questions mentioning "Taíno" tend to be more difficult, but if a model does not see the word during training it cannot suggest it. To cover for this weakness, metadata models could specify properties like the topical category or the entity type of the answer, which could help human annotators narrow down the type of question to write.

Although the irt model synthesizes which types of questions leaderboard models find easy or hard, it is not a direct replacement for having a model in the loop to test whether questions are too easy (and thus potentially useless). In Chapter 6, although we employed multiple models, they were not used at the same time; in this followup work, we propose using multiple models from the leaderboard for adversarial checking, which would also provide feedback to the irt model. Although we certainly should collect new examples as leaderboards evolve, we should not so readily commit underperforming examples to the deep.

7.1.3 Tender Loving Care for Underperforming Examples

An important step towards improving evaluations is not simply collecting new datasets to fix the problems of existing datasets, but fixing problems in existing datasets. Towards this goal, methods like irt and those derived from training dynamics (Swayamdipta et al., 2020; Pleiss et al., 2020) identify "bad examples." In a living evaluation, "bad examples" are sent to annotators, where they are either discarded as unfixable or fixed through editing. A third possibility exists though: an example could be perfectly reasonable, just too easy.

7.1.4 Multi-Examples

For examples that are "too easy," it is difficult to know whether the underlying linguistic, knowledge, or reasoning skill being tested is genuinely too easy or whether simply this particular instance is too easy. An elegant solution to this problem is to test the concept of interest multiple times and only mark it correct if all variations are correct. As Chapter 5 discusses, qb questions test knowledge multiple times, with the unique aspect that each subsequent test is easier and easier. qb's version of this idea is that if a model answers a question correctly at a certain position, it should also answer the question correctly at all subsequent points; this intuition is encoded in the expected wins metric (§5.7.1). Gardner et al. (2020a) introduce a similar idea where expert annotators create multiple versions of test examples, and examples are only marked correct if all sub-examples are correct. We propose that, rather than selecting these examples randomly, we prioritize transforming weak examples into multi-examples.
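A minimal sketch of the all-or-nothing scoring rule for multi-examples described in §7.1.4; the model interface and example variants are hypothetical.

```python
def multi_example_correct(model, variants):
    """A multi-example counts as correct only if the model answers every
    variant of the underlying concept correctly."""
    return all(model(question) == answer for question, answer in variants)

# Toy usage: three paraphrased probes of the same fact.
variants = [
    ("Which planet is closest to the sun?", "Mercury"),
    ("Name the innermost planet of the solar system.", "Mercury"),
    ("The sun's nearest planetary neighbor is which planet?", "Mercury"),
]
print(multi_example_correct(lambda question: "Mercury", variants))
```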
7.1.5 Effects of Continually Updating Evaluations

A substantial "drawback" of living evaluations is that precisely because they are ever-changing, it is more challenging to compare models across time. Perhaps a current state-of-the-art model today is no longer state-of-the-art tomorrow once bad examples are fixed or transformed into multi-examples. First, we can mitigate this problem by requiring that runnable models are submitted to leaderboards. This is already done by several tasks (e.g., squad and NaturalQuestions) and should be adopted more broadly (Crane, 2018). Since runnable models are submitted, leaderboard organizers can rerun models as evaluation data change. A secondary benefit of this requirement is improved reproducibility; reproducing evaluation predictions and scores becomes trivial. Second, perhaps a continually updating evaluation will discourage research that solely optimizes metrics without providing insight into why a method works (Linzen, 2020). Even in industry applications where effectiveness may hold more sway, it is still important to understand the contributions and limitations of methods; otherwise we risk unpredictable and potentially harmful deployments (Buolamwini and Gebru, 2018; Wallace et al., 2019a; Carlini et al., 2020). We will soon take this idea further and propose ways to change leaderboards so that they naturally encourage deeper inspection of data and models at both single-model scale and across all models.

7.1.6 Model Cards

Several authors advocate for more detailed reporting of model characteristics (Mitchell et al., 2019) and associated training data (Gebru et al., 2018; Bender and Friedman, 2018). Their motivation, which we whole-heartedly agree with, is to improve transparency and accountability, and documenting models when submitting to leaderboards should be required. Although to some this may seem onerous and cumbersome, we can make it more attractive to model developers by providing something in return. Thus far, we have discussed ways to incorporate information about the content of examples into irt, but have not considered integrating model properties.

In ir, Query Performance Prediction (Carmel and Yom-Tov, 2010) infers the characteristics of models that are most predictive of system effectiveness for particular inputs (queries). As a starting place, self-reported data from model cards (e.g., architecture, number of parameters, datasets used during training) could provide information for irt or other similar methods to use. This information could be incorporated into the irt generative story or analyzed with standard statistical methods like inspecting the feature importances of a linear model. By requiring runnable models, though, we also open the door to more sophisticated analysis that might use aggregate data like score distributions or robustness to newly annotated examples. If leaderboards helped model developers determine which characteristics, across all submitted models, are most effective, then this may provide an intrinsic incentive for good model documentation. Next, we further develop the idea that leaderboards should do more than rank models according to one or more metrics.

7.2 What More Should Leaderboards Do?

Taking a broader perspective, leaderboards are truly shared tasks in disguise. When benchmarks like squad are adopted by a community they implicitly become shared tasks. Shared tasks, whether they are explicitly or implicitly created, have been instrumental to progress in nlp in areas like speech recognition (Pallett, 2003), information retrieval (Voorhees et al., 2005), and question answering. Nissim et al.
(2017) succinctly describe nlp shared tasks as revolving ?around two aspects: research advancement and competition. [They] see research advancement as the driving force and main goal behind organizing them. Competition is an instrument to encourage and promote participation. However, just because these two forces are intrinsic to shared tasks does not mean that they always act in the same direction.? Leaderboardized shared tasks have swung too far towards optimization of tar- get effectiveness metrics. While this has certainly driven rapid progress in nlp 136 modeling, as Bender et al. (2021) argues, Where much effort has been allocated to making models (and their train- ing data) bigger and to achieving ever higher scores on leaderboards often featuring artificial tasks, we believe there is more to be gained by focus- ing on understanding how machines are achieving the tasks in question. Sadly, sometimes it is because of competition that shared task participants who do not win feel describing their systems or insights is not worth their time (Parra Es- cart?n et al., 2017; Pedersen, 2019). At the same time, while there has been a great interest in the analysis of nlp models (Belinkov and Glass, 2019) and providing tooling to test robustness (Goel et al., 2021), there has not yet been an effort to close the loop and bringing these elements into benchmark tasks and thus into the view of model developers (Linzen, 2020). We argue that leaderboards should play a central role in re-aligning their implementation with their original intent: to advance research. Our re-imagining of leaderboards takes inspiration from computer visualization by identifying (1) the stakeholders of leaderboards, (2) the goals of these stakeholders expressed as abstract tasks, and (3) how leaderboard components contribute towards these tasks (Munzner, 2014, p. 43?45). We will first introduce the stakeholders and then discuss how each of their goals are satisfied in one or more of three leader- board visualizations. These three visualizations include a ranking-centric view, a model-centric view, and an example-centric view. By re-designing leaderboards with explicit goals in mind, we can use software tooling as a means to lower the barriers in achieving those research goals (Myers et al., 2000). 7.2.1 Stakeholders Ethayarajh and Jurafsky (2020) take a similar user-centric approach to de- fine the utility of leaderboards, but limit their analysis to nlp practitioners who submitted their models to leaderboards. Here, we take a wider view inspired by value-sensitive design (Friedman et al., 2008) and identify several direct and indi- rect leaderboard stakeholders, including the public, academia, industry, participants, and task organizers. The Public To the press and thus the public, leaderboards are a prominent means to communicate progress in ai technology but carry the risk of overstating the capabilities and understating the limitations of such technology. Focusing on the public as a stakeholder, leaderboards should faithfully communicate the progress actually made, emphasize its limitations, and convey uncertainty in each of these. Unsurprisingly, to this stakeholder, the ranking view will have the most importance (?7.2.2). Academia The role of leaderboards in academia is?for better or worse?unmistakable: obtaining state-of-the-art effectiveness is often a means to, if not an implicit require- ment to publish research papers at top-tier venues. 
Consequently, this motivates 137 some participants to submit models as a step towards publishing papers that ul- timately earn reputational prestige. With the growth of nlp and adjacent areas though, this model is unsustainable, and the living evaluation we propose muddies the entire concept of ?state-of-the-art.? Taking a step back, in any re-design of a rankings view, we should recognize that when optimization of a specific metric becomes the primary goal?as opposed to the original task?that leaderboards are vulnerable to the effects of Goodhart?s Law (Strathern, 1997). In their current form, one value of benchmarks is in unifying evaluation data so that approaches can be more easily compared?a goal that does not require the selection of specific, pos- sibly value-laden metrics (Dotan and Milli, 2020). We will argue that they should also help identify fertile research areas by highlighting characteristics of difficult examples (?7.2.4) and properties of effective models (?7.2.3). Industry Where academia is famously data-poor, industry is data-rich, and by providing suitable data for tasks of interest, industry shapes the course of academic research. For example, ir datasets like aol search queries (Pass et al., 2006), trec tasks using commercial search engine queries, ms marco, NaturalQuestions, and others encourage research where academic and industry interests intersect. Along similar lines, government programs like iarpa often contribute funding, datasets, and evaluation metrics to advance research of interest to the us government.3 These stakeholders are differentiated from academia in that they have a financial incentive in ensuring that progress on target metrics translates to advances that ultimately improve products (e.g., the Netflix Prize ?7.2.2). Thus, to this stakeholder, the ability to robustly measure progress resulting in practical impacts is of great value.4 Participants Next are the participants of benchmarks tasks. Since we have al- ready discussed leaderboards as sota rubber stamps, here we focus on participants seeking to improve their models. Somewhat paradoxically, leaderboards are perhaps least supportive of this group; aside from listing effectiveness metrics like accuracy, they?generally?do not provide any other features. The model-centric (?7.2.3) and example-centric (?7.2.4) leaderboard views will be designed to help this group find ways to improve their models. Organizers Lastly, task organizers are primarily concerned with managing the lifecycle of the shared task.5 To this group, providing insight into the health of the benchmark and how to fix issues is key. Towards this goal, features that aid in identifying and fixing bad examples are helpful. Therefore, to this group, the example-centric view (?7.2.4) will have the most value. Next, we describe the three proposed leaderboard webpage views and how they each meet stakeholder goals. 3The iarpa material program supports work in low-resource, multilingual information re- trieval. 4For example, material evaluation sets are regularly updated, with each using a new low- resource language to encourage work that is not overly reliant on any specific language pair. 5Here we consider goals independent of their academic or industry role(s). 138 7.2.2 Ranking View The standard role of leaderboards is to facilitate the comparison of models through aggregated metrics representative of the shared task?s goal(s). 
In most cases, there is a single metric of interest such as accuracy, but even when single metrics reflect the eventual end task, as in many industry-sponsored shared tasks, they seldom integrate the full spectrum of concerns in the end task. In the Netflix Prize (Bennett and Lanning, 2007), both the winning entry and its followup made heavy use of ensemble methods (Koren, 2009), which were too difficult to use scalably and efficiently in production deployments (Amatriain and Basilico, 2012). That said, the competition was successful in producing research that led Netflix to incorporate successful innovations into production deployments. The missing element of the competition is that not all the values of interest were incorporated, in this case simplicity and efficiency. The first goal we focus on is to reward many types of research progress by increasing the diversity of metrics used in leaderboards. First though, we need to identify at least a few metrics to start with.

Metrics To start, we propose that leaderboards should integrate metrics for effectiveness, robustness, generalizability, efficiency, and fairness. Since effectiveness metrics are status quo, we will not discuss these beyond noting they should be appropriate to the end task. Towards motivating efficient ai, the GreenAI initiative (Schwartz et al., 2020) recommends counting the total number of floating-point operations as a proxy for measuring carbon impact. However, there is plenty of room for alternatives; for example, the EfficientQA (Min et al., 2021) shared task focused on decreasing the memory footprint of models, which is important for mobile deployments (Sun et al., 2020). For fairness, we can begin with demographic parity (Barocas et al., 2019, p. 9-10),6 but there are also many alternatives, and each leaderboard should think critically about which to use. While incorporating these metrics is not the only way or even the best way to integrate alternative values into evaluations, it represents a first step.

6 For example, effectiveness across a pre-defined set of partitions should be the same (e.g., accuracy for questions about men versus women).

To integrate a robustness metric, we should make use of the aforementioned idea of multi-examples (§7.1.4). We initially presented this concept as a way to augment existing examples through human annotation, requiring all sub-examples to be correct for the full example to be correct. This also naturally lends itself as the basis for a robustness metric; for example, the ratio of correct to incorrect multi-examples could be a robustness metric. In addition to human-created multi-examples, we should also employ techniques that use rule-based methods to generate additional examples (Jia and Liang, 2017; Ribeiro et al., 2018). Along similar lines, behavioral testing (Ribeiro et al., 2020) could also contribute towards a robustness score while helping to identify capability-based tests.

Generalizability is already a topic of interest in leaderboards through multi-task leaderboards. Examples include superglue (Wang et al., 2019a) and mrqa (Fisch et al., 2019). The challenge in these multi-task benchmarks is aggregating multiple effectiveness metrics into a single, representative number.
[Figure 7.1 graphic: a mock leaderboard landing page with separate Effectiveness, Efficiency, Fairness, and Robustness ranking panels (rows such as "model a, .95, Inspect"), plus "Rank Models by" and "Define Sort Constraints" controls and a table with Name, Effectiveness, Efficiency, Fairness, and Robustness columns.]
Figure 7.1: The ranking view and landing page of leaderboards should convey that the community values multiple types of progress. By highlighting the models according to different metrics, the ranking view de-emphasizes the importance of any single metric and encourages more thought into deciding what the concept of "best model" means. Lastly, rather than highlight only the highest-scoring models, we shift towards highlighting clusters of comparable models; for example, group all models whose scores are not, statistically speaking, different.

The most common approach averages task scores, but this implicitly assumes that each task is equally easy. As alluded to earlier (§4.9), dynamically weighting scores by difficulty is a strength of the irt models from Chapter 4. Preliminary experiments (Appendix B.6) show this has promise, but that correlating task effectiveness with dimensions is challenging. Conceptually, the problem is that the generative model places no prior on the distribution of task difficulty in latent space. As future work, we propose changing the generative story to use Dirichlet priors so that difficulty is more likely to be concentrated in fewer dimensions (Teh et al., 2006). Additionally, a Bayesian domain adaptation approach could model per-task parameters separately while tying them together through a hierarchical Bayesian prior (Finkel and Manning, 2009). Hopefully, this will improve the interpretability of, and thus confidence in, using irt skill parameters as a way to aggregate multi-task effectiveness.

Table of Ranked Models Ultimately, the ranking view needs to contain a table showing metrics for each model. The question is how to rank models in a way that does not re-introduce the problems we are trying to mitigate, such as sota-chasing at the expense of everything else (Rogers, 2019). We propose a middle-ground approach to designing the ranking view.

First, we depart from the standard layout of leaderboards (displaying a ranked list of models by a metric) and instead show four groups of models (assuming the leaderboard evaluates effectiveness, efficiency, fairness, and robustness). This change serves to de-emphasize the importance of any single metric (and its underlying value) and place more emphasis on alternative ways to improve models.

Each group shows the top model according to the associated metric but groups it with all models that are considered comparable (Figure 7.1). For example, a common problem with current leaderboards is that the difference between top-performing models tends to be small, even to the point of being statistically indistinguishable. In our proposed ranking view, statistical significance is one way to determine if models are comparable. For an efficiency-based metric, perhaps model comparability is defined by a pre-specified trade-off between accuracy and computational cost (e.g., a one-point drop in accuracy is acceptable for a ten-point improvement in runtime). More generally, users (or task organizers) should consider trade-offs that align with their notions of utility (Ethayarajh and Jurafsky, 2020).

Beyond tabular rankings, we should look towards new ideas in computer visualization. For example, when using mixed utility functions, this information could be conveyed with a hybrid between a table and a bar chart (Gratzl et al., 2013).
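As one illustration of how such comparable groups could be computed for an accuracy-style metric, the sketch below assumes the leaderboard stores per-example correctness for every submission and uses McNemar's test, one of several reasonable paired tests (the irt-based test from §7.3.2 could be substituted). The function names and grouping strategy are hypothetical.

import numpy as np

def mcnemar_significant(correct_a, correct_b):
    """Paired McNemar test (with continuity correction) on 0/1 correctness
    vectors; returns True if the two models differ at roughly alpha = .05."""
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    only_a = int(np.sum(a & ~b))   # examples only model A answers correctly
    only_b = int(np.sum(~a & b))   # examples only model B answers correctly
    if only_a + only_b == 0:
        return False
    stat = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    return stat > 3.841            # chi-squared critical value, 1 degree of freedom

def comparable_group(correctness):
    """correctness: {model_name: 0/1 array over the shared evaluation set}.
    Returns the top-accuracy model together with every model that is not
    statistically distinguishable from it."""
    ranked = sorted(correctness, key=lambda m: np.mean(correctness[m]), reverse=True)
    top = ranked[0]
    return [top] + [m for m in ranked[1:]
                    if not mcnemar_significant(correctness[top], correctness[m])]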
To academia and the public, this design emphasizes that there are multiple ways to advance research and the importance of balancing these factors. The secondary benefit of this design is in explicitly integrating uncertainty by identifying which groups of models are comparable. To industry stakeholders, the leaderboard better serves their goals by supporting the means to define utility while rewarding focus on particular aspects of the task (e.g., efficiency versus accuracy). From the organizer's perspective, rewarding different kinds of progress may also incentivize a wider variety of shared task system descriptions.

7.2.3 Model View

Leaderboards should have a model-centric view whose purpose is to guide model developers in improving their models. Frequently, the development of ml models is an iterative process of diagnosing problems and fixing them until the practitioner is satisfied with the quality of the model (Patel et al., 2008a). Thus, diagnostic tools are a means to "naturally lead developers towards doing the right thing" (Patel et al., 2008a). Depending on which tools we design, we can guide developers in different directions. As with the ranking view, there are many existing visualization techniques for model debugging, including interpretations of single models (Wallace et al., 2019c; Wu et al., 2019), pairwise model comparison (Zhang et al., 2019), multi-model comparison (Zhang et al., 2019), attribute-based comparison (Arendt et al., 2021), example-based comparison (Amershi et al., 2015), and others. Some of these tools should play a role in a model-based view, but a unique aspect of a leaderboard-based tool is access to all other models. An open challenge in this area is exploring options that help compare a submitted model to other models in aggregate (e.g., perhaps by treating other models as an ensemble).

7.2.4 Example View

The third view leaderboards should support centers on identifying patterns in the evaluation data. Since task organizers are concerned with the health of the shared task, this view should aim to identify and help fix inadequacies of the test data. A first step towards this should include incorporating the irt methods used to identify bad examples (§4.5), but it should extend to other ways to identify bad examples. For example, Swayamdipta et al. (2020) use the training dynamics of neural models to identify hard-to-learn examples, which often correspond to labeling errors. Since their method relies on training dynamics, in its standard form it does not provide characteristics of development or test examples, but it could easily be modified to do so. K-fold validation would allow the development data to be used (temporarily) as training data to produce the desired quality metrics on test data. The example view should make it easy for organizers or participants to find these examples and validate whether they are flawed. For private evaluation data, this obviously must be done by organizers (or annotators), but this seems like a reasonable cost to pay for better evaluations.

7.3 Future Directions in Item Response Theory for NLP

Thus far, we have briefly covered a few ways to improve irt for nlp; here, we enumerate two additional directions.

7.3.1 Multidimensional Clustering

A glaring weakness of Chapter 4's irt models is their reliance on one-dimensional parameter values. It is quite conceivable that parameters like difficulty and skill could vary along more than one dimension: say, math difficulty and literature difficulty.
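A minimal sketch of the post-hoc cluster analysis we have in mind follows; it assumes a fitted multidimensional irt model that exposes a parameter vector per item (e.g., stacked difficulty and discriminability dimensions) plus a task label for each item, and the choice of k-means with one cluster per sub-task is purely illustrative.

import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_item_parameters(item_vectors, task_labels, n_clusters=6, seed=0):
    """item_vectors: (n_items, n_dims) array of multidimensional item parameters;
    task_labels: the source task of each item. Clusters items in irt-parameter
    space and reports how tasks distribute over clusters."""
    clusters = KMeans(n_clusters=n_clusters, random_state=seed,
                      n_init=10).fit_predict(np.asarray(item_vectors))
    table = {c: Counter() for c in range(n_clusters)}
    for cluster, task in zip(clusters, task_labels):
        table[cluster][task] += 1
    # Ideally each cluster is dominated by one sub-task (e.g., one mrqa dataset).
    return table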
Although we did train multi-dimensional models (§4.2), initial experimentation failed to correlate specific dimensions with example features, so we only reported numerical effectiveness results. An alternative approach to pre-specified features is clustering models or examples by multi-dimensional irt parameters and doing post-hoc analysis of the clusters.

Ideally, multi-dimensional clustering on a multi-task benchmark like mrqa (Fisch et al., 2019) would show a relationship between clusters and sub-tasks. Unfortunately, initial investigation (Figure 7.2) shows only limited correlation with one of the six mrqa tasks. The main challenge of future work is to investigate why no particular dimension seems to correlate with specific tasks. Our preliminary hypothesis is that nothing in the generative model encourages skills associated with particular tasks to accumulate in a single dimension versus spread equally across all dimensions. Thus, a modified generative model that uses Dirichlet priors to encourage more meaningful latent skill distributions (§7.2.2) may fix this problem and allow for better empirical investigation of which skills are related to each other.

[Figure 7.2 graphic: t-SNE scatter plot of multidimensional item parameters; axes TSNE Dimension 0 and TSNE Dimension 1; legend of mrqa tasks: BioASQ, DROP, DuoRC, RACE, RelationExtraction, TextbookQA.]
Figure 7.2: In mrqa, tsne shows a relationship between whether the task is NarrativeQA with respect to multidimensional difficulty and discriminability. The multidimensional irt model uses six dimensions to match the six mrqa tasks.

7.3.2 Statistical Testing

In irt, a useful concept is the informativeness of examples towards inferring a subject's latent skill (Equation (4.2) in §4.4.5). Chapter 4 uses this to adaptively select examples for annotation. One may ask a related question: how much information (and by extension how many annotations) is required to estimate a subject's skill to a particular precision with a certain degree of confidence? In human testing, this is important for determining things like how long an exam should be; longer exams provide more information but are more costly to administer.

This question is directly related to statistical testing in nlp to determine whether the difference between two systems is statistically significant or insignificant. Although including such tests in publications is strongly recommended (Dror et al., 2018), most statistical tests in nlp publications only account for variation within whole model runs; however, when evaluation sets are shared, it would be more appropriate to use paired tests that examine how a collection of models score on shared examples. For example, reasonable statistical tests include Student's t-test, the sign test, McNemar's test, the Wilcoxon signed-rank test (Wilcoxon, 1945) (previously recommended by Demšar (2006)), Pitman's permutation test, and the paired bootstrap test. Card et al. (2020) investigate how these and other statistical tests are used in nlp and find that in many cases the tests are under-powered (unable to distinguish between better and worse models). As we will briefly outline, an irt-based statistical test brings several unique properties that future work should characterize empirically compared to existing tests.

The irt test differs in two significant ways from other tests: (1) it does not assume that items are equally informative, and (2) it assumes that the informativeness of items is a function of the subject's skill θj.
Under these assumptions, each item provides information about the true value of a subject's latent skill, and as we accumulate more evidence, the reliability of this estimate improves (Tague-Sutcliffe, 1992). As a consequence of skill-dependent information, the statistical error associated with the estimate also varies with skill. In irt, this error is called the standard error of estimation (see) σ(θ̂ | θ) (De Ayala, 2013, p. 30) and is related to the Fisher information

I_i(\theta) = \frac{(p_i')^2}{p_i(1 - p_i)}   (7.1)

of each item; previously, we used item information to adaptively sample examples for annotation (§4.4.5). For a 2PL model, information

I_i(\theta) = \alpha^2 p_i(1 - p_i)   (7.2)

is maximized when p_i = (1 - p_i). Since Fisher information is additive, the information of the evaluation set is maximal when items have a 50% chance of being responded to correctly. As derived by De Ayala (2013, p. 102), the standard error of estimation

\mathrm{SEE}(\theta) = \frac{1}{\sqrt{\sum_i I_i(\theta)}}   (7.3)

is computed by accumulating the information gained from each item. Given two subjects X and Y, the probability distribution of score differences

N(\theta_Y - \theta_X,\ \mathrm{SEE}(\theta_X)^2 + \mathrm{SEE}(\theta_Y)^2)   (7.4)

can be used to compute the probability that the difference in skill is statistically significant; for example, a difference of greater than two standard errors corresponds to an α ≈ .05 significance level. Future work should investigate how the different assumptions the irt test makes influence experimental results. Ultimately, perhaps the main challenge in this line of work is in determining which statistical test is "correct," which is difficult or impossible to determine with real-world data; it is likely the investigation into comparing these tests will include simulations where differences are a priori known to be significant or insignificant.

7.4 Reflections and Synthesis

Throughout this thesis, I have woven a story, and I believe it would be instructive to explain how I came to that story. As an early PhD student, my first step towards working on qa evaluation was building the qa system that would eventually defeat accomplished trivia players in a live exhibition match. Much as this excited me, it was somewhat unsatisfying that a system built on pattern matching, and therefore distinctly not "intelligent," had been sufficient to defeat skilled humans. This led to asking "should we interpret the victory of a model that is, by construction, not intelligent as a success for qa?" and eventually the better question of "how should we define success for qa?" In an effort to not allow machines to "cheat" their way to victory through statistical pattern matching, Chapter 6 and its adversarial questions suggested that if models do not rely on "spurious" correlations, then that is success. In direct contrast, the Curiosity dataset in Chapter 3 defined success as satisfying the user. The missing ingredient through all these questions is that defining success is a value judgment, and I was attempting to answer this question without considering values.

This realization led me to develop the idea to view qa through the lenses of the Cranfield and Manchester paradigms. Crucially, each of these paradigms defines what it values and by doing so defines its own notion of success. Viewing qa evaluation from the Turing Test-inspired Manchester paradigm, the theoretical underpinnings for qb's pyramidal and incremental format became clear.
Methodologically speaking, qb tests for multiple, different levels of knowledge understanding which directly incorporated the idea that questions (clues in qb) are not equally informative of ability. This line of thought eventually grew into using irt for qa evaluation, but was not the only reason for adopting irt. The other attractive feature of irt is the additional insight it gives into the data and models in leaderboards. By making leaderboards?as a tool?more helpful (i.e., not just a ranked list of models), perhaps this would accelerate progress towards better qa systems. In thinking through how leaderboards might be made more helpful, the ques- tion of what goals and thus values to optimize for resurfaces. In revisiting the question, it is also crucial to internalize that widely adopted shared tasks often set the goals and focus entire communities of research for multiple years at a time. By identifying stakeholders and what they value, it makes leaderboards more effective in advancing research. And while we are thinking critically about the values in- tegrated into leaderboards, we can also prioritize new values?like efficiency and fairness?and thereby reward research prioritizing those values. Throughout, the underlying and driving questions in this thesis have been: how do we define ?better? and the necessarily dependent question of how to accurately measure that. Towards the goal of building qa systems that exhibit intelligent behavior, this thesis advocates for specific methods to improve the format, data, and scoring in qa evaluations. Although these methods are perhaps specific to improving qa evaluations, the broader message of this thesis is to always think critically about the purpose and goals of computer systems. 145 Appendix A: Curiosity This appendix and those that follow are each paired with a particular chapter of this thesis. In these appendices, we primarily include: (1) details important for reproducibility, but that are uninteresting and break the flow of the main text and (2) additional examples or annotations. In this appendix, we provide additional details of the interface, dataset, and models in Chapter 6. A.1 Components of Dialog Interfaces This section provides short descriptions and screenshots of every component of the user and assistant dialog interfaces. A.1.1 User?s Interface Figure 3.3 shows the interface that we use to sample the user?s prior knowledge of entities related to the topic. To derive a diverse sample, we use Wikipedia page views as a proxy for how well known the entity is. All experiments use the English Wikipedia dump generated on July 23, 2019. We divide entity mentions into ten buckets based on the frequency of page views, and round-robin sample fifteen entities from those buckets. The interface is shown before the user starts chatting with the assistant. We elicit how ?interesting? a user finds each of the assistant?s messages through the like button in Figure 3.4. Only users can ?like? a message; the assistant cannot ?like? user messages. Users are instructed to ?like? messages if they are ?interesting, informative and/or entertaining? and ?relevant to their topic and/or aspects.? They are specifically instructed not to ?like? messages that are devoid of factual content, only express feelings, or only contain greetings or farewells. Switching Aspect Users are randomly assigned two aspects for each dialog and told to spend time discussing each. 
The guidelines instruct them to spend at least two turns per topic, but we do not specify any further time requirements. When the user changes aspects, we instruct them to click a button (Figure A.1) to indicate when and which aspect they are switching to. Additionally, this event triggers a reset in the context we use to rank the assistant?s facts. 146 Figure A.1: The user is assigned two aspects about their topic. After they are sat- isfied with what they have learned about the first aspect, they click a button and switch to the next aspect. While the button click is not communicated to the assis- tant (the user must send a corresponding message), it resets the fact contextualizer; we observe that without this, too many facts were related to the previous aspect. Figure A.2: A short topic description is always visible to the assistant. The goal is to ensure the assistant always has a general understanding of the dialog topic. A.1.2 Assistant Interface By design, we intend for most workers to not be familiar in depth with most of the geographic topics. Thus, the most important responsibility of the assistant interface is to provide enough information?without overwhelming them?to be en- gaging conversational partners. The first interface shown is a short description of the topic from either Simple Wikipedia or the English Wikipedia. This component helps the assistant reach a general understanding of the topic so that they can choose better facts. The most important component of the assistant interface is their list of avail- able facts. These facts have high textual similarity with the most recent three turns and are broken into three categories: facts related to entities the user knows about (rooted facts), facts related to an aspect (aspect facts), and facts from anywhere on the page (general facts). Feedback from pilot collections showed that six facts was too few which caused a lack of relevant facts, but twelve facts overwhelmed annotators. Thus, we use nine facts so that we can also balance equally across each 147 type of fact. When composing their reply, the assistant can use any number of facts as in Figure 3.5. To discourage verbatim copying, we disable the paste feature in javascript. We also drop repeatedly unused facts. A.2 Dialog Act Annotation To annotate dialog acts, we create a separate annotation interface (Figure A.3). The interface shows one dialog at a time, and the same annotator annotates all the utterances. In addition to the utterances, the interface shows the topic, aspects, and sender of each message. Lastly, we incorporate a ?Report Dialog? feature to help identify and remove inappropriate dialogs. Figure A.3: To annotate dialog acts, we develop an interface that showed each utterance on a separate line. Annotators assign zero or more dialog acts to each utterance using grouped dropdowns. A.3 Sample Dialogs Tables A.1 and A.2 show Curiosity dialogs and highlight the dataset?s features. Typos and grammatical errors made by annotators are left unaltered. 148 A.4 Paraphrase Analysis and Samples In Section 3.3.1.3, we describe the results of a manual analysis on two hundred and fifty assistant paraphrases. Annotations were completed by the authors and shown in Table 3.3. We break messages into four categories: paraphrases, copies, errors, and unrelated. Paraphrases include messages that incorporate the selected fact and possibly additional information. 
Copies include verbatim copying, cherry- picked phrases, and trivial contextualizations like replacing an entity with a pronoun. A.5 Like Prediction Comparison Like prediction is the one task where charm was not the best model. To better understand the differences between the charm and bert model, we randomly sample thirty dialogs in the test set and find assistant messages where the model predictions disagree. Of the 202 assistant messages in the thirty dialogs, the bert like prediction is correct 174 times (86.1%) and charm 170 times (84.2%). The models disagree on eight predictions (3.96%); bert is correct on six of these and charm on the other two. Table A.4 shows the messages, predictions, and labels. Although we might guess at why the predictions differ (e.g., perhaps something in the Florida dialog confused bert), without further analysis we cannot establish causality. A.6 Model Training, Implementation, and Computation Models were developed on a single machine with eighty Intel 2.0GHz cpus, 256gb ram, and eight Tesla V100 graphics cards. Each model was trained and evaluated on a single graphics cards with hyper-parameter sweeps parallelized across the eight cards. Our best model (charm), has 26, 970, 475 parameters, takes two hours and eighteen minutes to train, and early stops on epoch fifteen. Allennlp configuration files and software dependencies (including version) are included with our code at github.com/facebookresearch/curiosity. A.7 MS Marco Conversational Sample Queries Conversational ms marco is a search dataset that partially inspired this work. Assistant messages should prompt followup queries like in Table A.5. 149 Topic: Djibouti Aspects: Economy, Politics Known Entities: France, Africa Dialog Acts Messages U request topic Hi can you tell me about Djibouti? A inform response The country of Djibouti is officially called the Republic of Djibouti is located in the horn of Africa. It was created out of French So- maliland. It is also recorded the country was colonized by France in1800s and later gained its independence on June 27, 1977. U request aspect Can you tell me about their Politics A inform response After its independence, The country had two regiments com- manded by the French officers. The country also maintains its close ties with Somalia, Ethiopia, France and the United States. U request aspect Any other information about the French and their impact of the politics? A inform response There are no direct information on its effect on the country?s pol- itics but even though the country is already independent. The 13th Demi-brigade of the French Foreign Legion is still stationed in this country. It is also the largest french military presence abroad. commanded by a 3-star general. U request aspect Heavy French infeleunce. what other interesting inofrmation on politics do you have? A inform response In recent years, the country has improved its training techniques, military command, and information structure. They also became more self-reliant in supplying their military and assist in peace- keeping missions and provide military assistance to countries that officially ask them for help. U request aspect, Wow Interesting, Can you now tell me about the economy they feedback positive have? A offer aspect Sorry, I do not have that information. But I can tell you more about Politics. U offer accept Sure A inform response Camp Lemonnier is rented by the United State as a Naval Expe- ditionary Based for $63 Million a year. 
While Japan and France each pays $30 million a year and China pays $20 Million a year. U request other Lastly, any any fun facts? A inform response I am not sure if this is a fun fact, but the country?s gross domestic product expanded by more than 6%. From $341 Million to 1.5 Billion U That?s a huge increase. thank you for all your help A You are welcome Table A.1: Example dialog #1 from Curiosity. (U: User, A: Assistant) 150 Topic: British Columbia Aspects: Government and politics, Culture Known Entities: Canada, Seattle Dialog Acts Messages U request topic Hi! Can you help me learn some basic information about British Columbia? I don?t know much except that it?s located in Canada. A inform response Yes, British Columbia is the westernmost province of Canada and is located between the Rocky Mountains and the Pacific Ocean. U request aspect, I didn?t know it was on the coast! What can you tell me about feedback positive government and politics there? A inform response One interesting fact about the government is that the Green Part plays a larger role in this province than it does in other provinces of Canada. U request followup, Interesting. What can else you tell me about the Green Party? feedback positive A inform response The New Democratic Party and the Green Party caucuses to- gether control 44 seats. Which seems like a lot but the British Columbia Green Party only takes up 3 of those 44 seats. U request aspect That?s a pretty small influence. Can you tell me some fun culture facts about British Columbia? A I am sorry I do not have any information on their culture right now. U request topic That?s okay. What other fun facts can you share? A inform response Interestingly, Queen Victoria chose British Columbia to distin- guish what was the British sector of the Columbia District from the United States which became the Oregon Territory on August 8, 1848. U request aspect So that?s why it has "British" specifically as part of it?s name! Makes sense. Are there any sports or outdoor activities that are popular in British Columbia? A inform response Horseback riding is enjoyed by many British Columbians. U Thanks for your help today. Now I know more than I did before. A No problem, it was a pleasure. Table A.2: Example dialog #2 from Curiosity. (U: User, A: Assistant). After mentioning the Green Party, the user asks a specific followup question; we use these interactions to estimate implicit preference. 151 Label: paraphrase-correct Message: I do not have the information available but I can tell you 10 million tones were exported annually for, Cardiff dock in the years leading up to the First World War. Fact: In the years leading up to the First World War , more than 10 million tonnes of coal was exported annually from Cardiff Docks . Label: copy Message: Media and journalists in Namibia are represented by the Namibia chapter of the Media Institute of ?southern Africa and the Editors Forum of Namibia. Fact: Media and journalists in Namibia are represented by the Namibian chapter of the Media Institute of Southern Africa and the Editors ? Forum of Namibia . Label: paraphrase-multiple Message: The film industry has a rich history in Indonesian culture and the Indonesian Film Festival has been held regularly since 1955. Fact: Indonesia has held annual film festivals and awards , including the Indonesian Film Festival ( Festival Film Indonesia ) that has been held intermittently since 1955 . Label: context Message: EU means European Union. 
GDP per capital ranges from 188% of EU average in Bratislava to 54% in eastern Slovakia Fact: GDP per capita ranges from 188 % of EU average in Bratislava to 54 % in Eastern Slovakia . Label: paraphrase-correct Message: Sudan was considered the 17th growing economy in the world. Its faster devel- opment of the country from oil profits. Fact: Sudan , officially the Republic of the Sudan , is a country in Northeast Africa . Label: paraphrase-correct Message: There is a mountain range called the Annamite Range that runs along eastern border to Vietnam and another range, named Luang Prabang Range, near the Thailand border. Fact: The Mekong River forms a large part of the western boundary with Thailand , where the mountains of the Annamite Range form most of the eastern border with Vietnam and the Luang Prabang Range the northwestern border with the Thai highlands . Label: copy Message: In 1603, the kingdoms of England, Scotland and Ireland were united in a personal union when James VI, king of Scots, inherited the crowns of England and Ireland and moved his court from Edinburgh to London; each country nevertheless remained a separate political entity and retained its separate political, legal and religious institutions. Would you like to know anything else? Fact: In 1603 , the kingdoms of England , Scotland and Ireland were united in a personal union when James VI , King of Scots , inherited the crowns of England and Ireland and moved his court from Edinburgh to London ; each country nevertheless remained a separate political entity and retained its separate political , legal , and religious institutions . Label: copy Message: The A31 connects to the A35 at Bere Regis , and passes east through Wimborne and Ferndown to Hampshire where it later becomes the M27 Fact: The A31 connects to the A35 at Bere Regis , and passes east through Wimborne and Ferndown to Hampshire , where it later becomes the M27 . Label: context Message: I?m not sure of that. One fact I do have is that by 1857, Chicago was Illinois? largest city. Fact: By 1857 , Chicago was Illinois ?s largest city . Label: paraphrase-correct Message: On March 2, 2008, Dmitry M1e5d2vedev was elected President of Russia and Vladimir Putin became Prime Minister. Fact: On 2 March 2008 , Dmitry Medvedev was elected President of Russia while Putin became Prime Minister . Table A.3: A random sample of ten manually labeled paraphrases from the assistant. The top row indicates the label we (the authors) annotated, the middle row the message, and the bottom row the original fact from Wikipedia. The original fact is shown as displayed to crowd-workers including punctuation tokenization. Liked Correct Model Message No bert You are welcome! Yes bert I?m sorry I don?t have anymore information about the etymol- ogy of Tunisia, but what I can tell you is that Tunisia Sports City is a whole sports city being constructed in Tunis Yes bert Yes Buddhism is a dominant influence in Lao culture. It has been great helping you. Yes charm Florida is a state in the southeast United States. What would you like to know? Yes bert They have an average daily temperature of 70.7, it?s the warmest state in the U. S. Yes charm Yes, I can. Florida is nicknamed the ?Sunshine State?, but severe weather is a common occurrence. Yes bert Hello, Indonesia is part of the Malay Islands and is in South- east Asia. Would you like to know more about the history? Yes bert I do not have etymologic information, would you like to know more about the economy? 
I can tell you thank Indonesia develops military and commuter aircraft. Table A.4: To compare like prediction between models, we randomly sample thirty dialogs and obtain predictions from charm and bert. The table only shows mes- sages where the model predictions disagree and indicates which model was correct. Dialogs are delineated by horizontal lines. Unfortunately, from only these examples we cannot determine why the charm model errors in most of these predictions. 153 Query What is a physician?s assistant? What are the educational requirements required to become a physician?s assistant? What does the education to become a physician?s assistant cost? What?s the average starting salary of a physician?s assistant in the UK? What?s the average starting salary of a physician?s assistant in the US? What school subjects are needed to become a registered nurse? What is the physician?s assistant average salary vs a registered nurse? What the difference between a physician?s assistant and a nurse practitioner? Do nurse practitioners or physician?s assistant?s make more? Is a physician?s assistant above a nurse practitioner? What is the fastest way to become a nurse practioner? How much longer does it take to become a doctor after being a nurse practitioner? What are the main breeds of goat? Tell me about boer goats. What goat breed is good for meat? Are angora goats good for meat? Are boer goats good for meat? What are pygmy goats used for? What goat breed is the best for fiber production? How long do Angora goats live? Can you milk Angora goats? How many Angora goats can you have per acre? Are Angora goats profitable? Table A.5: An exemplar query chain from the conversational variant of ms marco. An ideal assistant should answer these questions and inspire these types of followup questions. 154 Appendix B: Leaderboard This appendix supports Chapter 4 by showing additional squad examples, feature descriptions, reproducibility details, minor results related to the work?s main experiments, and discussions of multidimensional clustering and statistical testing. B.1 SQuAD Item Examples On our project page at irt.pedro.ai, we include annotations from Figure 4.10 in human and machine-readable formats and provide an interactive web interface to inspect the parameters of the irt-base, irt-disc, and irt-feas models. Figure B.4 shows the feasibility distribution corresponding to Figure 4.1. B.2 Logistic Regression Features The linear model (?4.4.2) includes features based on item ids, subject ids, textual features of the question, context, and answer, and topic model features. Table B.1 lists the feature names from Figure 4.4 with descriptions of each. When irt features or the statistics features are used, they include interaction terms with themselves. B.3 IRT Model Type Correlation Although each irt model differs in expressiveness, they should?in general? produce similar results. This is confirmed by computing the Kendall?s rank corre- lation between the subject abilities and item difficulties (Table B.2). B.4 Ranking Stability Experiments Here we provide further details for the ranking stability experiments (?4.4.2.3). First, we filter from the 161 subjects that have development set scores to the 115 that also have test set scores.1 In our simulation, we run 10 trials for every sample size; sample size begin at 100 and advances by 100. In addition to these, we also run for sample sizes 25, 50, and 75. Since each sample can be no larger than half the dataset, we stop at half the dataset. 
1 The squad organizers curate the test set subjects to avoid overfit, garbage, or duplicate submissions.

Discriminability: -9.63   Difficulty: -0.479   Feasibility: 0.614   Mean Exact Match: 0.472
Wikipedia Page: Economic inequality
Question ID: 572a1c943f37b319004786e3
Question: Why did the demand for rentals decrease?
Official Answer: demand for higher quality housing
Context: A number of researchers (David Rodda, Jacob Vigdor, and Janna Matlack), argue that a shortage of affordable housing, at least in the US, is caused in part by income inequality. David Rodda noted that from 1984 and 1991, the number of quality rental units decreased as the demand for higher quality housing increased (Rhoda 1994:148). Through gentrification of older neighbourhoods, for example, in East New York, rental prices increased rapidly as landlords found new residents willing to pay higher market rate for housing and left lower income families without rental units. The ad valorem property tax policy combined with rising prices made it difficult or impossible for low income residents to keep pace.
Figure B.1: The example from squad with the lowest discriminability. Surprisingly, it had a negative discriminability, implying that the less skilled a subject is, the more likely their response is to be correct.

B.4.1 Development and Test Set Correlations

Table B.3 uses an irt-disc model since we noticed that, in comparison, irt-feas overfit the data, yielding worse results. The correlations with the full data are all strong, but not the same. Taking all these results together, we conclude that, at least on squad, irt rankings are modestly more reliable than classical rankings.

B.5 The IRT Statistical Test

The irt test differs in two substantial ways from other tests: (1) it does not assume that items are equally informative and (2) it does assume that the informativeness of items is a function of the subject's skill θj. In the literature, this is closely connected to reliability (Tague-Sutcliffe, 1992), and each item provides information about the location of θj; as we accumulate more evidence for the location of θj, the error of estimation decreases. It is a well known result in irt that the standard error of estimate (see) σ(θ̂ | θ) varies with respect to the agent location parameter θ (De Ayala, 2013, p. 30) and is connected to the Fisher information

I_i(\theta) = \frac{(p_i')^2}{p_i(1 - p_i)}   (B.1)

of each item. For a 2PL model, information

I_i(\theta) = \alpha^2 p_i(1 - p_i)   (B.2)

is maximized when p_i = (1 - p_i). Since Fisher information is additive, the information of the evaluation set is maximal when items have a 50% chance of being responded to correctly. As derived by De Ayala (2013, p. 102), the standard error of estimation

\mathrm{SEE}(\theta) = \frac{1}{\sqrt{\sum_i I_i(\theta)}}   (B.3)

is computed by accumulating the information gained from each item. Given two subjects X and Y, one can use the probability distribution of score differences

N(\theta_Y - \theta_X,\ \mathrm{SEE}(\theta_X)^2 + \mathrm{SEE}(\theta_Y)^2)   (B.4)

to compute the probability that the difference in skill is greater than two standard errors, which corresponds to an α ≈ .05 significance level.
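A minimal numerical sketch of Equations B.1-B.4 follows; it assumes point estimates of each subject's skill and fitted 2PL item parameters (difficulty and discriminability), and the helper names are illustrative rather than part of the released code.

import math

def item_information(theta, difficulty, discriminability):
    """Fisher information of a 2PL item at skill theta (Equation B.2)."""
    p = 1.0 / (1.0 + math.exp(-discriminability * (theta - difficulty)))
    return discriminability ** 2 * p * (1.0 - p)

def standard_error(theta, items):
    """SEE(theta) = 1 / sqrt(sum_i I_i(theta)) (Equation B.3);
    items: iterable of (difficulty, discriminability) pairs."""
    total = sum(item_information(theta, b, a) for b, a in items)
    return 1.0 / math.sqrt(total)

def skill_difference_significant(theta_x, theta_y, items, z=1.96):
    """Compare theta_Y - theta_X against N(0, SEE_X^2 + SEE_Y^2) (Equation B.4);
    z = 1.96 corresponds to roughly an alpha = .05 significance level."""
    se = math.sqrt(standard_error(theta_x, items) ** 2 +
                   standard_error(theta_y, items) ** 2)
    return abs(theta_y - theta_x) > z * se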
Discriminability: 2.1   Difficulty: 2.38   Feasibility: 0.995   Mean Exact Match: 0.00621   Mean F1: 0.546
Wikipedia Page: European Union Law
Question ID: 57268f2bf1498d1400e8e3c4
Question: What reform was attempted following the Nice Treaty?
Official Answer: an attempt to reform the constitutional law of the European Union and make it more transparent
Context: Following the Nice Treaty, there was an attempt to reform the constitutional law of the European Union and make it more transparent; this would have also produced a single constitutional document. However, as a result of the referendum in France and the referendum in the Netherlands, the 2004 Treaty establishing a Constitution for Europe never came into force. Instead, the Lisbon Treaty was enacted. Its substance was very similar to the proposed constitutional treaty, but it was formally an amending treaty, and, though it significantly altered the existing treaties, it did not completely replace them.
Figure B.2: This example shows that the answer span is likely too large, causing models to fail in both squad's exact match and F1 metrics.

B.6 Multidimensional IRT Clustering

While we achieve strong held-out accuracy with the 10-dimensional irt (irt-vec), we had limited success in interpreting parameters. We use tsne2 plots overlaid with features like item accuracy, the question's Wikipedia page, whether the question was answerable, length of questions, and topic model weights. Of these, item accuracy and answerability showed the most obvious patterns (Figure B.5).

2 We use openTSNE (Poličar et al., 2019) with default parameters.

Discriminability: 8.01   Difficulty: -1.41   Feasibility: 0.939   Mean Exact Match: 0.64   Mean F1: 0.667
Wikipedia Page: Normans
Question ID: 56de10b44396321400ee2595
Question: Who did the Normans team up with in Anatolia?
Official Answer: Turkish forces
Context: Some Normans joined Turkish forces to aid in the destruction of the Armenians vassal-states of Sassoun and Taron in far eastern Anatolia. Later, many took up service with the Armenian state further south in Cilicia and the Taurus Mountains. A Norman named Oursel led a force of "Franks" into the upper Euphrates valley in northern Syria. . .
Figure B.3: This highly discriminative question succeeds because there are many plausible answers. For example, although only "Turkish forces" is correct, some models answer "the Armenian state."

Feature: Description
All: All the features
irt: irt values for difficulty, discriminability, feasibility, and ability
Item id: The item's id
Subject id: The subject's id
Question: Question words
Context: Context words
Stats: Lengths of question and context; answerability, answer position and length; difficulty from Sugawara et al. (2017)
Subject & Item id: Item and Subject id
Topics 1K: Topic weights of question words
Title: Wikipedia page title words
Baseline: No features, majority class baseline
Table B.1: The linear model integrates a variety of features to determine which are most predictive of a subject responding correctly to an item.

B.7 Reproducibility Checklist

Here we provide reproducibility details to complement our source code and data release at https://irt.pedro.ai.

B.7.1 Software and Parameters

All reported irt models are implemented in PyTorch (Paszke et al., 2019) and Pyro (Bingham et al., 2018). Linear models are trained with Vowpal Wabbit (Agarwal et al., 2014). The topic model that generated features for the linear model uses Mallet (McCallum, 2002).

The irt models have parameters proportional to the number of subjects m and the number of items n. The irt-base has one parameter per subject and one parameter per item.
The irt-disc has one parameter per subject and two parameters per item. The irt-feas has one parameter per subject and three parameters per item. The irt-vec has ten parameters per subject and thirty parameters per item.

[Figure B.4 graphic: histogram of item feasibility values; x-axis Probability of Feasibility (β), y-axes Count and Percentage.]
Figure B.4: The feasibility parameter β of our irt model represents the probability that an example is unsolvable. For example, annotation error could lead to an example always being scored incorrectly, regardless of how good the model is. In squad 2.0, β < .434 in the 5% percentile, β < .698 for the 7.5%, and β < .931 in the 10% percentile.

B.7.2 Hyperparameters

We did not invest significant effort in hyper-parameter tuning of the irt models and instead used the defaults in the py-irt software3 provided by Lalor et al. (2019). The irt-base, irt-disc, and irt-feas models were trained for 1000 epochs with no early stopping conditions and a learning rate of 0.1 with adam (Kingma and Ba, 2015). The irt-vec model was trained for 2500 epochs and used 10 dimensions.

In the linear model, we used a Hyperopt-based (Bergstra et al., 2013) tool provided by Vowpal Wabbit4 for hyper-parameter search. For each lm, the tool spent 20 iterations optimizing the learning rate, L2 regularization, and number of bits against the logistic loss function. The learning rate was searched from .001 to 10 with loguniform sampling, L2 regularization from 1e-8 to 1, and bits from 20 to 23 as categorical variables. The topic model that generated features for the linear model used Mallet, and we followed the recommendations of the software5 to set hyper-parameters. Specifically, we used an optimization interval of 10, removed stop words, trained for 1000 iterations, and used a document topic threshold of 0.05. Each document comprised the Wikipedia page title and the question text.

3 https://github.com/jplalor/py-irt
4 https://github.com/VowpalWabbit/vowpal_wabbit/blob/master/utl/vw-hyperopt.py
5 http://mallet.cs.umass.edu/topics.php

Ability        irt-feas   irt-disc   irt-base
irt-feas       1.00       0.947      0.895
irt-disc       0.947      1.00       0.907
irt-base       0.895      0.907      1.00
Table B.2: Table entries are Kendall's τ rank correlation of irt subject ability between rows and columns. Generally, the models agree on the ranking, with irt-feas and irt-disc having the strongest correlation.

              EMdev    EMtest   Abilitydev   Abilitytest
EMdev         1.00     0.953    0.954        0.931
EMtest        0.953    1.00     0.944        0.947
Abilitydev    0.954    0.944    1.00         0.950
Abilitytest   0.931    0.947    0.950        1.00
Table B.3: Entries are Kendall's rank correlation between rows and columns. Scores are squad Exact Match (EM) and irt-disc ability.

B.7.3 Computational Resources

The majority of experiments were conducted on a single workstation with an Intel i7-7700K cpu, 47gb of ram, and an Nvidia 1080Ti. The average runtime for the irt-feas model on cpu is 113 seconds with a standard deviation of 2.31 over 5 trials. The average runtime of the irt-vec model on gpu is 110 seconds with a standard deviation of 0.5 over 5 trials. Since each ranking stability trial (§4.4.3.1) required re-training an irt-feas model on each subset, we parallelized this experiment on a cpu cluster where each trial received two cpu cores and 16gb of ram. In total, this included 520 trials, which corresponds to twice that many trained irt models since one model is trained on each subset of the data.
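Returning to the search described in B.7.2, the same loop can be sketched directly with Hyperopt; train_and_score here is a hypothetical placeholder standing in for training a Vowpal Wabbit model and returning its held-out logistic loss, and this sketch is not the vw-hyperopt tool itself.

import numpy as np
from hyperopt import Trials, fmin, hp, tpe

def train_and_score(learning_rate, l2, bits):
    # Placeholder: train a Vowpal Wabbit logistic model with these settings
    # and return held-out logistic loss. Replace with real training code.
    return 0.0

space = {
    "learning_rate": hp.loguniform("learning_rate", np.log(0.001), np.log(10.0)),
    "l2": hp.loguniform("l2", np.log(1e-8), np.log(1.0)),
    "bits": hp.choice("bits", [20, 21, 22, 23]),
}

def objective(params):
    return train_and_score(params["learning_rate"], params["l2"], params["bits"])

# 20 iterations of TPE search, mirroring the setup described above.
best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=Trials())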
[Figure B.5 shows two tsne panels (Answerable and Not Answerable) that plot items by tsne Dimension 0 and Dimension 1, colored by mean item accuracy.]

Figure B.5: In squad, tsne shows a relationship between mean exact match (item accuracy) and answerability with respect to multidimensional difficulty and discriminability.

Appendix C: Quizbowl

This appendix provides additional details on preprocessing and model features in Chapter 5.

C.1 Preprocessing

This section provides a detailed description of preprocessing on the question dataset.

C.1.1 Aligning and De-duplicating Questions

Since we obtain machine-readable versions of questions from two online sources, it is necessary to ensure that we do not include the same question twice. We use the metadata associated with each question, such as tournament and year. As part of our preprocessing we manually align the values of these fields.1 We use these fields to ensure that questions for each tournament and year are included only once.

1 We also align category and sub-category fields.

C.1.2 Textual Preprocessing

Models should define their own textual preprocessing, so we only preprocess the text to remove qb-specific artifacts. Most of these artifacts are instructions to the moderator or organizer, such as "MODERATOR NOTE:", "Description Required", "15 pts:", or a reference to the category of the question; we use regular expression rules to remove these. Since we report results on accuracy after the first sentence of each question, we also provide a set of canonical sentence tokenization indices computed using spacy.2

2 https://spacy.io

C.1.3 Fold Assignment

We divide qanta dataset questions into training, development, and test folds based on the competitiveness and year of the source tournament. Since championship tournaments typically have the highest quality questions, we use questions from championship tournaments 2015 and onward as development and test sets. All other questions are used as the training set.

Table 5.3 shows the divisions of each fold; each train, dev, or test fold is assigned to be used for either determining what to answer (guessing) or when to answer (buzzing). Questions in "guess" folds are used for developing question answering systems as in Section 5.5. Questions in the "buzz" folds are used for developing agents that decide when to answer as in Section 5.6.

When we assign folds to qb questions, we aim to create useful splits for guessing and buzzing while preserving the integrity of the development and test sets. Namely, when we create the test and development folds, the assignment does not depend on whether gameplay data exists for a question. This unconditional assignment would only be a problem if it left too few questions with gameplay data, which we do not find to be the case.

For test set questions this is easily accomplished with an implicit quality filter and a temporal split: we use only questions from national championship tournaments, assign questions from 2016 to the buzzing test set, and assign questions from 2017 and 2018 to the guessing test set. Following this, we pair the test fold for buzzing with gameplay data and are fortunate that the number of questions is not small. To create the development sets, we use questions from 2015, which are randomly split with equal probability into guessing-specific and buzzing-specific folds. Similarly to the test set, we associate gameplay data after this assignment occurs to preserve its integrity against any bias that conditioning on having gameplay data would introduce.

For the training data, we make a weaker attempt to eliminate bias in favor of ensuring that the training folds for guessing and buzzing are large enough. We first divide the training questions with an 80/20 split. Questions in the eighty percent split are assigned to the guessing fold. Each remaining question is assigned to the buzzing fold if it has gameplay data; otherwise it is assigned to the guessing fold. Figure 5.3 shows the result of this folding procedure.
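The training-fold assignment just described can be summarized in a short sketch; the field names (qanta_id, has_gameplay) and fold labels are hypothetical stand-ins for the actual dataset schema.

```python
import random

def assign_training_folds(questions, seed=42):
    """Training questions: 80% go to the guessing fold; the remaining 20% go to
    the buzzing fold only if gameplay data exists, otherwise back to guessing."""
    rng = random.Random(seed)
    folds = {}
    for question in questions:
        if rng.random() < 0.8 or not question["has_gameplay"]:
            folds[question["qanta_id"]] = "guess_train"
        else:
            folds[question["qanta_id"]] = "buzz_train"
    return folds

# Example usage with two toy questions.
questions = [
    {"qanta_id": 1, "has_gameplay": True},
    {"qanta_id": 2, "has_gameplay": False},
]
print(assign_training_folds(questions))
```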
C.1.4 Matching QB Answers to Wikipedia Pages

The automatic, rule-based part of this process is composed of two phases: an expansion phase that produces variants of the answer text, and a match phase that determines when one of these variants matches a Wikipedia page. The rules in the expansion phase range from simple exact text match to expanding "The {Master of Flémalle} or Robert {Campin}" into "{Master of Flémalle}" and "Robert {Campin}". In this case, multiple matches result in "Robert Campin" being the answer page: after removing braces, "The Master of Flémalle" redirects on Wikipedia to "Robert Campin", and "Robert Campin" is also an exact match. These rules are incredibly effective at finding answers buried in qb-specific notation, such as the random sample in Table C.1. When matches disagree, we use the match that modified the original answer text the least.

There are inevitably cases where the automatic system fails to find a match or finds the wrong match. Qualitatively, these are often caused by disambiguation errors, such as failing to differentiate between "Guernica" the city and the painting by Picasso, small differences in answer strings, and cases where there is no suitable Wikipedia page.

Original qb Answer                                                            Matched Wikipedia Page
Nora Helmer                                                                   A_Doll's_House
{Gauss}'s law for the electric field                                          No Mapping Found
Thomas Hutchinson                                                             Thomas_Hutchinson_(governor)
linearity                                                                     Linearity
{caldera}s                                                                    Caldera
William Holman {Hunt}                                                         William_Holman_Hunt
{plasma}s                                                                     Plasma_(physics)
{Second Vatican Council} [or {Vatican II}]                                    Second_Vatican_Council
{Jainism}                                                                     Jainism
{Electronegativity}                                                           Electronegativity
Hubert Selby, Jr.                                                             Hubert_Selby_Jr.
(The) Entry of Christ into Brussels (accept equivalents due to translation)   Christ's_Entry_Into_Brussels_in_1889
Depictions of Speech [accept equivalents]                                     No Mapping Found
stress                                                                        Stress_(mechanics)

Table C.1: A random sample of qb answer strings and their matched Wikipedia pages. Answer mappings are easy to obtain accurately since most failures in exact matching are due to qb-specific syntax that can be accounted for by rule-based matching. Combined with manual annotation to find common non-exact matches, this process succeeds on 119,093 of 132,849 questions.

To correct or verify these errors, we (the authors) and skilled members of the qb community (such as tournament organizers and participants from our exhibition matches) manually annotated a significant fraction of the training data and all of the test data.

Rather than manually annotating each question, we begin by defining mappings of answer strings to Wikipedia pages so that when a string occurs multiple times, it does not require manual annotation for every occurrence of that answer in questions. However, this has the serious drawback that if the answer string is ambiguous, it may result in mislabeled answers. To avoid this problem, we design a manual process whereby annotators update three sets of answer-to-Wikipedia mappings: unambiguous, ambiguous, and direct mappings. Unambiguous annotations contain a list of answer strings that, when seen, map to a specific Wikipedia page. As the name implies, we only insert annotations here when the answer unambiguously identifies the corresponding Wikipedia page. Ambiguous annotations similarly contain a list of answer strings, but each is paired with a list of disambiguation words. If the answer string is seen, at least one disambiguation word appears in the question text, and there are no other ambiguous matches, then the answer is mapped. For example, if the answer string is "amazon" and the question contains the word "river," then we assume "Amazon river" is the correct page, while if the question mentions "bezos," the correct page is "Amazon (company)". Finally, direct mappings match the answer for specific questions.
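The following sketch illustrates how the three mapping sets could be applied to resolve an answer string. The data layout and the precedence of direct over unambiguous and ambiguous mappings are assumptions for illustration rather than the exact behavior of our pipeline.

```python
def map_answer(answer, question_text, unambiguous, ambiguous, direct, qanta_id=None):
    """Resolve a qb answer string to a Wikipedia page using the three mapping sets
    described in C.1.4 (hypothetical data layout)."""
    # Direct mappings pin the answer for specific questions (assumed to take precedence).
    if qanta_id is not None and qanta_id in direct:
        return direct[qanta_id]
    # Unambiguous mappings: the answer string alone identifies the page.
    if answer in unambiguous:
        return unambiguous[answer]
    # Ambiguous mappings: require a disambiguation word in the question text
    # and exactly one candidate page whose words match.
    if answer in ambiguous:
        words = set(question_text.lower().split())
        matches = [page for page, cue_words in ambiguous[answer].items()
                   if words & set(cue_words)]
        if len(matches) == 1:
            return matches[0]
    return None  # no mapping found

ambiguous = {"amazon": {"Amazon_river": ["river"], "Amazon_(company)": ["bezos"]}}
print(map_answer("amazon", "name this river of south america", {}, ambiguous, {}))
# -> "Amazon_river"
```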
The last major design decision in this process addresses how we prevent information from the test data from leaking into the training data. The root of the data leak issue is that the distribution of answers between training and test data often results in only approximately 80% of test set answers occurring in the training data. We observed this phenomenon empirically in both our data and the distribution of answers from our numerous exhibition events. If all answer strings are naively combined and then mapped, the training data will be biased towards its answers containing an overabundance of test set answers. A major difference between this and prior versions of the qanta dataset is finding and fixing this issue. We correct this error by separating the answer string pools for training and test questions. Although this results in more annotation work, it avoids information leakage.

While reviewing our annotation procedure, we noticed another source of bias. Recall that we do not exhaustively annotate the training data. In our initial annotation we did not fully annotate the test data, and by doing so introduced a bias towards easier-to-annotate questions in the test set. To eliminate this bias, and to make the setting as similar to playing a qb tournament as possible, we annotated every question in the test set.3

3 Specifically, we either pair each test set answer string with a Wikipedia title or mark it as not having a corresponding Wikipedia title.

C.1.5 Buzzer features

The guesser updates its list of guesses whenever a new word of the qb question is revealed. At each time step, the buzzer extracts features from both the current and all past guesses and predicts whether the current guess is correct. It is important to include past guesses because the dynamics of the guesser's confidence contain strong signal about its correctness: the guesser usually starts with some random guess when little information is provided, then fluctuates between several plausible answers, just as humans do, and finally stabilizes to a single answer, at which point the buzzer should buzz. Below is the full list of buzzer features we use in the experiments (a sketch of this feature computation follows the list):

• Probabilities of the current top 3 guesses
• Change in the top 3 probabilities from the previous step
• Gaps between the probabilities of the top 3 guesses
• Binary indicator of whether each of the top 5 guesses increased its ranking from the previous step
• Mean and variance of the probabilities of the current top 3 guesses
• Mean and variance of the probabilities of the previous top 3 guesses
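As referenced above, this sketch shows how the listed features might be computed from hypothetical guess histories (probabilities sorted from most to least confident, plus answer-to-rank dictionaries for the previous and current steps). It is illustrative rather than the exact feature extractor used in our experiments.

```python
import numpy as np

def buzzer_features(prev_probs, curr_probs, prev_rank, curr_rank):
    """Compute the buzzer features listed above for one time step.
    prev_probs/curr_probs: guess probabilities sorted descending.
    prev_rank/curr_rank: dicts mapping answers to ranks (0 is best)."""
    top3_curr = np.asarray(curr_probs[:3], dtype=float)
    top3_prev = np.asarray(prev_probs[:3], dtype=float)
    feats = list(top3_curr)                                  # current top-3 probabilities
    feats += list(top3_curr - top3_prev)                     # change from the previous step
    feats += [top3_curr[0] - top3_curr[1],                   # gaps between the top-3 guesses
              top3_curr[1] - top3_curr[2]]
    # Did each of the current top-5 guesses improve its rank since the last step?
    top5 = sorted(curr_rank, key=curr_rank.get)[:5]
    for answer in top5:
        improved = curr_rank[answer] < prev_rank.get(answer, len(prev_rank))
        feats.append(float(improved))
    feats += [top3_curr.mean(), top3_curr.var(),             # current top-3 mean and variance
              top3_prev.mean(), top3_prev.var()]             # previous top-3 mean and variance
    return np.array(feats)
```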
Category           N     Percent
Pop Culture        55    40%
History            26    19%
Science            20    15%
Other              13    9.6%
Social Science     7     5.1%
Geography          6     4.4%
Religion           2     1.5%
Literature         5     3.7%
Philosophy         1     0.74%
Fine Arts          1     0.74%
Total w/ Category  136   100%
No Category        14
Total              150

Table C.2: A breakdown of NaturalQuestions example topics using qb categories. Most questions are about pop culture, and the distribution includes many fewer questions about Literature and Fine Arts.

C.2 Natural Questions Categories

Section 5.3.2 analyzes the topical diversity of qb questions and makes comparisons to NaturalQuestions. To compare to NaturalQuestions, which does not have category labels, we annotated a random subset of 150 questions using qb's categories (Table C.2).

Appendix D: Centaur Authorship of Quizbowl Questions

This appendix contains an analysis of automatically created adversarial examples and details of the Studio Ousia model.

D.1 Failure of Syntactically Controlled Paraphrase Networks

✗ Missing Information: "its types include 'frictional', 'cyclical', and 'structural'" → "its types include 'frictional', and structural"
✗ Lost Named Entity: "german author of the sorrows of young werther and a two-part faust" → "german author of the sorrows of mr. werther"
✗ Incorrect Clue: "name this elegy on the death of john keats composed by percy shelley" → "name was this elegy on the death of percy shelley"
✗ Unsuited Syntax Template: "identify this play about willy loman written by arthur miller" → "so you can identify this work of mr. miller"
✓ Verb Synonym: "he employed marco polo and his father as ambassadors" → "he hired marco polo and his father as ambassadors"

Table D.1: Failure and success cases for scpn. The model fails to create a valid paraphrase of the sentence for 97% of questions.

We apply the Syntactically Controlled Paraphrase Network (Iyyer et al., 2018, SCPN) to qb questions. The model operates on the sentence level and cannot paraphrase paragraphs. We thus feed in each sentence independently, ignoring possible breaks in coreference. The model does not correctly paraphrase most of the complex sentences present in qb questions. The paraphrases were rife with issues: ungrammatical, repetitive, or missing information.

To simplify the setting, we focus on paraphrasing the shortest sentence from each question (often the final clue). The model still fails in this case. We analyze a random sample of 200 paraphrases: only six maintained all of the original information. Table D.1 shows common failure cases. One recurring issue is an inability to maintain the correct named entities after paraphrasing. In qb, maintaining entity information is vital for ensuring question validity. We were surprised by this failure because SCPN incorporates a copy mechanism.

D.2 Studio Ousia qb Model

The Studio Ousia system works by aggregating scores from both a neural text classification model and an IR system. Additionally, it scores answers based on their match with the correct entity type (e.g., religious leader, government agency, etc.) predicted by a neural entity type classifier. The Studio Ousia system also uses data beyond qb questions and the text of Wikipedia pages, integrating entities from a knowledge graph and customized word vectors (Yamada et al., 2018b).

Bibliography

2020. Canon - qb wiki. https://www.qbwiki.com/wiki/Canon. Accessed: 2021-03-05.
Firas Abuzaid, Geet Sethi, Peter Bailis, and Matei Zaharia. 2019. To index or not to index: Optimizing exact maximum inner product search. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1250?1261. IEEE. Alekh Agarwal, Olivier Chapelle, Miroslav Dud?k, and John Langford. 2014. A reliable effective terascale linear learning system. Journal of Machine Learning Research, 15:1111?1133. Luis von Ahn. 2006. Games with a purpose. Computer, 39:92 ? 94. Luis von Ahn and Laura Dabbish. 2004. Labeling images with a computer game. In International Conference on Human Factors in Computing Systems. Luis von Ahn and Laura Dabbish. 2008. Designing games with a purpose. Communications of the ACM, 51(8):58?67. Xavier Amatriain and Justin Basilico. 2012. Netflix recommendations: Beyond the 5 stars (part 1). https://netflixtechblog.com/netflix-recommendations-beyond-the- 5-stars-part-1-55838468f429. Accessed: 2021-03-15. National Council on Measurement in Education, Joint Committee on Standards for Educational and Psychological Testing (U.S.) American Educational Research As- sociation, American Psychological Association. 2014. Standards for educational and psychological testing. American Educational Research Association, Washing- ton, DC. Saleema Amershi, Max Chickering, Steven M Drucker, Bongshin Lee, Patrice Simard, and Jina Suh. 2015. ModelTracker: Redesigning performance analysis tools for machine learning. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI ?15, pages 337?346, New York, NY, USA. Association for Computing Machinery. 169 Christine M Anderson-Cook, Kary L Myers, Lu Lu, Michael L Fugate, Kevin R Quinlan, and Norma Pawley. 2019. How to host an effective data competition: Statistical advice for competition design and analysis. Statistical Analysis and Data Mining: The ASA Data Science Journal, 12(4):271?289. Dustin Arendt, Zhuanyi Huang, Prasha Shrestha, Ellyn Ayton, Maria Glenski, and Svitlana Volkova. 2021. CrossCheck: Rapid, reproducible, and interpretable model evaluation. In Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances. Association for Computational Linguis- tics. Sylvain Arlot and Alain Celisse. 2010. A survey of cross-validation procedures for model selection. Statistics surveys, 4:40?79. Ron Artstein and Massimo Poesio. 2008. Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4):555?596. Richard D Arvey, Thomas J Bouchard, John B Carroll, Raymond B Cattell, David B Cohen, Rene V Dawis, and LWillerman. 1994. Mainstream science on intelligence. Wall Street Journal, 13(1):18?25. David Baber. 2015. Television Game Show Hosts: Biographies of 32 Stars. McFar- land. Ricardo Baeza-Yates, Aristides Gionis, Flavio Junqueira, Vanessa Murdock, Vas- silis Plachouras, and Fabrizio Silvestri. 2007. The impact of caching on search engines. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ?07, pages 183?190, New York, NY, USA. Association for Computing Machinery. Frank B Baker. 2001. The Basics of Item Response Theory. ERIC. Solon Barocas, Moritz Hardt, and Arvind Narayanan. 2019. Fairness and Machine Learning. fairmlbook.org. http://www.fairmlbook.org. Max Bartolo, Alastair Roberts, Johannes Welbl, Sebastian Riedel, and Pontus Stenetorp. 2020. Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension. 
Transactions of the Association for Computational Linguistics, 8:662?678. Hannah Bast, Florian B?urle, Bj?rn Buchhold, and Elmar Haussmann. 2014. Easy access to the freebase dataset. In Proceedings of the World Wide Web Conference. Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of the International Conference on Learning Representations. 170 Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language processing: A survey. In Transactions of the Association for Computational Linguistics, pages 49?72. Nicholas J Belkin, Colleen Cool, Adelheit Stein, and Ulrich Thiel. 1995. Cases, scripts, and information-seeking strategies: On the design of interactive informa- tion retrieval systems. Expert systems with applications, 9(3):379?395. Emily M Bender and Batya Friedman. 2018. Data statements for natural lan- guage processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587?604. Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the ACM Conference on Fairness, Accountability, and Transparency. Emily M Bender and Alexander Koller. 2020. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the Association for Computational Linguistics. Association for Computational Linguistics. R. Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Polity Press. James Bennett and Stan Lanning. 2007. The netflix prize. In Proceedings of KDD cup and workshop. Jonathan Berant, Andrew Chou, Roy Frostig, and Percy S. Liang. 2013. Seman- tic parsing on freebase from question-answer pairs. In Proceedings of Empirical Methods in Natural Language Processing. James Bergstra, Daniel Yamins, and David Cox. 2013. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision ar- chitectures. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 115? 123. PMLR. D E Berlyne. 1954. A theory of human curiosity. British journal of psychology, 45(3):180?191. Darse Billings, Denis Papp, Jonathan Schaeffer, and Duane Szafron. 1998. Opponent modeling in poker. In Association for the Advancement of Artificial Intelligence. Eli Bingham, Jonathan P. Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Prad- han, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D. Goodman. 2018. Pyro: Deep Universal Probabilistic Programming. Journal of Machine Learning Research. 171 Abeba Birhane and Vinay Uday Prabhu. 2021. Large image datasets: A pyrrhic win for computer vision? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1537?1547. Edwin Black. 2012. War Against the Weak: Eugenics and America?s Campaign to Create a Master Race, Expanded Edition. Dialog Press. David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993?1022. Su Lin Blodgett, Solon Barocas, Hal Daum?, III, and Hanna Wallach. 2020. Lan- guage (technology) is power: A critical survey of ?bias? in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5454?5476, Online. Association for Computational Linguistics. 
Su Lin Blodgett, Lisa Green, and Brendan O?Connor. 2016. Demographic dialec- tal variation in social media: A case study of African-American English. In Proceedings of Empirical Methods in Natural Language Processing. Avrim Blum and Moritz Hardt. 2015. The ladder: A reliable leaderboard for machine learning competitions. In Proceedings of the International Conference of Machine Learning. PMLR. Daniel G Bobrow. 1964. Natural language input for a computer problem solving system. Technical report. Ond?ej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. pages 1?44. Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075. C. Boyce-Jacino and S. DeDeo. 2018. Opacity, Obscurity, and the Geometry of Question-Asking. ArXiv e-prints. Jordan Boyd-Graber and Benjamin B?rschinger. 2020. What question answering can learn from trivia nerds. In Proceedings of the Association for Computational Linguistics. Association for Computational Linguistics. Jordan Boyd-Graber, Shi Feng, and Pedro Rodriguez. 2018. Human-Computer Question Answering: The Case for Quizbowl. Springer. Jordan Boyd-Graber, Pedro Rodriguez, Nathan Murphy, and R. Hentzel. 2017. Qanta vs. quiz bowl veterans. 172 Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daum? III. 2012. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of Empirical Methods in Natural Language Processing. Laura Briggs. 2003. Reproducing Empire: Race, Sex, Science, and U.S. Imperialism in Puerto Rico. American Crossroads. University of California Press. Carla E Brodley and Mark A Friedl. 1999. Identifying mislabeled training data. The journal of artificial intelligence research, 11(1):131?167. Noam Brown and Tuomas Sandholm. 2019. Superhuman AI for multiplayer poker. Science, 365(6456):885?890. Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Wojciech Gajewski, An- drea Gesmundo, Neil Houlsby, and Wei Wang. 2018. Ask the right questions: Active question reformulation with reinforcement learning. In Proceedings of the International Conference on Learning Representations. Chris Buckley and Ellen M Voorhees. 2000. Evaluating evaluation measure stability. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. Harry Bunt, Jan Alexandersson, Jean Carletta, Jae-Woong Choe, Alex Chengyu Fang, K?iti Hasida, Kiyong Lee, Volha Petukhova, Andrei Popescu-Belis, Lau- rent Romary, Claudia Soria, and David R. Traum. 2010. Towards an ISO stan- dard for dialogue act annotation. In Proceedings of the Language Resources and Evaluation Conference. Harry Bunt, Jan Alexandersson, Jae-Woong Choe, Alex Chengyu Fang, K?iti Hasida, Volha Petukhova, Andrei Popescu-Belis, and David R. Traum. 2012. ISO 24617-2: A semantically-based standard for dialogue annotation. In Proceedings of the Language Resources and Evaluation Conference. Joy Buolamwini and Timnit Gebru. 2018. Gender shades: Intersectional ac- curacy disparities in commercial gender classification. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency, volume 81 of Proceedings of Machine Learning Research, pages 77?91, New York, NY, USA. PMLR. Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. 
Re-evaluating the role of Bleu in machine translation research. In Proceedings of the European Chapter of the Association for Computational Linguistics, Trento, Italy. Associa- tion for Computational Linguistics. B. Barla Cambazoglu, Mark Sanderson, Falk Scholer, and Bruce Croft. 2020. A review of public datasets in question answering research. ACM SIGIR Forum, 54(2). 173 Yang Trista Cao and Hal Daum?, III. 2020. Toward Gender-Inclusive coreference resolution. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4568?4595, Online. Association for Computa- tional Linguistics. L R Caporael. 1986. Anthropomorphism and mechanomorphism: Two faces of the human machine. Computers in human behavior, 2(3):215?234. Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of Empirical Methods in Natural Language Processing. Dallas Card, Chenhao Tan, and Noah A Smith. 2018. Neural models for documents with metadata. In Proceedings of the Association for Computational Linguistics. Association for Computational Linguistics. Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert- Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. 2020. Extracting training data from large language models. Nicholas Carlini and David Wagner. 2017. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec ?17, pages 3?14, New York, NY, USA. Association for Computing Machinery. David Carmel and Elad Yom-Tov. 2010. Estimating the query difficulty for in- formation retrieval. Synthesis Lectures on Information Concepts, Retrieval, and Services, 2(1):1?89. Nicky Case. 2018. How To Become A Centaur. Journal of Design and Science. J Mckeen Cattell. 1915. Families of american men of science: Origin, heredity and performance?. Popular Science Monthly, May, pages 248?262. Xiaoyong Chai, Lin Deng, Qiang Yang, and Charles X. Ling. 2004. Test-cost sen- sitive naive bayes classification. Fourth IEEE International Conference on Data Mining (ICDM?04), pages 51?58. Seth Chaiklin. 2003. The Zone of Proximal Development in Vygotsky?s Analysis of Learning and Instruction, Learning in Doing: Social, Cognitive and Computa- tional Perspectives, page 39?64. Cambridge University Press. Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. Eval- uating question answering evaluation. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 119?124, Stroudsburg, PA, USA. Association for Computational Linguistics. 174 Anthony Chen, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2020. MOCHA: A dataset for training and evaluating generative reading comprehension metrics. In Proceedings of Empirical Methods in Natural Language Processing. Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough exam- ination of the CNN/Daily Mail reading comprehension task. In Proceedings of the Association for Computational Linguistics. Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In Proceedings of the Association for Computational Linguistics. Danqi Chen and Wen-Tau Yih. 2020. Open-Domain question answering. 
In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts. Xilun Chen and Claire Cardie. 2018. Multinomial adversarial networks for multi- domain text classification. In Conference of the North American Chapter of the Association for Computational Linguistics. Xiao Cheng and Dan Roth. 2013. Relational inference for wikification. In Proceedings of Empirical Methods in Natural Language Processing. Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder-decoder ap- proaches. In Proceedings of Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase rep- resentations using RNN encoder-decoder for statistical machine translation. In Proceedings of Empirical Methods in Natural Language Processing. Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. Quac: Question answering in context. In Proceedings of Empirical Methods in Natural Language Processing. Marcus Tullius Cicero. 1914. De Finibus Bonorum Et Malorum. Loeb classical library. W. Heinemann and Macmillan. Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the surprising diffi- culty of natural Yes/No questions. In Conference of the North American Chapter of the Association for Computational Linguistics. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try ARC, the AI2 reasoning challenge. 175 Charles L A Clarke, Gordon V Cormack, and Thomas R Lynam. 2001. Exploiting redundancy in question answering. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ?01, pages 358? 365, New York, NY, USA. Association for Computing Machinery. Cyril Cleverdon. 1967. The cranfield tests on index language devices. In Aslib proceedings. MCB UP Ltd. Cyril W Cleverdon. 1991. The significance of the cranfield tests on index lan- guages. In Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ?91, pages 3?12, New York, NY, USA. Association for Computing Machinery. Adam Coates, Pieter Abbeel, and Andrew Y Ng. 2008. Learning for control from multiple demonstrations. In Proceedings of the International Conference of Machine Learning. Kenneth Mark Colby. 1981. Modeling a paranoid mind. The Behavioral and brain sciences, 4(4):515?534. Ronan Collobert and Jason Weston. 2008. A unified architecture for natural lan- guage processing: deep neural networks with multitask learning. In Proceedings of the International Conference of Machine Learning. Alexis Conneau, German Kruszewski, Guillaume Lample, Lo?c Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sen- tence embeddings for linguistic properties. In Proceedings of the Association for Computational Linguistics. Seth Cooper, Firas Khatib, Adrien Treuille, Janos Barbero, Jeehyung Lee, Michael Beenen, Andrew Leaver-Fay, David Baker, Zoran Popovi?, and Foldit Players. 2010. Predicting protein structures with a multiplayer online game. Nature, 466(7307):756?760. Ann Copestake and Karen Sparck Jones. 1990. 
Natural language interfaces to databases. Knowledge Engineering Review, 5(4):225?249. Charles Corbi?re, Nicolas Thome, Avner Bar-Hen, Matthieu Cord, and Patrick P?rez. 2019. Addressing failure prediction by learning model confidence. In Proceedings of Advances in Neural Information Processing Systems. U.S. Supreme Court. 1927. Buck v. bell. 274. Matt Crane. 2018. Questionable answers in question answering research: Repro- ducibility and variability of published results. Transactions of the Association for Computational Linguistics, 6:241?252. Common Crawl. Statistics of common crawl monthly archives. 176 Bruce Croft. 2019. Approaches to research in IR. Invited Lecture at the 12th European Summer School in Information Retrieval. J Shane Culpepper, Fernando Diaz, and Mark D Smucker. 2018. Research frontiers in information retrieval: Report from the third strategic workshop on information retrieval in lorne (SWIRL 2018). SIGIR Forum, 52(1):34?90. Jeffrey Dalton, Chenyan Xiong, Vaibhav Kumar, and Jamie Callan. 2020. CAsT- 19: A dataset for conversational information seeking. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1985?1988, New York, NY, USA. Association for Computing Machinery. Jeffrey Stephen Dalton, Chen-Yan Xiong, and James P. Callan. 2019. TREC CAsT 2019: The conversational assistance track overview. In Text REtrieval Conference. Hoa Trang Dang, Diane Kelly, and Jimmy Lin. 2007. Overview of the trec 2007 question answering track. In Proceedings of the Text REtrieval Conference. Hoa Trang Dang, Jimmy Lin, and Diane Kelly. 2006. Overview of the trec 2006 question answering track. In Proceedings of the Text REtrieval Conference. Rajarshi Das, Shehzaad Dhuliawala, Manzil Zaheer, and Andrew McCallum. 2018. Multi-step Retriever-Reader interaction for scalable open-domain ques- tion answering. In Proceedings of the International Conference on Learning Representations. Pradeep Dasigi, Nelson F Liu, Ana Marasovi?, Noah A Smith, and Matt Gardner. 2019. Quoref: A reading comprehension dataset with questions requiring coref- erential reasoning. In Proceedings of Empirical Methods in Natural Language Processing. Hal Daum?, Nikos Karampatziakis, John Langford, and Paul Mineiro. 2017. Loga- rithmic time one-against-some. In Proceedings of the International Conference of Machine Learning. Hal Daume III. 2007. Frustratingly easy domain adaptation. In Proceedings of the Association for Computational Linguistics. Matt Davis. 2003. Aoccdrnig to a rscheearch at cmabrigde uinervtisy. https://www.mrc-cbu.cam.ac.uk/people/matt.davis/cmabridge/. Accessed: 2021-03-22. Rafael Jaime De Ayala. 2013. The Theory and Practice of Item Response Theory. Guilford Publications. Janez Dem?ar. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of machine learning research: JMLR, 7(1):1?30. 177 J Deng, W Dong, R Socher, L Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large- scale hierarchical image database. In Computer Vision and Pattern Recognition. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Bhuwan Dhingra, Kathryn Mazaitis, and WilliamW Cohen. 2017. Quasar: Datasets for question answering by search and reading. Emily Dinan, Samuel Humeau, Bharath Chintagunta, and Jason Weston. 2019a. Build it break it fix it for dialogue safety: Robustness from adversarial human attack. 
In Proceedings of Empirical Methods in Natural Language Processing. Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, and Jason Weston. 2019b. Wizard of Wikipedia: Knowledge-powered conversational agents. In Proceedings of the International Conference on Learning Representations. Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith. 2019. Show your work: Improved reporting of experimental results. Association for Computational Linguistics. Ravit Dotan and Smitha Milli. 2020. Value-laden disciplinary shifts in machine learning. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitch- hiker?s guide to testing statistical significance in natural language processing. In Proceedings of the Association for Computational Linguistics. Association for Computational Linguistics. Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. 2019. DROP: A reading comprehension benchmark requir- ing discrete reasoning over paragraphs. In Proceedings of the Association for Computational Linguistics. Jesse Dunietz, Gregory Burnham, Akash Bharadwaj, Owen Rambow, Jennifer Chu- Carroll, and David Ferrucci. 2020. To test machine comprehension, start by defining comprehension. In Proceedings of the Association for Computational Linguistics. Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur G?ney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White- box adversarial examples for text classification. In Proceedings of the Association for Computational Linguistics. 178 F Y Edgeworth. 1888. The statistics of examinations. Journal of the Royal Statistical Society, 51(3):599?635. Bradley Efron. 1994. An introduction to the bootstrap. Chapman & Hall, New York. Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, and Eduard Hovy, editors. 2020. Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems. Association for Computational Linguistics. Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can you unpack that? learning to rewrite Questions-in-Context. In Proceedings of Empirical Methods in Natural Language Processing. Ahmed Elgohary, Chen Zhao, and Jordan Boyd-Graber. 2018. A dataset and base- lines for sequential open-domain question answering. In Proceedings of Empirical Methods in Natural Language Processing. Charles Elkan. 2001. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, IJCAI?01, pages 973?978, San Fran- cisco, CA, USA. Morgan Kaufmann Publishers Inc. Jeffrey L. Elman. 1990. Finding structure in time. Cognitive Science, 14:179?211. Mihail Eric and Christopher D. Manning. 2017. Key-value retrieval networks for task-oriented dialogue. In Proceedings of the Annual SIGDIAL Meeting on Discourse and Dialogue. Kawin Ethayarajh and Dan Jurafsky. 2020. Utility is in the eye of the user: A critique of NLP leaderboards. In Proceedings of Empirical Methods in Natural Language Processing. Association for Computational Linguistics. Allyson Ettinger, Sudha Rao, Hal Daum? III, and Emily M. Bender. 2017. Towards linguistically generalizable NLP systems: A workshop and shared task. In In Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems. 
Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. 2019. ELI5: Long form question answering. In Proceedings of the Association for Computational Linguistics. Shi Feng and Jordan Boyd-Graber. 2019. What can ai do for me: Evaluating machine learning interpretations in cooperative play. In International Conference on Intelligent User Interfaces. Shi Feng, Eric Wallace, and Jordan Boyd-Graber. 2019. Misleading failures of partial-input baselines. In Proceedings of the Association for Computational Linguistics. 179 Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretations difficult. In Proceedings of Empirical Methods in Natural Language Processing. Association for Computational Linguistics. David A. Ferrucci, Eric W. Brown, Jennifer Chu-Carroll, James Fan, David Gondek, Aditya Kalyanpur, Adam Lally, J. William Murdock, Eric Nyberg, John M. Prager, Nico Schlaefer, and Christopher A. Welty. 2010. Building watson: An overview of the deepqa project. AI Magazine, 31:59?79. Jenny Rose Finkel and Christopher D Manning. 2009. Hierarchical Bayesian domain adaptation. In Proceedings of the Association for Computational Linguistics. Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eunsol Choi, and Danqi Chen. 2019. MRQA 2019 shared task: Evaluating generalization in reading compre- hension. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering. Association for Computational Linguistics. Kar?n Fort, Gilles Adda, and K Bretonnel Cohen. 2011. Amazon mechanical turk: Gold mine or coal mine? Computational Linguistics, 37(2):413?420. Apache Software Foundation. Lucene. https://lucene.apache.org/. Batya Friedman, Peter H. Kahn Jr., and Alan Borning. 2008. Value Sensitive Design and Information Systems, chapter 4. John Wiley & Sons, Ltd. Batya Friedman and Helen Nissenbaum. 1996. Bias in computer systems. ACM Transactions on Information and System Security, 14(3):330?347. Hannah Fry. 2018. Hello World: Being Human in the Age of Algorithms. W. W. Norton. Francis Fukuyama. 1995. Confucianism and democracy. Journal of Democracy, 6(2):20?33. Meredith D Gall. 1970. The use of questions in teaching. Review of educational research, 40(5):707?721. Howard Gardner. 2011. Frames of Mind: The Theory of Multiple Intelligences. Basic Books. Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiang- ming Liu, Nelson F Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020a. Evaluating models? local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP. Association for Computational Linguistics. 180 Matt Gardner, Jonathan Berant, Hannaneh Hajishirzi, Alon Talmor, and Sewon Min. 2020b. Question answering is a format; when is it useful? In Proceedings of the Association for Computational Linguistics. Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nel- son F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2018. Allennlp: A deep semantic natural language processing platform. Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daum? Iii, and Kate Crawford. 2018. Datasheets for datasets. 
Steven Gelb, Garland E Allen, Andrew Futterman, and Barry A Mehler. 1986. Rewriting mental testing history: The view from the american psychologist. In Sage Race Relations Abstracts, volume 11, pages 18?31. Sean M Gerrish and David M Blei. 2011. Predicting legislative roll calls from text. In Proceedings of the International Conference of Machine Learning. Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? an investigation of annotator bias in natural language under- standing datasets. In Proceedings of Empirical Methods in Natural Language Processing. Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Scott Wen-tau Yih, and Michel Galley. 2018. A knowledge-grounded neural con- versation model. In Association for the Advancement of Artificial Intelligence. R Girshick, J Donahue, T Darrell, and J Malik. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition. Mark E Glickman and Albyn C Jones. 1999. Rating the chess rating system. Chance, 12. Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the Association for Computational Linguistics. Uri Gneezy and Aldo Rustichini. 2000. Pay enough or don?t pay at all. The quarterly journal of economics, 115(3):791?810. Henry Herbert Goddard. 1920. Human efficiency and levels of intelligence. Princeton University Press. Karan Goel, Nazneen Rajani, Jesse Vig, Samson Tan, Jason Wu, Stephan Zheng, Caiming Xiong, Mohit Bansal, and Christopher R?. 2021. Robustness gym: Uni- fying the NLP evaluation landscape. 181 Harvey Goldstein. 2012. Francis galton, measurement, psychometrics and social progress. Assessment in Education: Principles, Policy & Practice, 19(2):147?158. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org. Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, San- jeev Kwatra, Anu Venkatesh, Raefer Gabriel, and Dilek Hakkani-T?r. 2019. Topical-chat: Towards knowledge-grounded open-domain conversations. In Proceedings of the Annual Conference of the International Speech Communication Association. Karthik Gopalakrishnan, Behnam Hedayatnia, Longshaokan Wang, Yang Liu, and Dilek Hakkani-Tur. 2020. Are neural Open-Domain dialog systems robust to speech recognition errors in the dialog history? an empirical study. In Proceedings of the Annual Conference of the International Speech Communication Association. Clinton Gormley and Zachary Tong. 2015. Elasticsearch: The Definitive Guide. " O?Reilly Media, Inc.". Stephen Jay Gould. 1981. The Mismeasure of Man. W. W. Norton & Company. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. International Journal of Computer Vision, 127:398? 414. Samuel Gratzl, Alexander Lex, Nils Gehlenborg, Hanspeter Pfister, and Marc Streit. 2013. LineUp: visual analysis of multi-attribute rankings. IEEE transactions on visualization and computer graphics, 19(12):2277?2286. Bert F Green, Alice K Wolf, Carol Chomsky, and Kenneth Laughery. 1961. Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference. Association for Computing Machinery. Terry Gross. 2016. 
The supreme court ruling that led to 70,000 forced sterilizations. Fresh Air. Anupam Guha, Mohit Iyyer, Danny Bouman, and Jordan Boyd-Graber. 2015. Re- moving the training wheels: A coreference dataset that entertains humans and challenges computers. In Conference of the North American Chapter of the Association for Computational Linguistics. J P Guilford. 1954. Psychometric methods. McGraw-Hill, New York, NY, US. Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. 2017. On calibration of modern neural networks. In Proceedings of the International Conference of Machine Learning. 182 Nitish Gupta, Sameer Singh, and Dan Roth. 2017. Entity linking via joint encod- ing of types, descriptions, and context. In Proceedings of Empirical Methods in Natural Language Processing. Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Conference of the North American Chapter of the Association for Computational Linguistics. Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-Augmented language model Pre-Training. In Proceedings of the International Conference of Machine Learning. A Halevy, P Norvig, and F Pereira. 2009. The unreasonable effectiveness of data. IEEE intelligent systems, 24(2):8?12. Ronald Hambleton. 1991. Fundamentals of item response theory. Sage Publications, Newbury Park, Calif. Ray Hamel. 2000. Trivia master bio. Accessed: 2021-03-04. Sangdo Han, Jeesoo Bang, Seonghan Ryu, and Gary Geunbae Lee. 2015. Exploiting knowledge base to generate responses for natural language dialog listening agents. In Proceedings of the Annual SIGDIAL Meeting on Discourse and Dialogue. Elizabeth Hardwick. 1967. The Little Foxes revived. The New York Review of Books, 9(11). Stevan Harnad. 1992. The turing test is not a trick: Turing indistinguishability is a scientific criterion. SIGART Bull., 3(4):9?10. Bob Harris. 2006. Prisoner of Trebekistan: a decade in Jeopardy! Zellig S Harris. 1954. Distributional structure. Word, 10(2-3):146?162. He He, Anusha Balakrishnan, Mihail Eric, and Percy Liang. 2017. Learning sym- metric collaborative dialogue agents with dynamic knowledge graph embeddings. In Proceedings of the Association for Computational Linguistics. He He, Jordan Boyd-Graber, and Hal Daum? III. 2016a. Interpretese vs. transla- tionese: The uniqueness of human strategies in simultaneous interpretation. In Conference of the North American Chapter of the Association for Computational Linguistics. He He, Jordan L. Boyd-Graber, Kevin Kwok, and Hal Daum? III. 2016b. Opponent modeling in deep reinforcement learning. In Proceedings of the International Conference of Machine Learning. 183 Behnam Hedayatnia, Karthik Gopalakrishnan, Seokhwan Kim, Yang Liu, Mihail Eric, and Dilek Hakkani-Tur. 2020. Policy-Driven neural response generation for Knowledge-Grounded dialogue systems. In Proceedings of the 13th International Conference on Natural Language Generation. Dan Hendrycks and Kevin Gimpel. 2017a. A baseline for detecting misclassi- fied and out-of-distribution examples in neural networks. In Proceedings of the International Conference on Learning Representations. Dan Hendrycks and Kevin Gimpel. 2017b. Gaussian error linear units (GELUs). Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. 2021. Natural adversarial examples. In Computer Vision and Pattern Recognition. Ralf Herbrich, Tom Minka, and Thore Graepel. 2007. 
Trueskill?: A bayesian skill rating system. In Proceedings of Advances in Neural Information Processing Systems. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of Advances in Neural Information Processing Systems. Jose Hernandez-Orallo. 2020. AI evaluation: On broken yardsticks and measurement scales. In Workshop on Evaluating Evaluation of Ai Systems at AAAI. William Hersh, Andrew Turpin, Susan Price, Benjamin Chan, Dale Kramer, Lynetta Sacherek, and Daniel Olson. 2000. Do batch and user evaluations give the same results? In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval. Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fan- drianto, Jay Han, Matthew Kelcey, and David Berthelot. 2016. Wikireading: A novel large-scale language understanding task over wikipedia. In Proceedings of the Association for Computational Linguistics. Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2016. The goldilocks principle: Reading children?s books with explicit memory representations. In Proceedings of the International Conference on Learning Representations. Lynette Hirschman, Marc Light, Eric Breck, and John D Burger. 1999. Deep read: a reading comprehension system. In Proceedings of the Association for Computational Linguistics. Sepp Hochreiter and J?rgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9:1735?1780. 184 Kenneth Holstein, Jennifer Wortman Vaughan, Hal Daum?, Miro Dudik, and Hanna Wallach. 2019. Improving fairness in machine learning systems: What do industry practitioners need? In International Conference on Human Factors in Computing Systems, New York, NY, USA. Association for Computing Machinery. Mark Hopkins and Jonathan May. 2013. Models of translation competitions. In Proceedings of the Association for Computational Linguistics. Association for Computational Linguistics. Yufang Hou, Charles Jochim, Martin Gleize, Francesca Bonin, and Debasis Ganguly. 2019. Identification of tasks, datasets, evaluation metrics, and numeric scores for scientific leaderboards construction. In Proceedings of the Association for Computational Linguistics. Association for Computational Linguistics. Dirk Hovy and Shannon L Spruit. 2016. The social impact of natural language processing. In Proceedings of the Association for Computational Linguistics. Feng hsiung Hsu, Murray Campbell, and A. Joseph Hoane. 1995. Deep blue system overview. In Proceedings of the 9th International Conference on Supercomputing. Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2019. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. In Proceedings of Empirical Methods in Natural Language Processing. Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference of Machine Learning. Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daum? III. 2014. A neural network for factoid question answering over paragraphs. In Proceedings of Empirical Methods in Natural Language Processing. Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daum? III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics. 