ABSTRACT

Title of dissertation: PREDICTIVE CODING TECHNIQUES WITH MANUAL REVIEW TO IDENTIFY PRIVILEGED DOCUMENTS IN E-DISCOVERY

Jyothi K Vinjumur, Doctor of Philosophy, 2018

Dissertation directed by: Professor Douglas W. Oard, College of Information Studies and UMIACS

In twenty-first century civil litigation, discovery focuses on the retrieval of electronically stored information. Lawsuits may be won or lost because of incorrect production of electronic evidence. Organizations may generate fewer paper documents, but the volume of electronic documents has grown many fold. Litigants face the task of searching millions of electronic records for the presence of responsive and not-privileged documents, making the e-discovery process burdensome and expensive. To ensure that material that must be withheld is not inadvertently revealed, the electronic evidence that is found to be responsive to a production request is typically subjected to an exhaustive manual review for privilege. Although the budgetary constraints on review for responsiveness can be met using automation to some degree, attorneys have been hesitant to adopt similar technology to support the privilege review process. This dissertation draws attention to the potential for adopting predictive coding technology for the privilege review phase of the discovery process.

Two main questions that are central to building a privilege classifier are addressed. The first question seeks to determine which set of annotations can serve as a reliable basis for evaluation. The second question seeks to determine which of the remaining annotations, when used for training classifiers, produce the best results. As an answer, binary classifiers are trained on labeled annotations from both junior and senior reviewers. Issues related to training bias and sample variance due to reviewer expertise are thoroughly discussed. Results show that the annotations that were randomly drawn and annotated by senior reviewers are useful for evaluation; the remaining annotations can be used for classifier training.

A research prototype is built to perform a user study. Privilege judgments are gathered from multiple lawyers using two user interfaces, one of which includes automatically generated features intended to help lawyers make faster and more accurate privilege judgments. A significant improvement in recall was observed when users reviewed with the automated annotations. Classifier features related to the people involved in privileged communications were found to be particularly important for the privilege review task. Results show that there was no measurable change in review time.

Because review cost is proportional to review time, as the final step this work introduces a semi-automated framework that aims to optimize the cost of the manual review process. The framework calls for litigants to make some rational choices about what to manually review. The documents are first automatically classified for responsiveness and privilege, and then some of the automatically classified documents are reviewed by human reviewers for responsiveness and for privilege, with the overall goal of minimizing the expected cost of the entire process, including costs that arise from incorrect decisions. A risk-based ranking algorithm is used to determine which documents need to be manually reviewed.
Multiple baselines are used to characterize the cost savings achieved by this approach. Although the work in this dissertation is applied to e-discovery, similar approaches could be applied to any case in which retrieval systems have to withhold a set of confidential documents despite their relevance to the request. PREDICTIVE CODING TECHNIQUES WITH MANUAL REVIEW TO IDENTIFY PRIVILEGED DOCUMENTS IN E-DISCOVERY by Jyothi K Vinjumur Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy Spring, 2018 Advisory Committee: Professor Douglas W Oard (Chair) Associate Professor Hal Daumé III (Dean’s Representative) Assistant Professor Vanessa Frias-Martinez Assistant Professor Beth St. Jean Dr. Fabrizio Sebastiani, National Research Council of Italy © Copyright by Jyothi Keshavan Vinjumur 2018 Dedication To mom and dad, Arjun and Kiran. ii Acknowledgments I owe my gratitude to all the people who have made this thesis possible and because of whom my graduate experience has been one that I will cherish forever. First and foremost I want to thank my advisor Douglas W Oard. It has been a great journey working with him for the last few years. Among many lessons, he has taught me how good research is done. I would like to express my special appreciation to him for letting me make mistakes and grow as a researcher. His timely advice on research, teaching and career tips have been invaluable. I am grateful to have had a few fun filled opportunities to know him personally. Above all, I owe him my deepest gratitude for boosting my confidence by helping me realize my strengths and improve on my weakness during my PhD journey. I would like to thank Beth St Jean, Fabrizio Sebastiani, Hal Daumè III and Vanessa Friaz-Martinaz for accepting the invitation to serve on my dissertation committee, and for the suggestions they provided during my thesis proposal and my dissertation work. I would like to extend my gratitude to Fabrizio Sebastiani for his collaboration, co-authorship and research guidance on a few of the major contributions in my disserta- tion. I thank Jiaul Paik and Amittai Axelrod, for giving me an opportunity to learn and collaborate with them. I am honored to have had the opportunity to meet multiple e-discovery experts along the way. My first thanks goes to Jason Baron whom I look up too with great admiration. He has always made me feel that both me and my work has great potential and has never missed an opportunity to introduce me to many of his colleges and acquaintances with great pride. I would like to thank David Lewis for letting me sit-in in one of his courses he offered in Georgetown University and providing me valuable inputs from time iii to time. I would thank Maura Grossman for taking time to meet me at her office in NYC to discuss the privilege review interface design. I extend my gratitude to all the lawyers who participated in my study. My work during my PhD journey was supported in part by NSF awards 1065250 and 1618695. I would like to thank NSF for supporting me and SIGIR travel grant for providing the financial assistance for conference travel. I thank Maarten de Rijke for giving me the opportunity to work with him and his wonderful team of researchers at UvA. My stay at Amsterdam was an absolute delight because of the awesome people I met there. 
I owe my sincere thanks to David Graus, Zhaochun Ren, Marlies van der Wees, Manos Tsagkias, Tom Kenter, Fei Cai, Anne Schuth and Richard Berendsen. I would like to thank Hans Henseler for giving me an invitation to attend an e-discovery symposium in Amsterdam and learn about the difference of this domain in a different continent. I am grateful to Amanda Jones to have given me an opportunity to work with her and her awesome team at H5 during the summer of 2016. The members of the Information Retrieval Research group in CLIP lab have con- tributed immensely to my personal and professional time at UMD. The group has been a source of friendships as well as good advice and collaboration. I am especially grateful for William Webber, Ning Gao, Mossaab Bagdouri, Rashmi Sankepally, Jiaul Paik from CLIP Lab and Camli Badrya a graduate student in Aerospace Engineering Department. From William, I learned the importance of maintaining a detailed record of my experiments as a script that I can re-run in the future. I am grateful to Ning Gao, Mossaab Bagdouri, Rashmi Sankepally, Jiaul Paik and Petra Galuscakova for suggestions they provided on various occasions especially during my practice talks for conferences, dissertation proposal and dissertation defense. iv I would like to thank June Ahn for encouraging me to talk to multiple professors in the iSchool and explore the area of research that interests me the most. Without his encouragement I would not have met my advisor Doug Oard. I would like to thank Ben Shneiderman and Catherine Plaisant for giving me an opportunity to work with them during summer 2012 and letting me know everything about the iSchool and about the PhD program offered in the iSchool. I extend my gratitude to Reinhard Radermacher and Vikrant Aute who supported and encouraged me to start my PhD program in part-time while I work full-time as a Faculty Research Assistant in the Mechanical Engineering Department. I would like to thank all the friends I made and their families for making the duration of my life in Maryland to be one of the best in this country so far. A special thanks to my husband, Arjun for supporting me throughout this experi- ence. To my little darling daughter Kiran, I would like to express my thanks for keeping me company during the final year of my PhD. v Table of Contents List of Tables ix List of Figures x List of Abbreviations xi 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.1 Predictive Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.2.2 Manual Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2.3 Predictive Coding with Manual Review . . . . . . . . . . . . . . . . 11 1.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2 Background 15 2.1 E-Discovery and Privilege Review . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2 Test Collection Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2.1 TREC Legal Track Collection . . . . . . . . . . . . . . . . . . . . . . 19 2.2.2 Topics & Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.3 Manual Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 2.4 Interactive Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
24 2.5 Predictive coding with cost-sensitive learning . . . . . . . . . . . . . . . . . 25 3 Predictive Coding 28 3.1 Test Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.1.1 Stratified Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.1.2 Privilege Assessments . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.2 Evaluation Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 Classifier Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.3.1 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3.1.1 Graph Model . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3.1.2 Content Model . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.3.2 Evaluation Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.3.2.1 Point Estimate . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.3.2.2 Confidence Intervals . . . . . . . . . . . . . . . . . . . . . . 40 vi 3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.4.1 Test Collection Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.4.2 Expertise and Sample Bias in Classifier Results . . . . . . . . . . . . 45 3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4 Manual Review 49 4.1 Problem Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.1.1 Privilege Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.1.2 Document Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.2 The AID System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.1 Propensity Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.2 Person Role Annotation . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.2.3 Organization Type Annotation . . . . . . . . . . . . . . . . . . . . . 58 4.2.4 Content Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2.5 Temporal Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.2.6 User Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.2.7 Study Participants and Procedure . . . . . . . . . . . . . . . . . . . 63 4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.3.1 Selecting a Benchmark for Evaluation . . . . . . . . . . . . . . . . . 65 4.3.2 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.3.3 Speed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.3.4 Usability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.3.5 Usefulness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5 Predictive Coding With Manual Review 71 5.1 Problem Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.2 Fully Automated baseline model . . . . . . . . . . . . . . . . . . . . . . . . 73 5.3 Fully Manual baseline model . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.4 Our MINECORE model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 5.4.1 Document Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 5.4.2 Algorithm & Evaluation Plan Overview . . . . . . . . . . . . . . . . 86 5.5 Other baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
88 5.5.1 Uncertainty Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.5.2 Relevance Ranking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 5.5.3 Active Learning via Uncertainty Sampling . . . . . . . . . . . . . . . 90 5.5.4 Active Learning via Relevance Sampling . . . . . . . . . . . . . . . . 90 5.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 5.6.1 Test Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.6.2 The learning algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 93 5.6.3 Cost structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.6.4 Experimental protocol . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 vii 6 Conclusions 105 6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110 6.1.1 System Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . 110 6.1.2 Practical Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.1.3 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 6.4 Implications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 Appendices 117 .1 Appendix A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 .2 Appendix B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121 Bibliography 127 viii List of Tables 3.1 TA adjudication rates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.2 Training Families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.3 Separation of email data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 3.4 Contingency Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.5 Overturn rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.1 TREC 2010 privilege judgments (For training and review) . . . . . . . . . . 54 4.2 Contingency table; for review of same families by S1 & S2) . . . . . . . . . 64 4.3 QUIS Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.1 Contingency table D (a) and cost matrix Λm (b) for our problem. . . . . . 74 5.2 Cost structure values in US$. . . . . . . . . . . . . . . . . . . . . . . . . . . 95 5.3 Results obtained from CostStructure1 . . . . . . . . . . . . . . . . . . . . . 96 5.4 Results obtained GPOL(as R)-CCAT(as P) class pair . . . . . . . . . . . . 100 5.5 Results from all cost structures . . . . . . . . . . . . . . . . . . . . . . . . . 100 ix List of Figures 1.1 E-discovery process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Dissertation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 3.1 Re-sampling Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.2 Train-Set and Test-Set Split Procedure . . . . . . . . . . . . . . . . . . . . . 34 3.3 Sample Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.4 Actor variants in emails . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.5 Content-centric information in emails . . . . . . . . 
38
3.6 Recall, a4 ablated, random adjudication 42
3.7 Recall, a4 ablated, all adjudication 42
3.8 Precision, a4 ablated, all adjudication 44
3.9 Effect of Annotator Expertise on Training 46
3.10 Analysis of Classifier Privilege Predictions 47
4.1 Our depiction of Privileged Communication Network 52
4.2 Missing Person Score Algorithm 55
4.3 Privileged Email 57
4.4 Indicative terms 59
4.5 The AID system 61
4.6 The Baseline system 61
4.7 User study procedure 62
4.8 S1 and S2 Judgments by type 64
4.9 Evaluation - S1 judgments as Benchmark 67
5.1 MINECORE Framework Overview 77
5.2 Phase 1 of the MINECORE Framework 79
5.3 Phase 2 of the MINECORE Framework 79
5.4 Phase 3 of the MINECORE Framework 83
5.5 Model Parameters 85
5.6 Overall costs with CostStructure1 as input 98
5.7 Percentage increase in the overall cost 104

List of Abbreviations

AID Avoiding Inadvertent Disclosure
ALvRS Active Learning via Relevance Sampling
ALvUS Active Learning via Uncertainty Sampling
AS Adjudicated Set
EDRM Electronic Discovery Reference Model
ESI Electronically Stored Information
FA Fully Automatic
FRCP Federal Rules of Civil Procedure
FM Fully Manual
MINECORE MINimizing the Expected COsts of REview
NAS Non-Adjudicated Set
NIST National Institute of Standards and Technology
QUIS Questionnaire for User Interaction Satisfaction
RM Risk Minimization
RR Relevance Ranking
TA Topic Authority
TAR Technology Assisted Review
TREC Text Retrieval Conference
UR Uncertainty Ranking

Chapter 1: Introduction

Civil litigation in a United States jurisdiction is a legal dispute between two or more parties, each of which holds the right to request relevant evidence from the other. The term discovery refers to the stage of the dispute in which one party can request evidence (documents, tapes, etc.) from the other party or parties. The party requesting documents during the discovery process is called the requesting party, and the party responsible for producing documents in response to the request is called the producing party. Although the question of what qualifies as relevant is up for debate in a court of law, the producing party is required to perform some kind of review so that it provides documents that are relevant, or responsive (in legalese), to the litigation requests and that are not subject to a claim of privilege (e.g., attorney-client privilege). During the discovery phase, the resulting transfer of documents from the producing party to the requesting party is referred to as production.
In the year 1989, a temporary restraining order to preserve a collection of Electronically Stored Information (ESI) was granted in a court in Washington, DC. The ESI had been shared between members of the National Security Council in the Executive Office of the President of the United States [3]. The basis for this order was the claim that electronic messages could constitute evidence of activity in an organization. As a consequence of this event, on December 6th, 2006, the Federal Rules of Civil Procedure (FRCP) amended the traditional discovery process to address the discovery of all ESI. This amendment to the FRCP led to the term "electronic" preceding the word "discovery," giving rise to a legal process called "Electronic Discovery" or "E-Discovery." Since then, identifying and retrieving relevant documents from large collections of electronic records while withholding privileged documents during production has been a common practice in civil litigation.

The different stages of the e-discovery process are illustrated in Figure 1.1. E-discovery begins when a producing party is required to produce ESI for the requesting party from sources that it identifies as reasonably accessible. The production request is followed by collection identification, pre-processing, and filtering before the manual review and analysis phase. In practice, producing parties conduct a linear manual review of all responsive documents to assert privilege over some subset of them and withhold confidential content before the final phase of document production.1 The process concludes when the producing party produces all the responsive and not-privileged documents to the requesting party. During this process, the cost of e-discovery is incurred at every stage: (1) locating the potential sources of ESI that collectively make up a searchable collection of electronic documents, (2) pre-processing the collection of ESI and classifying the potentially responsive documents during the filtering stage, and (3) manually reviewing documents to identify responsive documents to be produced and privileged or confidential information to be withheld. Prior studies have shown that the majority of the cost in e-discovery is due to the manual review of documents for responsiveness and privilege (typically about 73 percent) [59].

Thus, it is the manual review phase of the e-discovery process that this dissertation addresses. We aim to introduce predictive coding techniques to identify privileged documents. We develop algorithms and evaluation measures, and we conduct a user study. We conclude this dissertation by focusing on techniques to reduce review cost. The next section details the dissertation design, with research questions that explain how we approach the problems of privilege review in e-discovery.

1There are several grounds on which documents might properly be withheld from production, some of which are referred to as privilege while others go by other names (e.g., the Attorney Work-Product Doctrine). For convenience, we group them together and refer to them collectively as privilege.

Figure 1.1: E-discovery process (production request, collection, processing and filtering, responsiveness review, privilege review, production)

1.1 Motivation

The document processing and filtering stage in e-discovery (see Figure 1.1) concentrates on controlling the document count, which affects manual review cost and review time.
Due to the exponential growth in digital content, exhaustive manual review can become nearly impossible. This has led to the introduction of a number of techniques for Technology-Assisted Review (TAR), which can be defined as a set of automated techniques that support legal professionals who need to perform an e-discovery review. These automated techniques are also called predictive coding techniques. One of the earliest articles to describe anything akin to predictive coding was by Anne Kershaw [42]. She described a study that compared a human review team against a document assessment system: while the humans identified only 51% of relevant documents, the system identified more than 95%. The technology her article described was not what we think of today as predictive coding, because it lacked the sophisticated statistical techniques used to determine which documents are relevant. However, her analysis was an early step toward TAR's eventual refinement.

The next turning point came in the year 2006. That year, the Text Retrieval Conference (TREC), an evaluation venue started in 1992 by the National Institute of Standards and Technology (NIST) to study information retrieval techniques, launched the TREC Legal Track, devoted to the use of search and information retrieval in e-discovery. Its annual research projects provided critical evidence of the efficacy of these techniques in e-discovery. Two e-discovery researchers, Maura R. Grossman and Gordon V. Cormack, analyzed data from the 2009 TREC Legal Track involving the use of predictive coding processes. They concluded that predictive coding was not only more effective than human review at finding relevant documents, but also much cheaper [36]. Their study found that the use of predictive coding techniques produced almost a 50-fold savings in cost over manual review. It is thus becoming increasingly common to perform predictive coding during the e-discovery process, and the use of predictive coding techniques has revolutionized the filtering process used to identify responsive documents. However, the use of predictive coding techniques for the privilege review stage is still less common.

There are at least two factors causing this difference between the review for responsiveness and the review for privilege. The first is the observed practice of relevance review being performed before privilege review; this is done because it reduces the number of documents that must be reviewed for privilege, thus rendering a linear manual review for privilege more affordable. Second, the failure to detect and properly withhold a privileged document might incur more serious consequences for the party performing the review than would the failure to detect a relevant document determined not to be privileged. Hence legal professionals are less inclined to adopt fully automated techniques for conducting privilege review. In this work, we take an initial step toward assuring them that adopting predictive coding techniques to perform privilege review is a rational choice. Although we agree that there are cases in which fully manual review is the best choice, we argue that there exist cases where reliance on some degree of automation is a good choice. This work researches multiple ways to assist e-discovery practitioners in making these choices and in explaining those choices once they have been made.
Thus the main question here is not just about what the technology is able to do, or how legal professionals use what the technologists build, but also about how legal professionals could use the technology to ensure ESI production at a proportionate cost.

Figure 1.2: Dissertation Overview (annotations from domain experts, the TREC 2010 Legal Track test collection, and the Reuters collection feed relevance and privilege classifiers; their posterior probabilities drive document ranking, manual and classifier decisions are combined with a cost matrix to produce final decisions and an overall cost, and research questions RQ1-RQ5b are mapped onto these components)

1.2 Research Questions

The overall design of this dissertation can be divided into three main components: (1) the Predictive Coding component (Chapter 3), (2) the Manual Review component (Chapter 4), and (3) the Predictive Coding with Manual Review component (Chapter 5). Figure 1.2 graphically illustrates the overall design of this dissertation, with pointers showing where our research questions fit in. The Predictive Coding component concentrates on building a probabilistic classifier to identify privileged documents. The Manual Review component aims to aid the manual privilege review process by utilizing features from the probabilistic classifier. The Predictive Coding with Manual Review component empirically demonstrates the efficacy that can be achieved by our semi-automated system. We next discuss the research questions answered in each of the three main components listed above.

1.2.1 Predictive Coding

Most e-discovery vendors today are adopting automation to classify documents during the responsiveness review phase based on input from reviewers. This automation is employed in an effort to expedite the process of filtering the documents in the collection. However, the filtering process during the privilege review phase is mostly done manually by domain experts. Our main goal in this component of the dissertation is to build a binary classifier to identify privileged documents. To build a binary classifier, we need a test collection with privilege judgments that can be used both for training and for evaluation. Thus, before building a binary classifier, we start by thinking about the evaluation plan. In 2010, the TREC Legal Track released a test collection with privilege judgments for email communications; the document collection was derived from the Enron email collection. Since that collection is the only public test collection available for conducting e-discovery research, the first research question we ask is:

RQ1: Is it possible to create a labeled test set to enable unbiased classifier evaluation?

Evaluation of predictive coding systems depends on test collections and their document judgments. During the TREC 2010 Legal Track, multiple teams submitted a total of five systems. The results from those five systems were grouped to create a total of 32 categories. There are multiple ways to choose the documents from those categories that need to be manually judged. In TREC 2010, results from the submitting teams were pooled to gather samples; samples were drawn from those categories using a procedure called stratified sampling to obtain manual judgments. A procedure called adjudication was used to escalate judgments on the sample to a senior assessor (an expert) for arbitration.
Hence the judgments in the resulting collection were of two types: (1) a small number of documents had judgments from both senior and junior (non-expert) assessors, and (2) most of the documents had judgments only from a junior assessor. To evaluate our privilege classifiers fairly, we need to build an unbiased test set from the judgments in the TREC 2010 Legal Track collection. To create this unbiased set with senior assessors' judgments2 for evaluation purposes, we need to eliminate the selection bias introduced by the appeal process during TREC 2010. We eliminate that bias by re-sampling from the stratified document categories while maintaining the sampling probabilities for each stratum. The procedure resulted in a total of 252 document families3 that serve as a gold standard for evaluation. Although we were able to create a held-out test set for evaluation purposes, the use of the stratified approach for document selection during the TREC Legal Track raised two more concerns. The next research question we ask is:

RQ2: Are the privilege judgments obtained from the TREC 2010 Legal Track collection reliable and reusable?

The first issue, which we refer to as reliability, is that different manual assessors may reach different judgments for the same document. The second concern, reusability, is that new systems could find some documents that did not contribute to the selection process; assuming these new documents are not relevant might adversely affect system comparisons. In the TREC 2010 Legal Track collection, multiple manual assessors with different levels of expertise were involved. For reliability, the key question is to determine the extent to which privilege judgments correctly reflect the opinion of the senior assessor, whose judgment is authoritative. For reusability, the key question is to determine the degree to which systems whose results contributed to the creation of the test collection can be usefully compared with other systems that use those privilege judgments in the future. These correspond to measurement error and sampling error, respectively. We performed set-based evaluation for privilege classification using a held-out set of families as a test set drawn by stratified sampling, with each stratum defined by the overlapping classification results from the different participating systems. We examine the impact of unmodeled assessor errors on evaluation results and show recall-precision graphs with confidence intervals on the held-out test set. Our results indicate that measurement errors by junior assessors are sufficiently large to require their exclusion from the test set if reliable system comparisons are to be made.

2The senior assessor's judgments were always considered the gold standard.
3In this context, a "document family" (a legal term) refers to an email message plus all of its attachments.
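To make the estimation step concrete: Chapter 3 develops the estimators and confidence intervals actually used in our experiments. Purely as an illustrative sketch of the underlying idea, with hypothetical field names and a within-stratum bootstrap standing in for the interval construction used later, recall can be estimated from such a stratified sample by weighting each judged family by the inverse of its stratum's sampling probability:

```python
import random
from collections import defaultdict

def estimate_recall(sample, n_boot=1000, seed=0):
    """Estimate system recall from a stratified sample of judged families.

    `sample` is a list of dicts with hypothetical keys:
      'stratum'    - identifier of the stratum the family was drawn from
      'weight'     - inverse sampling probability (stratum size / sample size)
      'privileged' - gold-standard judgment (True/False)
      'retrieved'  - whether the system under evaluation returned the family
    Returns a point estimate and a 95% bootstrap confidence interval.
    """
    def point_estimate(docs):
        tp = sum(d['weight'] for d in docs if d['privileged'] and d['retrieved'])
        pos = sum(d['weight'] for d in docs if d['privileged'])
        return tp / pos if pos > 0 else 0.0

    # Resample within each stratum so the bootstrap respects the sampling design.
    by_stratum = defaultdict(list)
    for d in sample:
        by_stratum[d['stratum']].append(d)

    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = []
        for docs in by_stratum.values():
            resample.extend(rng.choices(docs, k=len(docs)))
        estimates.append(point_estimate(resample))
    estimates.sort()
    lower, upper = estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]
    return point_estimate(sample), (lower, upper)
```

Precision can be estimated in the same way by conditioning on the retrieved families instead of the privileged ones.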
Findings from RQ2 revealed inconsistencies in estimating absolute measures, particularly recall, when junior assessor judgments are used. If uncorrected junior assessor judgments were a small fraction of the total judgments, this would be a smaller problem; but in the TREC 2010 Legal Track collection, uncorrected junior assessor judgments account for about 92% of the sampled documents. Hence the next questions we ask are:

RQ3a: Are the judgments from junior assessors useful for classifier training?

RQ3b: How does the process of selecting training documents (judged by senior assessors, junior assessors, or both) affect classifier performance?

During the creation of the TREC 2010 Legal Track collection, the relevance judgment gathering process followed a two-stage assessment procedure, whereby an initial relevance assessment was made by junior reviewers for each document in the evaluation sample. A portion of the initially assessed documents were escalated (based on some criteria) to the senior assessor to obtain a final judgment. This two-stage assessment procedure created a selection bias that affects both the training and the testing of our classifiers. Traditionally, the data used to build a classifier comes from multiple datasets: the classifier is first trained on a set of labeled documents called the training set, validation sets can be used for parameter tuning, and the test set is used to provide an unbiased evaluation of the final classifier fit on the training set. To answer the two research questions stated above, we build and evaluate our binary classifiers using multiple training sets and a single held-out test set. In RQ3a we study the effect of using the large number of junior assessor judgments for training our classifier; although some documents in this training set may also have the senior assessor's judgment, we consider only the junior assessor's judgment to answer this question. To answer RQ3b we build multiple binary classifiers using both content and metadata features. We study the effect of classifier training on (1) multiple annotator types (expert and non-expert annotators) for the same sample and (2) multiple training sets (with and without selection bias). We evaluate our binary classifiers using a held-out test set with senior assessor judgments. The findings show that a larger unbiased training set labeled by a number of junior annotators is about as useful as a smaller biased training set created by a senior annotator. We thus conclude that the use of labeled sets from both junior and senior annotators together can be justified for training (although not for testing) the classifier. By building predictive coding models to identify privileged documents, we evaluate the efficacy of adopting predictive coding techniques. Our classifier achieves better recall than precision; since recall is the more important measure during privilege review in e-discovery, this is a promising result. We next concentrate on building an interface to determine whether automation can aid the privilege review process, especially to avoid inadvertent disclosure of a privileged document.

1.2.2 Manual Review

As manual annotation during privilege review in e-discovery is inevitable, we now seek to build a positive synergy between automation and manual review. It is important to know who uses our system, and even more so to understand what it is that they are looking for. The motivating factor for designing a user study around a research prototype was to determine whether the use of automation can help lawyers perform the privilege review task. We study what types of visible clues, offered as predictive coding assistance, could help manual reviewers during the review phase. The objective here is to investigate the extent to which the use of automation (in the form of highlighting potentially useful features and patterns drawn from metadata and content information) can benefit the manual privilege review process.
To this end, we build a research prototype: an interactive system to support privilege review whose objective is to improve the speed and accuracy of the manual privilege review process. At the end of the user study we conducted a semi-structured interview to understand which specific features were most beneficial for performing the task. This led to the following three sub-questions:

RQ4a: Does the accuracy of the manual reviewers' privilege judgments improve when system-generated features are presented during privilege review?

Attorney-client privilege exists when the attorney and the client communicate in confidence about an active litigation. The first task we consider is to identify the actors: who the client is and who the attorney is. We study the relationship of the actors in the email communication, identify organization information when it is available, use the email content to understand the context of the communication, and finally utilize the time of the communication. Our system generates useful metrics (discussed in detail in Chapter 4) from the information provided in the email family in order to provide visual cues to the manual reviewer during review. We then ask lawyers to label each email family as Privileged, Not-privileged, or Unsure. We perform a hypothesis test by providing the lawyers with two interfaces: one without any automation as a baseline condition, and the other with automation as a treatment condition. We evaluate the accuracy of each reviewer's judgments by taking one senior attorney's judgments as the gold standard.

RQ4b: Does the manual reviewer's review speed improve when system-generated features are presented during privilege review?

A substantial amount of the cost in e-discovery results from the process of manual review for privilege. Because of the high stakes in the privilege review process, review for privilege is usually performed by senior attorneys; as a result, privilege review costs more than review for responsiveness (which is usually performed by junior lawyers). We measure review speed as an indirect measure of review cost, since attorneys are usually billed by the hour; in e-discovery, manual review time is proportional to manual review cost. To answer RQ4b, we study whether users perform the review faster in the presence of our system-generated features. We do this by recording the time-stamp of each event the user performs and the duration each user spends reviewing each email message. We run statistical tests to determine the difference in average review speed: we carried out a paired t-test across the baseline interface and our treatment interface to compare the average speed of the privilege review task over the two sets of observations. Our findings reveal no significant difference in review speed.
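A minimal sketch of that statistical comparison, assuming per-participant mean review times (the numbers below are illustrative placeholders, not measurements from our study):

```python
from scipy import stats

# Mean seconds per email family for each participant, paired by participant.
# Placeholder values for illustration only.
baseline_times = [212.4, 185.0, 240.7, 198.3, 175.9, 220.1]
treatment_times = [205.8, 190.2, 231.5, 202.0, 168.4, 226.9]

# Paired t-test: each participant serves as their own control across interfaces.
t_stat, p_value = stats.ttest_rel(baseline_times, treatment_times)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A p-value above the chosen significance threshold is consistent with finding
# no significant difference in review speed between the two interfaces.
```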
Quantitative results indicate that substantial and statistically significant improvements in recall were achieved. The scope of the research questions thus far were limited to the privilege review phase. However, review time is a factor that applies to both responsiveness and privi- lege review phase. In our next two research questions, we model the use of predictive coding system for the entire e-discovery review process. We consider both the review for responsiveness and the review for privilege. The main objective for the next two research questions is to understand how predictive coding can aid in making the e-discovery pro- cess more effective at the lowest possible incremental cost. We run our experiments on a different test collection to avoid the sampling challenges encountered during the creation of the TREC 2010 Legal Track collection. 1.2.3 Predictive Coding with Manual Review All the above research questions are specifically aimed to tackle the privilege review phase in e-discovery. Our initial questions RQ1, RQ2, RQ3a and RQ3b aim at building a predictive coding system to identify privileged documents while RQ4a, RQ4b and RQ4c 11 concentrate on manual privilege review. Findings from RQ4b motivate our last couple of research questions. Findings from RQ4b, reveal that having a lawyer look at the documents with features generated by our algorithm yields no improvement in review time. As time is directly proportional to money during privilege review, this process can be quite expensive. Consequently, we can infer that a fully manual review is not sustainable. Thus we aim to develop a semi-automated system where we utilize the manual reviewer’s time only when it is cost-effective. The analysis addresses a ternary classification problem. We propose a semi-automated system whose goal is to identify, within a set of documents D, the documents that are at the same time (a) responsive to a certain topic, and (b) non-privileged. Documents that are both responsive and non-privileged should be produced by the producing party to the requesting party; documents that are responsive and privileged should be declared in a privilege log; non-responsive documents should be withheld. We aim to make the review process of e-discovery more efficient. Using a fully manual system incurs huge amounts of manual annotation cost during review. Hence our goal is to involve the human only when the document review cost is smaller than the expected cost of accepting the decision of the automatic classifier (we call this as risk). We aim to achieve a reduction in cost by using a semi-automated system where-in we make use of a predictive coding system to automatically classify all the documents in the test set and use the reviewers to manually check the label for a document only when the risk involved in accepting the decision of the automatic classifier exceeds the review cost. To understand how our semi-automated system can improve the efficacy of the review process, we first develop a ranking algorithm to determine which documents in the test set need a manual review. We then compute the overall expected cost of the review process. In our semi-automated system, the documents that are not manually labeled by the reviewers, use the classifier predictions as labels. Hence we have two types of review costs; (1) Cost incurred due to manual review and (2) Cost incurred due to classifier misclassifications. 
To model the cost of misclassification error, we quantify the different e-discovery outcomes in terms of liability cost. Our input cost structure is formed on the premise that some mistakes are more severe than others; moreover, if the probability of making a given type of mistake is small, the expected cost of making any one decision will also be small. To compare our system's performance with other effective baselines, we develop a linear evaluation function in which the total expected cost of the review is simply the sum of the expected costs of each outcome. One of the research questions we ask in this component is:

RQ5a: Which documents need to be manually reviewed?

We ask this question because we know that there are cases where adopting some degree of automation is the best solution. By answering this question, we aim to help e-discovery practitioners decide when, and to what extent, to adopt automation during privilege review. The classifier model we aim to build balances the cost of review against the risk of compromising a privileged document. Our ranking algorithm utilizes the posterior probabilities and the cost of making a mistake. Unlike traditional classification settings, the outcomes of our classifier vary significantly in terms of prediction errors; i.e., some types of classification errors are considered more acceptable than others. We first quantify each type of prediction error as a representation of liability cost, mapping the misclassification errors to cost values. We then develop a risk-based ranking algorithm to determine which documents need to be reviewed by a human, depending on the expected cost associated with each document. If the expected cost of accepting the decision of the automatic classifier is higher than the cost of manually reviewing that document, then it would be rational to manually review that document; conversely, if the cost of reviewing a document exceeds that expected cost, then it would be rational not to review it manually. The approach we take is to run the classifier on every document, sort the documents in decreasing order of expected cost, and then manually review documents from the top of the list until we reach the first document for which the expected cost is less than the cost of reviewing that document. We next ask our final research question:

RQ5b: Does our semi-automated system yield a lower overall expected cost when compared to other baseline models?
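The linear evaluation function mentioned above can be written out as follows; the notation here is illustrative (a single per-document review cost and one cost matrix), while Chapter 5 defines the actual cost structures used in our experiments.

```latex
% D_m: documents sent to human reviewers, at per-document review cost c_m
% D_a: documents whose automatic label \hat{y}_d is accepted
% p(y \mid d): classifier posterior;  \lambda(\hat{y}, y): misclassification cost
\[
  \mathbb{E}[C] \;=\; \sum_{d \in D_m} c_m
  \;+\; \sum_{d \in D_a} \sum_{y} p(y \mid d)\,\lambda(\hat{y}_d, y)
\]
% A document is worth reviewing exactly when its expected misclassification
% cost (the inner sum) exceeds c_m, which yields the ranking rule above.
```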
In Chapter 3, we discuss the research question related to building a classifier, evaluation of the collection and highlight with detailed experiments the drawbacks of the TREC Legal Track 2010 test collection. We provide a fix for the drawbacks and describe the use of the test-collection. Then in Chapter 4, we attempt to seek users’ (lawyers) help to determine how our work could help them perform the task of privilege review better. We design and conduct a user study and a semi-structured interview with 6 legal professionals. Next in Chapter 5, we introduce our risk-minimization framework to show when and which document in the collection needs human input. We define and develop six baseline models and compare all the models with our model. We conclude in Chapter 6, with experimental limitations, looking to future directions, and articulating some of the broader impacts of our work. 14 Chapter 2: Background The work reported in this dissertation is related to multiple lines of research. Section 2.1 introduces the research domain and its background. This dissertation uses the TREC Legal Track 2010 test collection. In section 2.2, we explain the related work done in the area of evaluating test collections along with the necessary background about the TREC test collection utilized for our experiments. We discuss prior work about manual review in section 2.3, interactive review that supports our study (discussed in chapter 4) in section 2.4 and the predictive coding techniques in section 2.5 2.1 E-Discovery and Privilege Review In the United States, civil lawsuits generally proceed through distinct steps: plead- ings, discovery, trial and possibly an appeal. E-discovery is a process in which a producing party involved in the lawsuit is responsible to produce all the relevant electronically stored documents to the requesting party. The legal professionals are increasingly confronting a new reality: massive and growing amounts of electronically stored information (ESI) required to be retained by law, in anticipation of litigation. Spotlight has now formed on how lawyers decide to meet their obligations in various e-discovery contexts. One major aspect involves the study about how researchers adopt technology to identify relevant elec- tronic evidence in response to a discovery requests or due to some other external demand for information coming from a requesting party. Research on the process is increasingly important, given the legal costs. The e-discovery cost grows exponentially as a portion of relevant documents need to be protected due to the existence of Privilege or Attorney work-product. In litigation, there are many types of Privilege namely: • Legal Professional Privilege or Attorney-Client Privilege 15 • Public Interest Privilege • Without Prejudice Privilege • Privilege Against Self-Incrimination • Others Attorney work-product is a doctrine that protects from discovery, the materials prepared by the attorney or attorney’s representative [33]. In this dissertation, we conduct experi- ments that concentrates on this type of privilege. In legal context, Attorney-Client privilege is a right given to the parties in a lawsuit to provide protection against the involuntary disclosure of information. Attorney-client privilege in particular exists to protect the information exchange between privileged persons for the purpose of obtaining legal advice. 
Privileged persons include [33]: • the client (an individual or an organization) • the client’s attorney • communicating representatives of either the client or the attorney, and • other representatives of the attorney who may assist the attorney in providing legal advice to the client Since the 2006 amendments to the FRCP, the task of withholding documents on the basis of attorney-client privilege alone has faced multiple challenges in litigation [34, 48]. The attorney-client privilege is aimed to foster trust and promote at-will communication between the parties and their attorneys. However, privilege does not arise simply because privileged persons communicate; it can only be claimed when the content of the commu- nication merits the claim. For example, an email from Jeff Skilling (Enron’s president) sent only to James Derick (Enron’s general counsel) about pending litigation would be privileged; an email with the same content sent to both James Derrick and a personal friend of Skilling’s who was not involved in Enron’s business operations would not be, and an email from James Derrick to Skilling that indicated (only) his intent to resign in order to spend more time with his family also would not be privileged. 16 Apart from people information, privilege strongly depends on the context of the communication. Thus privilege is a property of a communication that happened between two or more privileged people about the topic of litigation. Even when the communication between the privileged entities has been made in confidence for the purpose of obtaining legal advice, the existence of privilege can be waived due to the involvement of a third party [2] or sometimes even due to inadvertent disclosures. In practice, inadvertent disclosures appear at greater frequency [1,4,33]. Such accidental disclosures of privileged information cause litigators greater anxiety, since the possibility of failing to protect the attorney-client privilege may potentially lead to lawsuits on unrelated topics. To avoid privilege to be waived due to inadvertent disclosures, dependence on human to review each and every responsive electronic document is adopted. Thus, in e-discovery, the cost of privilege review process is dominated due to the process of having human reviewers review the documents that the classifier predicts as responsive. A study of large scale review for both responsiveness and privilege which was performed with 225 attorneys, revealed that an average of 14.8 documents were annotated per hour per attorney [62]. Such numbers would cause the cost of the review process to grow quickly with the increase in collection size, making linear review impractical [60]. As linear review is becoming impractical, this dissertation attempts to determine if adopting automation to some extent can help in reducing the cost of review. 2.2 Test Collection Evaluation The modern literature on the effectiveness and reliability of retrieval experiments is largely confined to the problem of constructing test collections for IR evaluation. The Text Retrieval Conference (TREC) was created to address the problem of IR evaluation for large datasets. TREC typically follows the Cranfield paradigm [76], which evaluates the results of participating systems against a gold standard that identifies every relevant document. A test collection consists of documents and assessments of which documents are relevant to. These relevance assessments are made by human assessors. 
Depending on 17 the collection, some documents have multiple human judgments. Effectiveness measures are then calculated based on the return of relevant documents by systems under eval- uation [77]. Gathering human relevance assessments is one of the most expensive and problematic aspects of test collection formation. Human judgment is subject to various cognitive, perceptual and motivational biases [61]. Researchers identify multiple factors that influence evaluation of test collection: documents; judgment conditions; judgment scales; and factors like human expertise [51]. Saracevic [64] surveys experimental work on these factors. Analysis by Voorhees [75] shows that while absolute effectiveness scores are sensitive to variations in relevance judgments, relative scores remain broadly stable. In e-discovery, evaluating the absolute effectiveness matters at least as much as than systems’ relative scores. The traditional test collection methodology assumes that all documents in a collec- tion are judged in response to every query in the test set. As collection sizes have grown, exhaustive assessment has become infeasible. Evaluation campaigns such as TREC there- fore make use of a pooling approach, where documents for assessment are taken from the answer lists of participating systems. Zobel [52,89] finds pooling robust in determining rel- ative system rankings, but incomplete in identifying all relevant documents. Subsequent work has suggested that for very large collections, pooling may be unreliable even for relative comparisons [20, 21, 81]. Yilmaz and Aslam propose the simple random stratified sampling method [85] [87]. Pooling of results introduces bias against unpooled systems because distinctive documents returned by these systems are assumed to be irrelevant. A possible fix is to ignore unassessed documents in calculated metric scores. This was proposed by Buckley and Voorhees [21]. There has been considerable recent interest in techniques for the efficient estimation of effectiveness metrics. Yilmaz and Aslam [86] introduce infAP, a method for estimating average precision using uniform sampling from the set of complete relevance judgments. A refinement is statAP, which uses stratified sampling requiring smaller sample sizes than infAP for the same accuracy [23]. Stratified sampling was also used in the TREC Filtering Track [45]. Chapter 3 of this dissertation explores the reusability of the TREC Legal Track 2010 test collection. We utilize the collection and address multiple issues related to (1) 18 Use of the judgments from different assessors for building and evaluating classifiers and (2) adjudication conditions. In the next section we discuss the details about how that test collection was created. 2.2.1 TREC Legal Track Collection The first effort at creating a platform for e-discovery domain research and evaluation was initiated by the TREC Legal Track after the 2006 amendments to the Federal Rules of Civil Procedure (FRCP). The principal goal of the TREC Legal Track was to develop mul- tiple ways of evaluating search technology for e-discovery [13]. Keyword search approach was one of the initial attempts taken to help the lawyers manage the enormous amounts of documents [14]. Each document matching the query term in the keyword approach would be subjected to a linear manual review. The idea of using keyword search approach was to filter the number of documents to be reviewed by human annotators. 
Some extensions to the keyword search approach, called concept search, expand the search terms to include contextual information [44]. However, as corporate collections have continued to grow, filtering by keywords has left huge document sets to be linearly reviewed [19], making the linear review procedure unsupportable [60].

As more and more litigators today are familiar with the use of technology and automated classifiers, the effectiveness and evaluation of such automated classifiers has gained the interest not only of e-discovery vendors but also of the courts [55]. The use of automated classifiers offering a higher degree of technological assistance through machine learning techniques is therefore currently being studied [35]. Although many types of electronically stored documents could be important in e-discovery, emails are of particular interest because much of the activity of an organization is ultimately reflected in the emails sent and received by its employees. Since email collections are one avenue to search for communications that could be withheld on the grounds of attorney-client privilege, we utilize the relevance and privilege judgments obtained from the TREC 2010 email test collection in our experiments reported in Chapter 3 and Chapter 4.

The TREC 2010 Legal Track focused on the evaluation of search technology for discovery of ESI in litigation and regulatory settings. It consisted of two distinct tasks: the Learning task, in which participants were required to estimate the probability of relevance for each document in a large collection, given a seed set of documents each coded as responsive or non-responsive; and the Interactive task, in which participants were required to identify all relevant documents using a human-in-the-loop process. We used Interactive task topics for our experiments.

2.2.2 Topics & Assessment

In the 2010 TREC Legal Track's "Interactive task",1 one of the three relevance topics (Topic 303) required finding "all documents or communications that describe, discuss, refer to, report on, or relate to activities, plans or efforts (whether past, present or future) aimed, intended or directed at lobbying public or other officials regarding any actual, pending, anticipated, possible or potential legislation, including but not limited to, activities aimed, intended or directed at influencing or affecting any actual, pending, anticipated, possible or potential rule, regulation, standard, policy, law or amendment thereto." [29]. The privilege topic in the 2010 TREC Legal Track2 requested "all documents or communications that are subject to a claim of attorney-client privilege, work-product, or any other applicable privilege or protection". Although privilege classification is normally performed as a second pass after classification for relevance, nothing in the definition of privilege is specific to any litigated matter. The collection to be searched was version 2 of the EDRM Enron Email Collection, which includes both messages and attachments. The items to be retrieved were "document families," where (following typical practice in e-discovery) a family3 was defined as an email message together with all of its attachments. Once the submissions from the participants were received during TREC 2010, the collection was stratified for each topic and evaluation samples were drawn.
Stratification followed a pooling-based design whereby one stratum was defined for email families that all participants found relevant (the All-R stratum), another for email families that no participant found relevant (the All-N stratum), and others for the various possible cases of conflicting assessment among participating teams. The operative unit for stratification was the document family, and families were assigned intact (parent email together with all attachments) to strata.

1 A task in which participants design both a system and an interactive process for using that system.
2 For bookkeeping purposes, the (non-topical!) privilege task was Topic 304.
3 Use of families is referred to as "message" evaluation in [29].

Samples were composed following an allocation plan whereby strata are represented in the sample largely in accordance with their full-collection proportions. An exception to proportionate representation was made in the case of the very large All-N stratum, which is under-represented in the sample relative to its full-collection proportion. To gather relevance and privilege assessments manually, selection within each stratum was made using simple random selection without replacement. The process of gathering assessments followed a two-stage procedure, whereby an initial relevance assessment was made of each document in each evaluation sample and then a selection of those first-pass assessments was escalated to a subject matter expert, or Topic Authority (TA), for final adjudication.

Once the evaluation samples were drawn, they were made available to review teams for first-pass assessment. The review teams were all staffed by commercial providers of document-review services. At the outset of each review team's work, an orientation call was held with the Topic Authority for the team's topic; on the call, the Topic Authority outlined his or her approach to the topic, and the review team had the opportunity to ask any initial questions it had regarding the relevance criteria to be applied in assessing documents. Finally, once the review got under way, an email channel was opened through which the review team could ask the Topic Authority any questions that arose in the course of their assessment of the evaluation sample, whether regarding specific documents or regarding the relevance criteria in general.

Dual assessments were gathered on a subset of the families. The dual-assessment subset was chosen by random selection from families already included in the sample. Both assessments were supplied by the same review team; indeed, it is not impossible that, in some cases, the same individual supplied both assessments. What we can say about the dual assessments is that they represent distinct assessments of the same message on two different occasions.

A set of first-pass assessments was escalated to the Topic Authority for adjudication. The escalated families were drawn from multiple sources: (1) first-pass assessments appealed by one or more of the participants, (2) dual-assessed families whose assessments disagreed, and (3) a sample of non-appealed families with first-pass assessments. Once selected, the families were made available to the Topic Authority for final assessment. In making their assessments, the Topic Authorities had access to the assessment guidelines they had prepared for the first-pass assessors, as well as any other materials they had compiled in the course of their interactions with the participants.
Once the Topic Authorities had completed their reviews of their adjudication sets and the sample assessments had been finalized, the relevance judgment gathering process was deemed complete.

2.3 Manual Review

In e-discovery, documents that are initially marked as responsive to a production request (i.e., a specific request for evidence by the counterparty) are typically subjected to a linear manual review for privilege in order to be sure that content that could properly be withheld is not inadvertently revealed. Failure to identify a privileged document could jeopardize the interests of the party performing the review, so it is common practice to have highly qualified (and thus expensive) lawyers perform the privilege review. However, it is well known that human assessors frequently disagree on the relevance of a document to a topic. Experienced TREC assessors working from only sentence-length topic descriptions had an average overlap (size of intersection divided by size of union) of between 40% and 50% on the documents they judged to be relevant [75]. Voorhees concludes that 65% recall at 65% precision is the best retrieval effectiveness achievable, given the inherent uncertainty in human judgments of relevance. Bailey et al. [12] survey other studies reporting similar levels of inter-assessor agreement. One way of characterizing accuracy is by measuring inter-assessor agreement, which has consistently proven to be lower than one might expect [75, 79]. When searches are done by different users, disagreement might reflect different notions of relevance or, in our application, different ways of reaching decisions regarding privilege. Reasons for disagreement between relevance assessors, such as the instructions given to judges or differences between topics, have also been analyzed [79, 83]. In e-discovery, however, there is a single senior attorney who ultimately certifies the correctness and completeness of the review process, and their interpretation of privilege is thus taken to be authoritative [58].4 The Interactive task of the TREC Legal Track includes such a topic authority, and provides a process of appeal to this authority for uncovering assessor errors. The appeal results for TREC 2009 found that, on an assessment set in which 90% of documents were actually irrelevant, 33% of relevant assessments were in error, as were 3% of irrelevant assessments [37]. This is likely a lower bound on the error rate, since some errors may not have been appealed (although conversely some appeals may have been erroneously upheld). Carterette and Soboroff have found that when judgments from one person are used to predict the system preferences that would be obtained by computing evaluation measures using the judgments of another person, the quality of the prediction can be enhanced by selecting a relatively conservative assessor (i.e., one with a lower tendency to make false positive errors) as the source of the judgments on which the prediction is based [24]. This is an intriguing result for our privilege review task because in privilege review it is the risk of false negative errors that would generate the greatest concern on the part of the party performing the review. The Legal Track of TREC provides an objective environment in which to validate and compare different retrieval methods for e-discovery [15].
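As a concrete illustration of the overlap measure used above (the size of the intersection divided by the size of the union of two assessors' sets of documents judged relevant), a minimal sketch follows; the document identifiers and judgments in it are hypothetical.

```python
def overlap(relevant_a: set, relevant_b: set) -> float:
    """Overlap of two assessors' relevant sets: |A intersect B| / |A union B|."""
    union = relevant_a | relevant_b
    return len(relevant_a & relevant_b) / len(union) if union else 1.0

# Hypothetical judgments: each set holds the document IDs one assessor judged relevant.
assessor_1 = {"d01", "d02", "d03", "d04", "d05"}
assessor_2 = {"d03", "d04", "d05", "d06", "d07"}
print(overlap(assessor_1, assessor_2))  # 3 shared of 7 distinct documents, about 0.43
```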
Two other known studies have compared the quality of automated retrieval and manual review, one through a re-review of an earlier manual production [62], the other through an analysis of data from the TREC 2009 Legal Track [36]. The former study finds automated retrieval to be at least as consistent as manual review, while the latter concludes that automation gives superior reliability. While there has also been some work on the design and evaluation of automated classifiers to actually perform the privilege review task [29, 35, 74], there is a widely held belief among attorneys that, absent compelling reasons to the contrary (such as a need for privilege review at a scale that would otherwise be impractical), reliance on a fully automated classifier for privilege review would incur an undesirable level of as-yet uncharacterized risk. Thus automated classifiers are more often used for consistency checking on the results of a manual privilege review process than as the principal basis for that review. A part of the work in this dissertation (Chapter 5) explores a second possible use of the technology: the use of automated annotations to (hopefully) improve the accuracy and reduce the cost of a manual review process.

4 This certification can itself be litigated; in such cases the court would make the authoritative determination.

2.4 Interactive Review

Chapter 4 focuses on building tools that can help lawyers make faster and more accurate privilege judgments. We do that by scoring the importance of specific email addresses to determine each actor's propensity to engage in privileged communication. We choose email messages along with their attachments as our document collection because much of the activity of most organizations is ultimately reflected in the emails sent and/or received by their employees. For this reason, email provides an excellent environment in which to initially develop techniques to improve the productivity and accuracy of privilege review when it is rational to conduct it manually [57].

Prior work on email collections has shown promising results in classifying emails using features that separate the unstructured text (fields such as subject and body) from the semi-structured text (categorical text from the "to", "from", "cc" and "bcc" fields) [31, 49]. Shetty et al. study the pattern of email exchanges over time among 151 Enron employees during the height of the company's accounting scandal [66]. McCallum et al. took an initial step towards building a model that captures actor roles and email relationships using dependencies between topics of conversation [50]. Since then, several other generative models have been proposed [78, 88]. Identifying key nodes or individuals in email communications has become an essential part of understanding networked systems, with applications in a wide range of fields such as marketing campaigns [41] and litigation [22]. Since such social network and textual content features have been shown to uncover interesting communication patterns in emails, we attempt to exploit both the metadata information and the email content information to build features for our classification system.

Evaluating classifiers requires reliable annotated data, and the process of gathering reliable annotations is fraught with multiple problems. In e-discovery, one such problem is the requirement for skilled legal annotators, whose involvement makes the review process more expensive; the cost thus further depends on the expertise of the annotator.
Previous work has demonstrated that training a system on assessments from non-expert assessors leads to a significant decrease in retrieval effectiveness when evaluation is based on expert judgments [82], and empirical findings have shown that annotations from experts lead to better classifier accuracy [12]. However, Cheng et al. describe the benefits of utilizing noisy annotations to enhance classifier performance in a multiple-annotation-type environment [25]. It is thus reasonable to accept that many factors, such as sampling and annotator expertise, affect the process and quality of gathering relevance assessments. Although it is not realistic to expect human annotators to be infallible [75, 79, 80], we make the simplifying assumption in Chapter 5 that they are.

2.5 Predictive Coding with Cost-Sensitive Learning

Automating search, analysis, and review involves different tasks with different objectives. The objective of search is to find enough documents to satisfy an information need, such as a request for documents that are relevant to a topic. However, one of the findings from the TREC Legal Track is that many relevant documents are missed by the best present search methods [15, 69]. Thus, automated methods for retrieving relevant documents could take advantage of predictive coding techniques and ranking algorithms. The state of the art in the application of predictive coding in e-discovery is reviewed in [58], and it has been the subject of many recent studies [27, 28, 36, 62, 63, 74]. Predictive coding for privilege classification has also recently been addressed [35, 71, 73]. Four recent cases have brought predictive coding techniques to the forefront [5–8]. These cases have attracted considerable attention in law and technology blogs.

Attorneys, typically senior attorneys, work to train or calibrate the predictive coding system. Most prior studies of predictive coding, such as [27, 28], begin either with attorneys selecting a seed set of responsive and nonresponsive documents, or with attorneys reviewing and coding a random sample of documents. These initial documents are then analyzed by the predictive coding system. The system begins to make judgments on the probable relevance of other documents. The attorneys review further samples produced by the system, again applying their own judgment as to relevance, responsiveness, and privilege. The process continues until the attorneys are satisfied that the software is properly calibrated. At that point, the results are said to be optimized.

In Chapter 5 of this dissertation we use a predictive coding system to optimize the overall cost of the e-discovery process. We do this by limiting the total number of documents that need to be manually reviewed for relevance and for privilege. We quantify the different types of classifier errors in terms of costs. The cost value varies depending on the type of classifier error; in other words, the cost of each mistake is sensitive to the type of error produced by our predictive coding system. It is thus important to take the cost of every type of error into account so as to avoid the costliest errors.5 Some of the principles applied in our work are described in [16]. We utilize the idea of gain presented in [16] for ranking automatically classified documents in order to optimize the work of the human reviewers who annotate some of them.
One major difference is that [16] is more theoretical, while this dissertation can be seen as an application to an e-discovery context. In [16], the cost matrix emerges from the evaluation function (e.g., F1), which is given as an input to the problem, while in our model it is the evaluation function that emerges from the cost structure, which is given as an input to the problem. The framework we discuss in Chapter 5 employs cost-sensitive active learning. The work most closely related to ours is [17, 40, 70], in which the cost of manually annotating a document is an explicit variable in a model that ranks items for presentation to a human reviewer. However, the goal of [17, 40] is not to prioritize the documents whose annotation would bring about the highest reduction in overall cost, but to annotate the documents that would prove most valuable when used as training examples for retraining the classifier. In other words, the task we deal with is not retraining the best possible classifier, but reviewing a set of documents at the minimum possible overall cost; this difference in goals shapes the difference between that technique and our model. Other work in cost-sensitive active learning (e.g., [32, 65, 68]) is even more different from ours, since it focuses on modelling the fact that different types of items may involve different annotation costs, an issue that we do not address in our model.

5 The work discussed here is currently under review and was done in collaboration with Douglas W. Oard and Fabrizio Sebastiani: Minimizing the Expected Costs of Review for Responsiveness and Privilege in E-Discovery [54].

The next chapter explores the first component of our dissertation: designing and building a predictive coding system to identify privileged documents.

Chapter 3: Predictive Coding

In e-discovery, the task of withholding documents on the basis of privilege (attorney-client privilege or the attorney work-product doctrine) has surfaced many challenges in litigation [34, 48]. As more and more litigators today are familiar with the use of automated classifiers, the effectiveness and evaluation of such classifiers has gained the interest not only of commercial e-discovery vendors but also of the courts [55]. This chapter details the contribution of building and evaluating privilege classifiers using the only existing test collection.1 As a first step, we develop a test set to enable fair classifier evaluation. We next evaluate how reliable and reusable the existing test collection is. We finally build binary privilege classifiers that utilize the judgments from the test collection [74].

Evaluation of information retrieval systems relies on test collections in which relevance judgments can be created for only a small portion of the collection [67]. One approach to evaluating our systems is to use a test collection with relevance and privilege assessments. Since collections of realistic size are too large to evaluate exhaustively for relevance and then for privilege, the approach taken instead is to assess documents that are highly ranked or retrieved by at least one retrieval system. Selecting these documents focuses attention on the documents found by the systems that are to be compared. This approach, known as pooling, has been widely used in the Text Retrieval Conference (TREC) and elsewhere. The procedure of pooling system results during the creation of the TREC 2010 Legal Track collection raised two major concerns.
We study the first concern by asking a question about reliability, since different assessors may reach different judgments for the same document. Voorhees has shown that absolute measures of effectiveness are sensitive to this effect, but that relative comparisons between systems are relatively insensitive to inter-assessor disagreement [75]. A second concern, reusability, is that new systems will generally find some documents that did not contribute to the pool, and assuming such documents not to be relevant might adversely affect even relative comparisons. Reusability is important because reusable test collections allow the cost of relevance judgments to be amortized over future uses of a test collection.

1 The work discussed in this chapter was published at the SIGIR and ICAIL conferences and was done in collaboration with Douglas W. Oard and Jiaul Paik: Assessing the Reliability and Reusability of an E-Discovery Privilege Test Collection [74] and Evaluating Expertise and Sample Bias Effects for Privilege Classification in E-Discovery [71].

Reusability of pooled judgments was examined by Zobel [89], who found that TREC pooling had likely found no more than half of the relevant documents, but that relative comparisons remained reliable. Buckley et al. [21] later highlighted a key limitation of that conclusion, finding that when distinctive systems had contributed to the pool, removing one such system could have a substantial adverse effect on measurements of mean average precision. One way to partially address this concern, introduced by Yilmaz and Aslam, is to sample the documents to be judged from the full collection and then to estimate the evaluation measure from the sampled judgments [85, 87]. Random samples drawn from very large collections yield confidence intervals that are so large as to be uninformative, so in this chapter we explain and focus on the sampling design used in the Interactive task of the TREC Legal Track, in which set intersections were used as a basis for stratification [56].

Between 2006 and 2011, the TREC Legal Track created relevance judgments for more than 100 topics (which in e-discovery are called "production requests"). In 2010, this was augmented by the world's first (and to date only) shared-task evaluation of privilege classification [29]. We study the reliability and reusability of the resulting test collection. When working with Legal Track test collections we need to think a bit differently about reliability and reusability. For reliability, we are interested not just in relative comparisons, but also in the reliability of absolute measures of effectiveness, and most particularly in estimates of recall. Point estimates from samples are (in expectation) insensitive to sample size, so characterizing the reusability of stratified samples requires comparing confidence intervals for systems that did and did not contribute to the stratification. What we call reliability thus corresponds to the statistical concept of measurement error, and reusability to the statistical concept of sampling error. We then study the effect of training data by training privilege classifiers on two sets of families from the TREC Legal Track test collection.

3.1 Test Collection

This section introduces the background required to answer our research question RQ2.
In the 2010 TREC Legal Track, the document collection used for all Interactive tasks (including the privilege detection task) was derived from the EDRM Enron Collection, version 2, which is a collection of Enron email messages. The privilege task during the 2010 TREC Legal Track2 was to identify, so that they could be withheld, "all documents or communications that are subject to a claim of attorney-client privilege, work-product, or any other applicable privilege or protection". Although privilege classification is normally performed as a second pass after classification for relevance, nothing in the definition of privilege is specific to any litigated matter. The collection to be searched was the EDRM Enron collection, version 2, a collection of Enron email messages for which text extracted from the attachments is provided with the collection. Following the practice for privilege review in e-discovery, the items to be classified were "document families"; in this case a family was defined as an email message together with all of its attachments.3

3.1.1 Stratified Sampling

Two teams (A and H)4 submitted system results (runs) for the TREC 2010 privilege classification task: Team A submitted four runs (a1, a2, a3, a4); Team H submitted one (h1). Each run was a binary assignment of families to one of two classes: privileged or not privileged. Following TREC convention, we refer to these five runs as participating systems; each run was produced by people and machines working together (TREC refers to this as the Interactive task). The collection was partitioned into 32 strata, each defined by a unique 5-bit vector (e.g., 01010 for the stratum containing families that runs a1, a3, and h1 classified as not privileged and runs a2 and a4 classified as privileged) [29].

2 For bookkeeping purposes, the (non-topical!) privilege task was "topic 304."
3 Use of families is referred to as "message" evaluation in [29].
4 In [29], Team A was called CB and Team H was called IN.

The 00000 stratum included 398,233 of the 455,249 families (87% of the collection), but only 3,275 of the 6,766 samples (48%) were allocated to that stratum. The resulting sampling rate for the 00000 stratum (0.8%) was far sparser than for any other stratum (which averaged 6.1%). The allocation of samples was a bit denser for smaller strata, since a 6.1% sampling rate might otherwise result in very few samples being drawn. Few samples were allocated to these very small strata in aggregate, so the sampling rate remained above 6% for every stratum other than the 00000 stratum.

3.1.2 Privilege Assessments

First-tier junior-level privilege assessors (henceforth, assessors), who were lawyers employed by a firm whose business included the provision of document review services for e-discovery, were provided with detailed guidelines written by a senior privilege review attorney (the Topic Authority (TA)). Assessors recorded ternary judgments: privileged, not privileged, or unassessable (e.g., for display problems, foreign-language content, or length). As expected, assessors sometimes made judgments that disagreed with the TA's conception of privilege. For other tasks, differing judgments might be treated as equally valid, but in e-discovery the TA's judgments are authoritative (because the TA models the senior attorney who will certify that the review was performed correctly). Judgments that disagree with those of the TA are therefore considered incorrect.
In TREC 2010, an assessor's judgment regarding whether a family should be classified as privileged could be escalated to the TA for adjudication in three ways. First, a team might appeal the decision of an assessor to the TA. A total of 237 such appeals were received. Of course, teams might not as easily notice, nor would they rationally appeal, assessor errors that tended to increase their estimated classification accuracy. In particular, no team would rationally appeal an erroneous assessor judgment of privileged in the 11111 stratum, nor an assessment of not privileged in the 00000 stratum. The set of appealed judgments is thus biased [80]. To create an unbiased sample, 223 assessor judgments were therefore independently drawn using simple random sampling. Since this is a random sample of a stratified sample, it results in a smaller stratified sample of the full collection. To facilitate symmetric comparisons among assessors, a second simple random sample containing 730 families was drawn, and each family in that sample was duplicated in the set of families to be assessed. This was done in a manner that had been expected to result in the duplicated families being assigned to different assessors.5 When conflicting assessments for a family were received, the judgment was adjudicated. Table 3.1 summarizes the selection process. Families chosen to be adjudicated by the TA will henceforth be called the Adjudicated Set (AS); families that were not adjudicated form the Non-Adjudicated Set (NAS).

5 Some pairs may have been judged by the same assessor.

Table 3.1: TA adjudication rates.
Category                 Assessed    Adjudicated    Rate
Random sample            6,766       223            3.3%
Team appeal              6,766       237            3.5%
Assessor disagreement    730         76             10.8%

Table 3.2: Training families.
Train-Set Case    Privileged    Not-Privileged    Prevalence
AS-TA             166           113               0.59
AS-A              169           110               0.60
NAS               166           113               0.59

3.2 Evaluation Plan

In this section, we answer our research question RQ1 and introduce the framework for questions RQ3a and RQ3b. We study the effect of training privilege classifiers on two sets of families, and we utilize the relevance judgments for building and evaluating our classifiers. Since the families in the AS are biased by the presence of the families appealed by the teams and the families on which the assessors disagreed, creating an unbiased set of adjudicated families for evaluation requires eliminating the selection bias by re-sampling from the biased adjudicated categories. Figure 3.1 shows our re-sampling procedure for a single stratum, 00110. In the 00110 stratum, 40 dual-assessed families that caused disagreement among the assessors were adjudicated, along with 58 families that were appealed. To maintain the sampling probability at the rate of 0.03,6 we randomly draw two families from the appeal (A) category and one from the assessor disagreement (D) category and include these families in the test set.

6 This is the sampling probability of the random category.

Figure 3.1: Re-sampling procedure (for stratum 00110, the appeal category is down-sampled from A(58/58) to A(2/58) and the disagreement category from D(40/40) to D(1/40), while the random category R(44/1343) is retained, giving a new sampling probability of approximately 0.03).

This procedure is repeated for each stratum, creating an unbiased stratified sample of 252 families across all strata.
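The per-stratum re-sampling just described can be made concrete with a small sketch. This is an illustrative reconstruction rather than the code actually used: the data structures are assumptions, and the target rate is taken to be the random category's sampling rate within the stratum, as the description above suggests.

```python
import random

def resample_stratum(appeal, disagreement, random_cat, n_stratum, seed=0):
    """Down-sample the biased adjudication categories of one stratum so that the
    retained families approximate the random category's sampling rate.

    appeal, disagreement, random_cat: lists of adjudicated family IDs per category.
    n_stratum: total number of families in the stratum.
    """
    rng = random.Random(seed)
    target_rate = len(random_cat) / n_stratum        # e.g., 44 / 1343, roughly 0.03
    keep = list(random_cat)                          # the random category is kept intact
    for category in (appeal, disagreement):
        n_keep = round(target_rate * len(category))  # families this category may contribute
        keep.extend(rng.sample(category, min(n_keep, len(category))))
    return keep

# Hypothetical stratum 00110 with 1,343 families (the IDs are made up):
appealed = [f"a{i}" for i in range(58)]
disagreed = [f"d{i}" for i in range(40)]
randomly_drawn = [f"r{i}" for i in range(44)]
test_families = resample_stratum(appealed, disagreed, randomly_drawn, n_stratum=1343)
print(len(test_families))  # 44 random + 2 appealed + 1 disagreement = 47
```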
To reduce the impact of measurement error on classifier evaluation, we use the TA judgments (on the unbiased 252 families in the held-out test set) as the gold standard [74]. The remaining families in the AS and NAS are used for training our classifiers. Figure 3.2 graphically explains the process of selecting families for training and testing our classifiers. The 6,766 family annotations from the 2010 TREC Legal Track are utilized to create an unbiased test set. Although the families in the held-out test set have assessments from both the assessors and the TA, we use the TA judgments on the 252 families in the test set as the gold standard for evaluation [74]. The remaining families in the AS annotated by the TA (AS-TA) and by the assessors (AS-A), along with the annotations from the NAS, create the three training cases. Table 3.2 shows the privilege class prevalence and the number of privileged and not-privileged families in each of the three training cases.

Figure 3.2: Train-set and test-set split procedure (the 6,766 annotated families from the 2010 TREC Legal Track are divided into the 536-family Adjudicated Set (AS) and the 6,230-family Non-Adjudicated Set (NAS); after removal of bias (refer to Figure 3.1), a held-out test set is drawn, and the Topic Authority and assessor annotations on the remaining families yield the three train-set cases NAS, AS-TA, and AS-A).

We build three different classifiers for each of these three training sets. The classifiers differ in their feature sets, as explained in Section 3.3. Thus, the nine automated classifiers (three different models trained on three different train-sets) allow us to study the influence of (1) annotator expertise and (2) selection bias in the training families. We build supervised classifiers using labeled families from the two disjoint sets: one set utilizes the families in the AS for training, while the other utilizes an equal number of families (to maintain the prevalence π) from the NAS. Since the families in the AS are dual-assessed, we utilize the assessments from the TA (model-AS-TA7) and from the assessors (model-AS-A8) to study the effect of expertise on classifier training. All families in the NAS are annotated only by assessors. Thus, in the results section, we use the classifiers' performance to (1) analyze the effect of expertise on training by comparing model-AS-TA and model-AS-A, and (2) analyze the effect of selection bias on training by comparing model-AS-A and model-NAS.

7 This notation denotes that the model is trained on families in the AS with expert (TA) judgments.
8 This notation denotes that the model is trained on families in the AS with non-expert (assessor) judgments.

3.3 Classifier Design

Traditionally, text classification applications have achieved successful results using the bag-of-words representation. A number of approaches have sought to replace or improve the bag-of-words representation by adding complex features; however, the results have been mixed at best. Although privilege classification can be viewed as a classic text classification problem, the parameters that determine attorney-client privilege depend strongly on (1) the people involved and (2) the content of the email communication. Since both people and content are important in finding privilege, we use both the network information and the content information of the families to define features.

Table 3.3: Separation of email data.
Actor-Centric Features (view1)    Content-Centric Features (view2)
Sender information                Subject field data
Recipient information             Data in the email body and attachments
We do this by separating the infor- mation in each family into two disjoint components (henceforth called views). as shown in Table 3.3. The first view view1 exploits the metadata9 information to obtain the importance score of each actor. We removed a small handful of labeled families (29 families) that are missing sender or/and recipient information during our experiments. In this view, a family is represented as a directed multi-graph (a graph in which multiple edges are permitted between the same nodes) in which each node is an actor and each edge is an email commu- nication between actors. We define view1 as a Graph Model (GM). Our intuition is that, an email message sent/received by an actor “a” has a high probability of being privileged if actor “a” frequently communicates with other actors who have a higher probability of being involved in privileged communications. The second view view2 utilizes the content information in each family. view2 is defined as a Content Model (CM). In CM, we use only the words occurring in the subject field and the content field of the family to derive term features. For model performance comparison, we build a joint model called Mixed Model (MM). The MM uses the features from both the GM and CM. In our experiments, we used three types of classification algorithms: Linear Kernel Support Vector Machines (SVM), Logistic Regression and NaiveBayes, all using the implementations in the Python Scikit-Learn Framework.10 We report only linear kernel SVM classifier results since we did not observe any significant change in the model performance while using the other two classification algorithms. We compare the classifier results by deriving point estimates for recall and precision with two-tailed 95% approximate confidence intervals. 9Data in From, To, Cc and Bcc fields 10http://scikit-learn.org/stable/ 35 3.3.1 Models In this section, we describe the models in detail. We explain the estimation and interval calculation. 3.3.1.1 Graph Model 1. P, DocID=2 NP, DocID=5 P, DocID=6 2 4 NP, DocID=4NP, DocID=4 3 Figure 3.3: Sample Graph One common way of representing the information extracted from view1 is by a directed graph structure. Let G = (V,E) denote a directed multi-graph with node set V and edge set E. For a single directed edge (u, v), u is called the sender and v the recipient of the email communication. In the model built using view1 data, each node would represent an individual person and the edge linking the two nodes would represent a family. Consider an example graph sample space G as shown in Figure 3. Here, each edge connecting the nodes is a labeled family. Each labeled training family is represented by the nodes as its features. However our feature extraction technique faces challenges in identifying unique nodes in emails due to the absence of a linked knowledge base. Hence as a first step, we extract unique actors from emails using string pattern matching approach. The task is defined as follows: an email is composed of multiple actors with a variety of name mentions as shown in Figure 3.4. The objective is to identify a set of unique actors across all email communications. To obtain a unique set of actors, we extract the (sender, [recipient]) from each family. Once this is done, we compute the similarity using a pattern recognition algorithm between every pair of nodes [18]. 
The steps for computing similarity in name mentions of nodes in emails are as follows: (1) Remove suffixes (like "jr" and "sr") and generic terms like "admin", "enron america", "support", "sales", etc., and turn all white-space into a single hyphen. Next, we merge the first name with the last name using a single hyphen so that the person's full name is recognized as a single entry; this step ensures that mike.mcconnell and mike.riedel are not treated as similar. At the end of this step we obtain a list of actor nodes N. (2) For each node n in the set N we identify a set of similar nodes using an approach to matching string patterns based on the Ratcliff-Obershelp algorithm [18]. We used the implementation provided by the Python "difflib" module with the cutoff threshold set to 0.75. For the examples shown in Figure 3.4, given the target node "mark.taylor@ees.com", the following close matches are obtained: "mark-taylor" and "mark.taylor@enron.com". Next, we obtain the correct match by comparing the target word with all its close matches and identifying the matching sub-sequences. The accuracy of identifying unique nodes using this technique is 0.83, with false positive errors occurring at a higher rate (0.62) than false negatives. As future work, we propose to adopt a better approach to clustering nodes in order to reduce the false positive errors.

Figure 3.4: Actor variants in emails (three example messages in which the same actor appears as "Mark Taylor", "Mark.Taylor@ees.com", and "Mark.Taylor@enron.com", and signs simply as "Mark").

3.3.1.2 Content Model

Figure 3.5: Content-centric information in emails.

In this model, an email family is typically stored as a sequence of terms, where the terms represent a collection of text from the email message together with the text in all its attachments. Information retrieval researchers have developed a variety of techniques for transforming the terms representing documents into vector space models in order to perform statistical classification. In the Content Model, we simply use the words occurring in the subject field and the content field of the family to derive term features. We remove any metadata information (the text shown in black in Figure 3.5) included in the body of the email message. Figure 3.5 shows the boundaries of the content data extracted from the email message. Text in the attachments is also included in the Content Model. After extracting the text content, we represent the text as a vector space model in which the terms are scored using a TF-IDF weighting algorithm.

Table 3.4: Contingency table.
Prediction \ Judgment    Privileged    Not Privileged    Total
Retrieved                N_{rp}        N_{rp'}           N_{r}
Not Retrieved            N_{r'p}       N_{r'p'}          N_{r'}
Total                    N_{p}         N_{p'}            N

3.3.2 Evaluation Metric

The evaluation metrics are derived from two intersecting sets: the set of families in the collection that are privileged, and the set of families that a system retrieves (as shown in Table 3.4). Section 3.3.2.1 and Section 3.3.2.2 explain the derivation of point estimates and confidence intervals, respectively.
3.3.2.1 Point Estimate

This section details the calculations used to estimate the recall and precision of a system. In order to estimate the precision of system $T_i$, we estimate $N^i_{rp}$, the number of privileged families returned by system $T_i$, and the total number of families returned by that system, $N^i_r$. Let $N^h_{rp}$ be the number of privileged families in stratum $h$. The number of privileged families returned by system $T_i$ is the sum of the number of privileged families in the strata returned by system $T_i$. Thus, if $\hat{N}^h_{rp}$ is an unbiased estimator of $N^h_{rp}$, then

$$\hat{N}^i_{rp} = \sum_{h\,:\,T_i \in T^h} \hat{N}^h_{rp} \qquad (3.1)$$

is an unbiased estimator of $N_{rp}$ for system $T_i$, where $T^h$ is the set of all systems that retrieved documents in stratum $h$. Now, let the number of documents in stratum $h$ be $N^h$. A sample of size $n^h$ is drawn from the stratum by simple random sampling without replacement, and $n^h_p$ of the families in the sample are privileged. Then, an unbiased estimator of $N^h_{rp}$ is

$$\hat{N}^h_{rp} = N^h \cdot \frac{n^h_p}{n^h} \qquad (3.2)$$

Finally, the estimator of system $T_i$'s precision can be obtained using

$$\widehat{\text{Precision}}^{\,i} = \frac{\hat{N}^i_{rp}}{N^i_r} \qquad (3.3)$$

In order to estimate recall, an estimate of $N_p$, the total number of privileged documents (the yield of the collection), is also required. An unbiased estimate of $N_p$ is obtained by summing the yield estimates for each stratum:

$$\hat{N}_p = \sum_{h} \hat{N}^h_p \qquad (3.4)$$

The recall estimate of system $T_i$ is then calculated using the expression

$$\widehat{\text{Recall}}^{\,i} = \frac{\hat{N}^i_{rp}}{\hat{N}_p} \qquad (3.5)$$

3.3.2.2 Confidence Intervals

The recall and precision values derived in Section 3.3.2.1 are point estimates, and are subject to random variation due to sampling and measurement error. Here, we focus on providing an indication of the expected range of variability around a point estimate and on accounting for it when comparing two scores. A two-tailed $(1-\alpha)$ confidence interval, $[\theta_l, \theta_u]$, provides the range within which the population parameter $\theta$ lies with confidence $(1-\alpha)$; in other words, if samples were repeatedly drawn from the population, and intervals calculated using the same method, then $(1-\alpha)$ of the time that confidence interval would include $\theta$, the parameter of interest. An exact confidence interval is calculated by finding the lowest upper and highest lower values of $\theta$ that satisfy a one-tailed significance test. Exact confidence intervals are often hard or impossible to calculate [9]. An approximate confidence interval is derived by other methods, and typically aims to achieve $(1-\alpha)$ coverage on average across values of the parameter $\theta$, rather than guaranteeing it for every parameter value. In the experiments reported in this chapter, we calculate 95% approximate confidence intervals from beta-binomial posteriors on stratum yields [?].

3.4 Results

In this section we report the results for RQ2 (Section 3.4.1) and for RQ3a and RQ3b (Section 3.4.2).

3.4.1 Test Collection Bias

Here we analyze the reliability and reusability of the TREC 2010 Legal Track privilege task test collection.

Analysis of Measurement Error

The use of assessor judgments for families that the TA had not adjudicated would be reasonable if the appeal process had identified most of the assessor errors. This is a testable hypothesis. Although the TA might also make errors, we ignore that factor because we believe its effect to be small. We therefore treat the TA's judgments as a gold standard. As a further simplification, we treat the small handful of unassessable documents (13 families) as not privileged in our analysis.
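Before turning to that analysis, the stratified estimators of Section 3.3.2.1 can be made concrete with a short sketch. Only the form of Equations 3.1 through 3.5 is taken from the text; the stratum sizes, sample counts, and system assignments below are hypothetical.

```python
def stratified_recall_precision(strata, system):
    """Point estimates of recall and precision for one system (Equations 3.1-3.5).

    strata: list of dicts, one per stratum h, with
        N       - number of families in the stratum,
        n       - number of sampled (judged) families,
        n_p     - number of sampled families judged privileged,
        systems - set of systems that classified this stratum as privileged.
    """
    N_rp_hat = 0.0   # estimated privileged families retrieved by `system` (Eqs. 3.1-3.2)
    N_r = 0          # families retrieved by `system`
    N_p_hat = 0.0    # estimated yield of the whole collection (Eq. 3.4)
    for h in strata:
        stratum_yield = h["N"] * h["n_p"] / h["n"]   # Eq. 3.2
        N_p_hat += stratum_yield
        if system in h["systems"]:
            N_rp_hat += stratum_yield
            N_r += h["N"]
    precision = N_rp_hat / N_r       # Eq. 3.3
    recall = N_rp_hat / N_p_hat      # Eq. 3.5
    return recall, precision

strata = [
    {"N": 398233, "n": 3275, "n_p": 30,  "systems": set()},        # e.g., the 00000 stratum
    {"N": 4000,   "n": 250,  "n_p": 120, "systems": {"a4", "h1"}},
    {"N": 1343,   "n": 47,   "n_p": 20,  "systems": {"a4"}},
]
print(stratified_recall_precision(strata, "a4"))
```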
One way of visualizing the effect of assessor errors is to use some or all of the families that were selected for adjudication, plotting confidence intervals using TA judgments in one case and assessor judgments in the other. The adjudicated sample is less than 8% of the size of the full set of official judgments, so this yields fairly large confidence intervals, but the comparison does offer useful insights. Figure 3.6 compares the (95%) confidence intervals on recall for each participating system using only the families that were selected for adjudication by the simple random sample; Figure 3.7 shows a similar comparison using all of the adjudicated families. From Figure 3.6 we can observe that judgments from assessors yield somewhat higher recall estimates than the judgments from the TA, but Figure 3.7 shows the opposite effect. The difference results from some combination of sampling error, appeals that disproportionately benefit participating systems, or systematic biases in the families on which assessors disagree. As the size of the error bars illustrates, we cannot reject sampling error as an explanation. Nonetheless, there is some evidence to support the hypothesis that appeals disproportionately benefit participating systems.

Figure 3.6: Recall, a4 ablated, random adjudication (recall with confidence intervals for runs a1 through h1, computed with assessor judgments and with Topic Authority judgments as the gold standard).

Figure 3.7: Recall, a4 ablated, all adjudication (recall with confidence intervals for runs a1 through h1, computed with assessor judgments and with Topic Authority judgments as the gold standard).

Table 3.5 shows how the overturn rate varies with the reason for adjudication and with the original judgment. As the random sampling results show, assessors are more likely to mistakenly judge a family as privileged than as not privileged. Specifically, a z-ratio test for independent proportions finds the 1 → 0 overturn to be significantly more likely than a 0 → 1 overturn (p < 0.05). The same is not true for documents appealed by participating teams, however, where the overturn rates in each direction are statistically indistinguishable. Said another way, the increase in total overturn rate from 23% to 36% between randomly sampled adjudications and appealed adjudications (a 58% relative increase) can be largely explained by participating teams being no better than chance at recognizing an assessor's false positive judgments, but much better than chance at recognizing an assessor's false negative judgments.

The implications of this for the reliability of the test collection are clear: estimating absolute measures, and particularly absolute estimates of recall, using assessor judgments that exhibit systematic errors results in estimates that are open to question. If uncorrected assessor judgments were a small fraction of the total judgments, this would be a relatively minor concern, but uncorrected judgments are being used for about 92% of the sampled families. On the positive side, the availability of adjudicated random samples offers the potential for modeling differential error rates conditioned on the first-tier assessor's judgment. On the negative side, the inability to associate judgments with individual assessors in TREC 2010 means that such corrections can only be applied on an aggregate basis.
We note, however, that relative comparisons between participating systems can still be informative, so long as assessor errors penalize all participating systems similarly.

Analysis of Sampling Error

To assess reusability, we need to assess the comparability of evaluation results for systems that did and did not contribute to the development of the test collection. A standard way of performing such analyses is through system ablation [89]: removing a system that in fact did participate in the stratification, then rescoring all systems, including the ablated system, and observing the effect on system comparisons. With pooling, ablation results in removing judgments for documents that were uniquely found by one system. With stratified sampling, by contrast, ablation results in re-stratification. For example, when system a4 (the participating system with the highest recall) is ablated, the 00000 stratum and the 00010 stratum become merged into a 000?0 stratum (where ? indicates a don't-care condition), the 11001 stratum gets merged with the 11011 stratum to form a 110?1 stratum, and similarly for each other stratum pair that is differentiated only by the ablated system. If we then reapply the process for deciding on the number of families to sample from each merged stratum, we will see little effect on the sampling rate for most strata. The one important exception is the 000?0 stratum (continuing with our example of ablating system a4), where we are merging large strata with quite different sampling rates (very small strata can also see substantial changes in their sampling rate, but their effect on the overall estimate will be small).

Table 3.5: Overturn rates (assessor → Topic Authority).
Adjudication Basis    0 → 1              1 → 0
Random sample         31 of 161 (19%)    20 of 62 (32%)
Team appeal           32 of 77 (42%)     54 of 160 (34%)
Disagreement          28 of 49 (57%)     9 of 27 (33%)

Figure 3.8: Precision, a4 ablated, all adjudication (precision with confidence intervals for runs a1 through h1, before and after ablation).

We therefore model the effect of ablation by allocating all of the samples in each pair of strata to the corresponding merged stratum, adjusting the contributions of each sample to the estimate of the yield for the merged stratum to be equal. To generalize, let $a$ refer to the stratum in the pair including families classified as privileged by the ablated run, $b$ to the corresponding stratum containing families classified as not privileged by the ablated run, and $c$ to the merged stratum. We assume that the merged stratum would include the same number of samples that the two original strata contained separately; that is, $n_c = n_a + n_b$, and the sampling rate for merged stratum $c$ is $p_c = n_c / N_c$, where $N_c = N_a + N_b$. We performed three ablation experiments, in each case ablating one system with high, medium, or low recall and then recalculating point estimates and confidence intervals for every system. Comparing post-ablation to pre-ablation results, we see that point estimates are unchanged, as expected, but, as Figure 3.8 shows, confidence intervals for precision increase for the ablated system (system a4 in this figure).
We attribute this to the reduction in the sampling rate for the 00010 stratum (from merging with the 00000 stratum, which results in documents in the former 00010 stratum being sampled at a far lower rate), since we expect families classified uniquely by any reasonable system as privileged to more often actually be privileged than families that no system classified as privileged. The same pattern is evident in our other two ablation experiments (ablating systems a2 or h1; not shown). No similar effect was observed for confidence intervals on recall, however, perhaps because the estimates for the retrieved set contribute to both the numerator and the denominator of the recall computation.

3.4.2 Expertise and Sample Bias in Classifier Results

Here we analyze the influence of (1) annotator expertise and (2) selection bias on classifier training.

Effect of Annotator Expertise

We study the effect of annotator expertise on training by using the adjudicated families for training (the families in sets AS-TA and AS-A) and the unbiased held-out set for testing. Although the sample drawn for adjudication represents less than 8% of the total size of the official judgments, so the results yield fairly wide confidence intervals, the comparison discussed here does offer useful insights. We compare classifier performance using recall and precision values with 95% confidence intervals. Figure 3.9 shows the performance, with (95%) confidence intervals on recall and precision, for the three classifiers, each trained on each of the three training cases. Comparing the performance of the GM (GM-AS-TA and GM-AS-A) and CM (CM-AS-TA and CM-AS-A) classifiers trained on the AS, we observe that neither the classifiers trained on expert annotations nor those trained on non-expert annotations yield clearly better results. However, by comparing the performance of the joint MM models, MM-AS-TA and MM-AS-A, we observe a significant increase in the recall of the automated classifier trained on families in the AS with the expert's (TA) annotations.

We explain this by collectively analyzing the classifiers' privilege predictions on the families in the test set. Figure 3.10 shows the intersecting sets of all the classifiers' predictions on the privileged families in the test set. By analyzing pairs of intersecting sets, namely (1) CM-AS-TA and MM-AS-TA (a total of 22 + 0 - 1 families11) and the sets CM-AS-A and MM-AS-A (a total of 15 + 4 - 0 families), and (2) GM-AS-TA and MM-AS-TA (a total of 7 + 7 - 1 families) and the sets GM-AS-A and MM-AS-A (a total of 2 + 13 - 0 families), we conclude that the MM-AS-TA model gains a significant increase in recall over MM-AS-A.

Figure 3.9: Effect of annotator expertise on training (precision versus recall, with confidence intervals, for the GM, CM, and MM classifiers trained on each of AS-TA, AS-A, and NAS).

Effect of Selection Bias

Comparing the performance of model-AS-A and model-NAS for each of the three classifiers (MM, GM and CM) in Figure 3.9 shows that the automated classifiers trained on the unbiased annotations from cheaper non-expert sources (the families in the NAS) produce the best results.
An increase in recall is noticed for all the classifiers trained on the NAS (GM-NAS, CM-NAS, MM-NAS) when compared to the corresponding classifiers trained on AS-A (GM-AS-A, CM-AS-A, MM-AS-A). A possible explanation for this finding is the presence of bias in the choice of training families. Since families in the AS have a selection bias due to the presence of (1) families on which the assessors disagreed and (2) families appealed by the teams, we argue that training classifiers on families in the AS could affect the results because of the presence of families that are hardest to annotate (which explains the assessor disagreement) or that could strategically benefit a team's performance (which explains the team appeals). Nonetheless, we have shown some evidence that supports our findings that (1) training classifiers on families chosen at random (annotated by non-expert reviewers) yields the best results and (2) experts' annotations can also be useful in training automated privilege classifiers.

11 A privileged family that is predicted as not privileged by both CM-AS-TA and GM-AS-TA.

Figure 3.10: Analysis of classifier privilege predictions (Venn-style intersections of the privileged-family predictions of the CM, GM, and MM models, shown separately for the models trained on AS-TA and on AS-A).

3.5 Summary

In this chapter, we have explored set-based evaluation for privilege classification using stratified sampling, with strata defined by the overlapping classification results from different participating systems. We have studied collection reliability by examining the impact of unmodeled assessor errors on evaluation results, and collection reusability by showing that confidence intervals are affected when we reconstruct the test collection in a way that does not rely on the contributions of one participating system. We show that assessor errors do adversely affect absolute estimates of recall.

To study the effect of training data on classifier accuracy, we utilize the privilege judgments from the 2010 TREC Legal Track. We conduct our analysis by training automated classifiers on privilege judgments from annotators with different levels of expertise, and we study the effect of selection bias in the annotated samples on training. Approximate confidence intervals from beta-binomial posteriors on stratum yields are employed for comparing classifier results. We conclude that selection bias in training can hurt classifier performance. Our results show that training privilege classifiers on randomly chosen, non-expert annotations generally yields the best results. As future work, we propose to study the effect of annotator expertise on training not only privilege classifiers but also responsiveness classifiers, with the aim of arriving at a cost-effective training methodology.

Chapter 4: Manual Review

Manual review denotes a process in which every document that is marked for production is reviewed for relevance (responsiveness and/or privilege) by at least one human reviewer. Exhaustive manual review involves having a human reviewer examine every document in a collection and code each document as relevant or non-relevant, and perhaps apply additional labels such as "privileged" or not, "hot document" or not, and sometimes specific issue tags. It is not uncommon to have human reviewers exhaustively annotate documents during the privilege review phase.
Manual review is often accompanied by some sort of quality control process in which a portion of the documents is re-reviewed and, where indicated, re-coded by a second, more authoritative reviewer or a senior attorney. When the coding decisions disagree, action may be taken to diagnose and mitigate the cause. However, the vast majority of documents in the collection are reviewed only once, and the original reviewer's coding is the sole determinant of the disposition of the document. Automated review denotes a situation in which the decision to produce or not to produce some proportion of the documents is made algorithmically, without a linear manual review. The term "technology-assisted review" is often used instead. In this chapter we introduce what we call technology-assisted manual review, which utilizes automation during the manual review process.1

Lawyers have shown interest in adopting predictive coding techniques for finding relevant evidence. Because the stakes involved in inadvertent disclosure of privileged content are high, it is natural to doubt whether any fully computerized technique can accurately recognize content that can properly be withheld. Hence, attorneys are reluctant to trust fully automated techniques for privilege review.2 This chapter describes the design of an interactive system to support privilege review, in which the goals are to improve the speed and accuracy of that review.

1The work discussed in this chapter was published at the CHIIR and ASIS&T conferences and was done in collaboration with Douglas W. Oard and Amittai Axelrod: An AID for Avoiding Inadvertent Disclosure: Supporting Interactive Review for Privilege in E-Discovery [73] and Finding the Privileged Few: Supporting Privilege Review for E-Discovery [72].

4.1 Problem Design

Our work in this chapter is focused on providing useful cues to human reviewers during the privilege review process in e-discovery. Several types of privilege might be asserted, but in this chapter we focus principally on attorney-client privilege.3 Our basic approach to supporting privilege review is to train feature or model annotators4 to label specific components of a message with information that we expect might help a reviewer to make a correct decision. We use a total of five annotators to enrich three types of components: people (or, more specifically, the email addresses for senders and recipients of a message), terms (words found in the message or in attachments to the message), and the date (on which the message was sent). In each case, we compute a numerical score for which higher values indicate a greater likelihood of privilege [72]; for people we also annotate job responsibilities (when known) or organization type (when known, if the job responsibilities are not known). We study the usefulness of different types of features to human reviewers using a within-subjects user study in which six lawyers each reviewed two sets of documents (email messages, together with their attachments), one set using a baseline system with no annotations and the other set using our AID system (named for our goal of Avoiding Inadvertent Disclosures), in which annotations were shown for people, terms, and dates. Quantitative measures of review accuracy (e.g., precision and recall) and of review speed are augmented with analysis of self-reported responses to questionnaires and interviews.
We seek to answer three research questions (RQ4a, RQ4b and RQ4c):

2So long as the scale of the privilege review (i.e., the number of relevant documents) is not so great as to preclude manual review.
3The rationale behind attorney-client privilege is that justice will be best served when attorneys can communicate freely with their clients (e.g., on matters of fact, intent, or legal strategy), and open communication can be fostered by prospectively protecting such communication from disclosure.
4We use the word "annotator" here to refer to an automated system that generates the features.

• Does the accuracy of the users' privilege review judgments improve when system-generated annotations are presented during privilege review?
• Does the users' review speed improve when system-generated annotations are presented during privilege review?
• Which system-generated annotations do users believe are most helpful?

Our results indicate that recall can be enhanced by displaying annotations. Although the improvements in recall come at some cost in precision, given the nature of this application, that cost may be acceptable. Participants in the study principally attribute the beneficial effects to annotations of people (rather than of terms or of dates). These formative evaluation results have implications for annotator and interface design.

4.1.1 Privilege Features

Privilege in a legal context is a right given to the parties in a lawsuit that provides protection against the involuntary disclosure of information. Attorney-client privilege in particular exists to protect the information exchanged between "privileged persons" for the purpose of obtaining legal advice. Privileged persons include [33]: (1) the client (an individual or an organization), (2) the client's attorney, (3) communicating representatives of either the client or the attorney, and (4) other representatives of the attorney who may assist the attorney in providing legal advice to the client. However, privilege does not arise simply because privileged persons communicate; it can only be claimed when the content of the communication merits the claim.

Our intuition is that an email message sent or received by a person (e.g., Person3) has a higher probability of being involved in privileged communication if that person frequently communicates with other people (Person5, Person6, etc.) who themselves have a higher probability of being involved in a privileged communication. Figure 4.1 illustrates this idea. As shown in the figure, the node Person3 in the example email network has multiple privileged (P) email exchanges with the node Person5, which in turn has privileged email exchanges with Person6. The privilege propensity of node Person3 depends not only on the emails sent or received by Person3, but also on the email traffic of all the nodes Person3 communicates with.

Figure 4.1: Our depiction of a Privileged Communication Network (P ⇒ Privileged, NP ⇒ Not-Privileged)

Thus we define "propensity" as a measure of the degree to which we expect a person to engage in privileged communication. It is a number between 0 (low propensity) and 1 (high propensity).
While there has also been some work on the design and evaluation of automated classifiers to actually perform the privilege review task [29, 35, 74], there is a widely held belief among attorneys that reliance on a fully automated classifier for privilege review would incur an undesirable level of uncharacterized risk. Thus automated classifiers are more often used for consistency checking on the results of a manual privilege review process than as the principal basis for that review. In this chapter, we explore a second possible use of the technology: the use of automated annotations to (hopefully) improve the accuracy or reduce the cost of a manual review process.

4.1.2 Document Collection

For our study, we need a set of documents that we know to be relevant to some request that we might typically see in e-discovery. To train our annotators, we also need a set of similar documents that we know to be privileged. We thus need a test collection that contains some relevance and some privilege judgments. One such collection, which we used in this chapter, was produced during the TREC Legal Track in 2010.

In the 2010 TREC Legal Track's "Interactive task",5 one task (Topic 303) was to find "all documents or communications that describe, discuss, refer to, report on, or relate to activities, plans or efforts (whether past, present or future) aimed, intended or directed at lobbying public or other officials regarding any actual, pending, anticipated, possible or potential legislation, including but not limited to, activities aimed, intended or directed at influencing or affecting any actual, pending, anticipated, possible or potential rule, regulation, standard, policy, law or amendment thereto." [29] The collection to be searched was version 2 of the EDRM Enron Email Collection, which includes both messages and attachments. The items to be retrieved were "document families," where (following typical practice in e-discovery) a family was defined as an email message together with all of its attachments. Five teams contributed a total of six interactive runs for Topic 303, with each run being a binary assignment of all families as relevant or not relevant. A stratified sample of families was drawn from submitted runs, and 1,090 of those families were judged to be relevant [29]. We have drawn a random sample of 200 of those relevant families for use in our study. Our automated annotation pipeline failed on 12 of those 200 families that lacked a critical field (From, To, or Date), so we removed those 12 families from consideration and randomly split the remaining families into two disjoint sets of 94 families each, which we refer to as D1 and D2. We consistently use set D2 with our Baseline system and set D1 with the treatment6 system.

In the 2010 TREC Legal Track's Interactive task, a second task (called "Topic 304") was to find "all documents or communications that are subject to a claim of attorney-client privilege, work-product, or any other applicable privilege or protection" [29]. Two teams submitted a total of five runs, with each run being a binary assignment of every family as Privileged or Not Privileged. A stratified sample of 6,736 families was marked as privileged or not privileged by experienced reviewers,7 and prior work has shown that these annotations can be used to train a privilege classifier with reasonable levels of accuracy [74].
A total of seven families from this random sample were, by chance, also present in either D1 or D2, and we removed the five that had been judged as Privileged from the set that we used for training our numerical annotators.8 As Table 4.1 indicates, this resulted in a total of 932 families annotated as Privileged and 5,799 families annotated as Not Privileged that could be used for training our automated annotators. Most of the judgments are from junior annotators. We refer to this set as NAS, as described in Chapter 3. The smaller set of training documents is from AS-TA. We first study which of the two training sets of judgments gives the best coverage.

5A task in which participants design both a system and an interactive process for using that system
6The treatment system uses an interface that has system-generated features highlighted during review.
713 of these 6,736 had actually been marked as Unjudged, but during our experiments those 13 were treated as Not Privileged. The effect of this is negligible.

Table 4.1: TREC 2010 privilege judgments (for training and review)

                    Training    D1    D2
  Privileged           932       2     3
  Not Privileged     5,799       1     1

4.2 The AID System

Our web-based system, which we name "AID" (which stands for Avoiding Inadvertent Disclosure), is a research prototype designed to help explore the design space for providing automated assistance to users during privilege review. In this section, we first describe the design of the five types of automated annotators that we have built. We next explain the user interface and interaction design of our AID system.

4.2.1 Propensity Annotation

We define the propensity of a person to engage in privileged communication as a number between 0 (indicating low propensity) and 1 (indicating high propensity). We utilize the expert or non-expert labels to indicate whether each family (represented as an edge) connecting the persons in the multi-graph is privileged or not-privileged. Given the labels of the edges, the task is to assign a score to the nodes that depends on the edge labels. We start by computing a privilege weight value that is associated with each edge in the graph as a prior, using the network information from labeled families in the train set. We then use the idea behind the weighted PageRank technique to score the propensity for each person [84].

8Because of presentation order, neither of those Not Privileged documents was seen by any participant in the user study that we describe in this chapter.

Algorithm 1 Missing Person Score Algorithm
Input: Graph_test, PRscore dictionary d (computed on A_train), uniqueNodes_test
 1: procedure GetMissingNodeScores
 2:   rankDictionary <- sort(PRscore dictionary d computed on A_train)
 3:   uniqueNodeScoreDict <- NULL
 4:   for each sender s in Graph_test do
 5:     if s in uniqueNodes_test then
 6:       sum <- 0
 7:       for each recipient r of s: if r in rankDictionary.keys() then
 8:         sum <- sum + d[r]
 9:       else
10:         sum <- sum + (min(d.values()) + max(d.values())) / length(d)
11:       end if
12:     end if
13:     score_s <- sum / num(recipients)
14:     uniqueNodeScoreDict[s] <- score_s
15:   end for
16:   return uniqueNodeScoreDict
17: end procedure

Figure 4.2: Missing Person Score Algorithm

We define w[Edge(x, y)] as the edge weight between x and y, given the label of each communication edge, as:

  w[\mathrm{Edge}(x, y)] = \frac{n(x, y)_{e_p}}{n(x, y)_{e_p} + n(x, y)_{e_{np}}}, \quad e \in E_{\mathrm{train}}    (4.1)

where E_train is the set of labeled edges in the training set, and n(x, y)_{e_p} and n(x, y)_{e_np} are the numbers of edges labeled as privileged and not-privileged, respectively, with x as sender and y as the recipient.
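The edge-weight computation in Equation 4.1 reduces to counting labeled sender-recipient edges. The short Python sketch below illustrates one possible implementation; the input representation (a flat list of (sender, recipient, label) tuples extracted from labeled training families) and the function name edge_weights are illustrative assumptions, not the exact data structures of our pipeline.

from collections import defaultdict

def edge_weights(labeled_edges):
    """Compute w[Edge(x, y)] as in Equation 4.1 (sketch).

    labeled_edges: iterable of (sender, recipient, label) tuples, one per
    sender/recipient pair in each labeled training family, where label is
    'P' (privileged) or 'NP' (not privileged).
    """
    priv = defaultdict(int)      # n(x, y)_{e_p}
    not_priv = defaultdict(int)  # n(x, y)_{e_np}
    for x, y, label in labeled_edges:
        if label == 'P':
            priv[(x, y)] += 1
        else:
            not_priv[(x, y)] += 1
    weights = {}
    for pair in set(priv) | set(not_priv):
        total = priv[pair] + not_priv[pair]
        weights[pair] = priv[pair] / total   # fraction of privileged edges
    return weights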
The weight of Edge(x, y) indicates the privilege probability between the two people. To score the individual nodes, we use these weighted edges in the graph as an input to a power iteration algorithm to obtain the "propensity score," or PRscore, for each person using:

  \mathrm{PRscore}(x) = (1 - d) + d \sum_{v \in E_x} \frac{\mathrm{PRscore}(v)}{N_v}    (4.2)

where d = 0.85 is the damping factor,9 E_x is the set of edges where x is the recipient, and N_v is the total number of edges where v is the sender. Given the PRscore of each person seen in the labeled training-set families, the final step of our person scoring algorithm is to calculate the PRscore of each person seen in the test set. Only 32% of the senders or recipients of unannotated emails have a PRscore greater than zero when trained on labeled training set NAS, and 30% when trained on labeled training set AS-TA. The other 56% are not present even once in either training set. To estimate propensity for people who are not present in the training set, we leverage each unknown person's egocentric communication network, ultimately increasing the number of people to whom we can assign a propensity score to 94% of senders and recipients in the test set when trained on documents from NAS (93% when trained on documents from AS-TA). Figure 4.3 shows an example family where none of the 6 persons are seen in the training set; however, our missing person algorithm scores 3 of the 6 (shown in bold font).

To calculate the propensity score for each person in the test set, our algorithm follows two steps:

Common Person Scoring: We obtain the set Common_a of common persons (persons seen in both the train and test sets). For each person i in the test set, if i ∈ Common_a then we use PRscore(i).

Missing Person Scoring: For each person i in the test set, if i ∉ Common_a we take the approach described in Algorithm 1 (Figure 4.2). For each person in the test graph who is not seen in the train graph, we exploit the person's network information. If the missing sender is connected to one or more recipients who are seen in the train graph, we assign the average of the recipients' node scores as the missing sender's score. However, if the sending person is connected only to missing recipients, we assign the sender the average of all PRscore values in the train graph. We take this conservative approach to scoring missing persons because we do not want to mislead the reviewer by providing a zero propensity score when we are actually simply unsure about the propensity of a person.

9We fix the value of d to 0.85 to assign a 15% privilege likelihood to persons with no prior labeled privileged communication.
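To make the scoring procedure concrete, the sketch below shows a simplified version of the propensity computation: a damped power iteration in the spirit of Equation 4.2 over the training graph, followed by the two-step common/missing person scoring described above. The function names (propensity_scores, score_test_person) are illustrative, the edge weights of Equation 4.1 are omitted here for brevity (in our system they modulate each edge's contribution), and the missing-person fallback follows the prose description rather than reproducing Algorithm 1 exactly.

def propensity_scores(train_edges, d=0.85, iters=50):
    """Damped power iteration in the spirit of Eq. 4.2 (simplified sketch).

    train_edges: list of (sender, recipient) pairs from labeled training families.
    Returns a dict mapping each person to an (unnormalized) propensity score.
    """
    nodes = {p for e in train_edges for p in e}
    out_degree = {p: 0 for p in nodes}
    incoming = {p: [] for p in nodes}          # senders pointing at each recipient
    for v, x in train_edges:
        out_degree[v] += 1
        incoming[x].append(v)
    score = {p: 1.0 for p in nodes}
    for _ in range(iters):
        score = {x: (1 - d) + d * sum(score[v] / out_degree[v] for v in incoming[x])
                 for x in nodes}
    return score

def score_test_person(person, recipients, train_scores):
    """Two-step scoring for a person in the test set (sketch).

    Common person: use the training-graph score directly.
    Missing person: average the scores of known recipients; otherwise fall
    back to an average over all training scores, per the description above.
    """
    if person in train_scores:
        return train_scores[person]
    known = [train_scores[r] for r in recipients if r in train_scores]
    if known:
        return sum(known) / len(known)
    vals = list(train_scores.values())
    return sum(vals) / len(vals)   # collection-level fallback for fully unknown people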
Figure 4.3: Privileged Email (example family with person annotations)

4.2.2 Person Role Annotation

Propensity annotation is intended to help call a user's attention to a specific person, but actually knowing how to interpret the importance of that person requires additional information. Professional reviewers would typically have information about the roles of specific people (e.g., they might know who the attorneys and the senior executives are), and in complex cases such lists could be quite extensive. The speed, and perhaps the accuracy, of the review process might be enhanced if we could embed that information in the review system. For this purpose, we need a role annotator that can associate each email address with some (generic or specific) version of that person's job title. For our experiments we therefore built a simple role annotator using table lookup. We manually populated this table for 160 of the 1,611 unique email addresses that appear in at least one of the 188 families in either of our two test sets. We obtained these roles from the MySQL database released by Shetty and Abidi [66], from ground truth produced for evaluating the Author-Recipient-Topic model of McCallum et al. [50], from other lists found on the Web,10 from manual examination of automatically inserted signature blocks in email messages throughout the collection, from public profiles such as LinkedIn, and through manual Web searches. The roles were manually edited for consistency and conciseness.

10http://cis.jhu.edu/~parky/Enron/employees, http://www.desdemonadespair.net/2010/09/bushenron-chronology.html

4.2.3 Organization Type Annotation

When the role of a specific person is not known, reviewers might benefit from knowing the type of the organization for which that person works. We therefore used the same lookup table to annotate the organization in such cases. We did this by manually examining the domain name of an email address and then using a current domain name registry, a Web search, or our personal knowledge to label the organization's type, when possible. For example, some messages in the Enron collection are from addresses with the domain 'brobeck.com', and Wikipedia indicates that (at the time) Brobeck, Phleger & Harrison was a law firm.

4.2.4 Content Analysis

Term unigrams have been reported to be a useful feature set for privilege classification [71], so it is natural to also consider annotating terms. The families in our collection contain many more terms than email addresses, so some approach to feature selection is needed if we are to avoid the display clutter that would result from annotating every term. We perform this feature selection by computing the entropy difference for each term. The entropy difference score identifies words that are like words in the Privileged set and also unlike words in the Not Privileged set [53]. To do this, we first tokenize the email message subject field, the email message body, and the extracted text from each attachment for each family in the training set and in the test set. We then build two unigram language models on these terms (i.e., the unstemmed tokens), one for the 932 families in the training set that were labeled as Privileged, and the other for the 5,799 families in the training set that were labeled as Not Privileged. We then rank each term w present in either of the test-set families using the entropy difference:

  \mathrm{score}(w) = H_p(w) - H_{np}(w)
We then rank each term w present in either of the test-set families using the entropy difference: score(w) = Hp(w)−Hnp(w) 58 agreement termination credit provisionmastercounterparty ENA changes tax draft issue payment commentsterms formseller legalcourt without marklitigation provided collateral agreements tradingupon language respect event agree transactions delivery amounts use otherwise claim swap gtc obligationsthirdcouldclaims kaynotice guaranty date damages saratransaction nonlaw price memo documents letter whether pursuant paragraph defaulting amount provisions counselsupport subject parties contractbuyer ISDA casearbitration entities section Figure 4.4: Indicative terms where Hp(w) and Hnp(w) respectively represent the entropy of the token w in the Priv- ileged and the Not Privileged language models [11]. Negative Entropy difference scores indicate terms that are indicative of privilege. Figure 4.4 shows the Indicative terms where larger the font size; higher the negative Entropy Difference. We used the top 10% unique terms with a high negative entropy difference value. Out the the top 350 terms, we annotate 117 terms with the highest negative entropy difference as strongly indicative of privilege, the middle set of 117 terms as moderately indicative of privilege, and the remaining 116 terms as somewhat indicative of privilege. 4.2.5 Temporal Likelihood Email communications that focus on the lawsuits often occur during specific time intervals, so it seems reasonable to expect that privileged communication regarding those events might exhibit some predictable temporal variation. We therefore also built an annotator for dates that estimates the likelihood of privileged communication on (or near) 59 that date. To do that, we parse the date field of the email that heads each family in the training set. We then use maximum likelihood estimation with Laplace smoothing to estimate the probability that a family sampled from the set of training families sent on a specific date would be privileged. We calculate that probability estimate as: npd + 1 P (d |nx) = ii d (4.3)npd + n np d + 2i i where di is the date of the message, npd and n np d are the total number of Privileged and Not-i i Privileged families sent on di respectively. Because TREC performed stratified sampling, designed to oversample potentially privileged families, we expect this to be a substantial overestimate of the actual probability. Nonetheless, we would expect relative values of the estimate to be informative. 4.2.6 User Interface Our research prototype is designed to help explore the design space for providing automated assistance to users during privilege review. We use the design of the five types of automated annotators that we have built. We then explain the interface and interaction design of our review system. We characterize the coverage of each of our automated annotators as the fraction of the unique items (people, terms or dates) in the 61 families for which annotations are available. Figure 4.5 shows a screenshot for our AID system. Documents are presented to every user in the same order, and the user must record a judgment (Privileged, Not Privileged, or No Decision) before being shown the next document. They could return to any previously judged document to change their judgment if they chose to do so. Annotations are provided as visual scaffolds during the privilege review process. 
Whenever a person role or organization type annotation is available, the associated email address is displayed with a red background, and the role or type annotation can be displayed in a manner similar to a "tool tip" (using a graphical control element that is activated when the user hovers the mouse over the shaded area). We shade the background with variations of the color red to indicate the propensity category (darker red for strong propensity, lighter red for moderate propensity, very light red for all other cases in which role or type information is available).11 On average (across the 61 viewed families), 58% of the email addresses appearing as senders or recipients had a role or a type annotation available (55% for person role, 3% for organization type). About two-thirds of the cases in which a role or type annotation was available were displayed with shading indicating strong or moderate propensity.

11Low-propensity addresses for which no role or organization type information is available have no background shading.

Figure 4.5: The AID system.
Figure 4.6: The Baseline system.
Figure 4.7: User study procedure (training and review with the AID system on set D1 and the Baseline system on set D2, with QUIS questionnaires and a semi-structured interview).

The display of terms that are indicative of privilege in the subject line, email message body, or attachments follows a similar pattern, but by altering the color of the typeface rather than the background. For example, the term "credit" is rendered in the darkest shade of red12 in Figure 4.5, indicating that it was strongly indicative of privilege. On average (across the 61 viewed families), 2% of all term occurrences are highlighted. Temporal likelihood is plotted as a connected line plot, with date as the horizontal axis and temporal likelihood as the vertical axis. This has the effect of visually performing linear interpolation of temporal likelihood for dates on which that likelihood cannot be computed directly. The displayed date range can be reduced (via click-and-drag zoom-in functionality) by the user for finer-grained display.

12We chose to use the same color gradations for terms and email addresses to simplify training, but the question of optimal color choices merits further investigation.

Figure 4.6 shows the user interface of our Baseline system. As can be seen, the only differences from the AID system are that none of the annotations are present and that the omission of the temporal likelihood plot permits more of the content to be displayed. Both systems log the time, family ID, user ID and judgment (Privileged, Not Privileged, or No Decision) for each reviewed family.

The principal goal of our user study was to determine whether any of our system-generated annotators could help the users to perform the review task more quickly, more accurately, or both. A secondary goal was to determine whether there were usability issues with our current interface design that might adversely affect our ability to determine the effects of specific annotators. A third goal was to use our current AID system design as an artifact around which we could discuss specific as-yet unimplemented capabilities that experts might believe would provide useful support for the task.
4.2.7 Study Participants and Procedure

We were able to recruit a total of six participants from the first two groups, which we judged to be adequate for the comparisons we wished to make, so we limited our study to those six participants. Two of the six were senior attorneys employed by law firms with a current e-discovery practice. These senior attorneys are experienced litigators who have extensive experience conducting both relevance and privilege review for email using commercial Technology Assisted Review (TAR) tools.13 We refer to these senior attorneys as S1 and S2. The remaining four participants were law school graduates. Two of the four had prior experience conducting relevance and privilege review using commercial TAR tools, but neither was currently working in an e-discovery practice; one of the two is a graduate student in another discipline, the other is an intellectual property attorney. We refer to this pair of experienced reviewers as E1 and E2. By coincidence, E2 had experience working as a reviewer during the original Enron litigation. The remaining two participants had experience conducting e-discovery reviews some time ago, principally on paper, but neither had experience using current TAR tools. One was a retired attorney, the other was currently a faculty member in another discipline. We refer to these (TAR-)inexperienced reviewers as I1 and I2. I2 had little direct experience using computers.

Figure 4.7 summarizes the study procedure for one of the six single-participant sessions.14 Each participant completed the study in about two hours, with a 10-minute break at the end of the first hour. Participants were given an overview of the review task and were asked to read a written description of the study that we provided before signing a consent form. Each participant then received a 5-minute tutorial on the first system they would use, presented by the investigator, in which the different parts of the system were demonstrated using a few example families.

13Tools like Recommind, Nuix, kCura, etc.
14Refer to Appendix A for details about the IRB approval.

Figure 4.8: S1 and S2 judgments by type (Privileged / Not-Privileged / No-Decision): S1-Baseline 17/53, 36/53, 0/53; S1-AID 9/56, 46/56, 1/56; S2-Baseline 12/71, 47/71, 12/71; S2-AID 15/61, 36/61, 10/61

Table 4.2: Contingency table for review of the same families by S1 and S2

                       S1: Privileged   S1: Not Privileged   S1: No Decision   S1: Not Seen
  S2: Privileged             15                  7                  0                5
  S2: Not Privileged          5                 62                  0               16
  S2: No Decision             6                 12                  1                3
  S2: Not Seen                0                  1                  0               75

†One family was, by chance, skipped in the review sequence by S2 but not by S1.

4.3 Results

In this section we first focus on quantitative results for accuracy and speed. Following that, we contextualize these results with qualitative results from our interviews and from our usability questionnaire. We then draw insights from each of these analyses to discuss what we see as the most important conclusions that can be drawn from this study.

4.3.1 Selecting a Benchmark for Evaluation

If we are to make any useful statements about the accuracy of a privilege review, we must first select an informative set of judgments as a benchmark against which accuracy can be measured. These benchmark judgments need not be perfect for the resulting measures to be informative, but we will have the greatest confidence in our results if we select the best available benchmark judgments.
Thus it is natural to begin by characterizing the results from the two senior attorneys, since we would expect their judgments to be natural candidates as a benchmark. Figure 4.8 shows the number of judgments of each type made by S1 and S2 for each of the two conditions. As can be seen, S2 is somewhat faster than S1 (making 33% more judgments in the same 30 minutes in the Baseline condition, and 9% more in the AID condition). S2 records many more No Decision judgments (22 for S2 vs. 1 for S1).15 As Table 4.2 shows, senior attorney S1 marked a total of 15+5+6=26 families as Privileged, while S2 marked a total of 15+7+5=27 families as Privileged. Among the families seen by both senior attorneys (using either system), 15 families were marked as Privileged by both. Computing chance-corrected inter-annotator agreement between S1 and S2 using Cohen's Kappa (κ) yields 0.68, a value that Landis and Koch [43] characterize as "substantial." Indeed, given the class prevalence in our test sets, chance agreement would be 0.57, making very high levels of κ difficult to achieve [10].

15Participants mark a family as No Decision when a clear distinction between Privileged and Not Privileged could not be made on the email message or any of its attachments.

TREC 2010 Interactive Task Topic 304 privilege judgments are available for seven of the families in our test set. Of those seven, 5 were Privileged and 2 were Not Privileged. Of the 5, three families were adjudicated by the Topic Authority (a senior attorney whose judgments were authoritative) who was responsible for providing guidance and adjudicating disputes. Of the three Privileged families adjudicated by the TREC Topic Authority, two were reviewed by both S1 and S2. S1 agreed with the Topic Authority on one of the two families, marking one as Privileged and the other as Not Privileged. S2 agreed with the Topic Authority on neither: S2 marked one of the two families as Not Privileged (the same family that S1 marked as Not Privileged) and the other as No Decision. A comparison based on only two judgments is not sufficient to determine whether the two senior attorneys in our user study are (1) generally more inclined to judge documents as Not Privileged than the TREC Topic Authority would have been and (2) generally inclined to agree with each other, but we can say that there is no evidence to refute such a claim.

From this analysis, either senior attorney could reasonably be chosen as a benchmark against which the other participants' judgments could be measured for accuracy. However, because S2 left 19 families unjudged and skipped reviewing one family in the review sequence, and all 24 of the families that were not seen by S1 were late in the review sequence, a larger number of useful judgments are available from S1. We therefore use the judgments from S1 as the benchmark for evaluation. We evaluate participants on the basis of the precision and recall estimates that we report in Figure 4.9.

4.3.2 Accuracy

Figure 4.9 shows the privilege review effectiveness of S2, E1, E2, and I1 for the Baseline and AID conditions, evaluated as if the judgments by S1 were the ground truth. We calculate point estimates for precision and recall using only the cases judged as Privileged or Not Privileged both by S1 and by the participant whose decisions are being evaluated (i.e., we omit No Decision and Not Seen cases from both).
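The filtering convention just described (keeping only families that both S1 and the evaluated participant judged Privileged or Not Privileged) is easy to get wrong, so a small sketch may help. The function and variable names below are illustrative, and the judgment encoding is an assumption for exposition; the confidence intervals discussed next are not computed here.

def precision_recall(benchmark, participant):
    """Point estimates of precision/recall against the benchmark reviewer (sketch).

    benchmark, participant: dicts mapping family ID to 'P' (Privileged),
    'NP' (Not Privileged), or 'ND' (No Decision); families a reviewer never
    saw are simply absent. Only families judged P or NP by *both* reviewers
    are counted, mirroring the evaluation described above.
    """
    tp = fp = fn = 0
    for fam, truth in benchmark.items():
        pred = participant.get(fam)
        if truth not in ('P', 'NP') or pred not in ('P', 'NP'):
            continue  # omit No Decision and Not Seen cases from both sides
        if pred == 'P' and truth == 'P':
            tp += 1
        elif pred == 'P' and truth == 'NP':
            fp += 1
        elif pred == 'NP' and truth == 'P':
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall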
Because we are comparing estimates for different sets of documents, we also show the 95% confidence intervals for recall and for precision, computed using the standard approximation method described by Agresti et al. [9]. Results for I2 are not shown because, after removal of the 21 No Decision judgments recorded by I2, there were only 7 families judged by I2 (3 in the AID condition, 4 in the Baseline condition), a number insufficient for useful estimation of intervals.16

Figure 4.9: Evaluation with S1 judgments as benchmark (precision vs. recall, with 95% confidence intervals, for E1, E2, I1, and S2 in the Baseline and AID conditions)

From Figure 4.9 we can conclude that there is a consistent and statistically significant improvement in recall when the review task is performed using our AID system for all four participants (S2, E1, E2, I1).17 This improvement is, however, accompanied by a statistically significant reduction in precision for three of the four participants. Using S2 instead as the reference to evaluate S1, E1, E2, and I1 (not shown) yields similar results, with statistically significant improvements in recall in 1 of 4 cases and statistically significant decreases in precision in 2 of 4 cases. Since the principal goal of our AID system is to avoid inadvertent disclosures, this consistent bias in favor of recall (i.e., in avoiding false negatives), regardless of which senior attorney we select as a reference, is well in line with that goal.

16All 7 were judged as Privileged, suggesting that participant I2 may have intended to record judgments of Not Privileged and instead incorrectly selected No Decision. It was participant I2 who had only limited personal experience using computers.
17We consider a difference to be statistically significant if each point estimate lies outside the 95% confidence interval for the other condition.

4.3.3 Speed

To characterize the effect of the choice of system on review speed, we computed the number of families reviewed by each participant in 30 minutes using the Baseline and the AID systems, observing little difference in the means (averaging 40.1 families for the AID condition and 43.6 families for the Baseline condition).18 A paired t-test found no detectable difference in average review speed across the two conditions (p > 0.38). From these results we conclude that there is no indication that our AID system results in faster review, and indeed it is possible that our AID system might result in marginally slower review.

18Data from participant I2 is omitted from this analysis.

4.3.4 Usability

Table 4.3 summarizes participant responses to six of the seven QUIS questions (a seventh question, about layout, evoked no useful differences in the responses).

Table 4.3: QUIS Summary

                            S1          S2          E1          E2          I1          I2
                         BL   AID    BL   AID    BL    AID   BL   AID    BL   AID    BL   AID
  Review experience      NF   Good   Bad  Good   Good  Great Good Good   Bad  Good   Fair Good
  Info was adequate      D    A      SD   MD     A     SA    A    SA     SD   MD     SD   A
  People info was useful      SA          A            A          SA          SA          NF
  Term info was useful        SA          NF           MA         NF          NF          SD
  Date graph was useful       NF          D            D          MA          D           A
  Logical use of colors       A           A            SA         NF          A           A

†SA=Strongly Agree, A=Agree, MA=Moderately Agree, SD=Strongly Disagree, D=Disagree, MD=Moderately Disagree, NF=Neutral Feedback; blank indicates not applicable; BL=Baseline.
Five of the six participants assigned a higher rating to the overall experience with the AID system than with the Baseline system (and the sixth participant noted no difference). All six participants gave more positive scores to the AID system than to the Baseline system in response to the question about the adequacy of the displayed information. Person highlighting was reported to be useful (to at least some degree) by five of the six participants, whereas term highlighting and the date graph were each reported to be useful to some degree by only two of the six participants.

4.3.5 Usefulness

During the semi-structured interview session, we asked each participant which type of system-generated annotation they found to be most useful; five of the six named person annotation. The following excerpts are representative of responses that participants gave to our open-ended questions.

"I think having the role or type information in-line on the user interface was very helpful. All I had to do was to hover over the name instead of looking it up on a sheet of paper as we normally do." — S1

"I would honestly like the people highlighting concept much more if it would give me more information about the meta-data. Having information about the domain addresses of people who are not Enron employees is one such information." — S2

"The presence of highlighted people made me look into the documents more carefully in non obvious cases for the presence of potentially privilege content. It help me to make a filtering decision about which document need more attention. The highlighting helped me to be quicker." — E1

"I think the trickiest part was to review the document when the information about a sub-set of the people was missing. For example, if there were 6 people and we have information about 3 of them but not the other 3, it is hard to predict who the other players are." — E2

"I think the highlighting of the people was useful to do the review; the highlighting of the terms were less useful because almost all emails contain the same boilerplate language and the term highlights did not provide much information; and about the dates, I did not feel the need to use the date information displayed on the graph." — I1

"The ideas presented in the AID system are good, however the information provided was sometimes confusing to me. The role and type information provided was useful but the term highlighting was distracting; mainly because the highlighted terms did not make sense to determine privilege and I lost my faith on the terms." — I2

4.4 Summary

Our quantitative results clearly indicate that our AID system resulted in a greater ability to detect privileged documents. The QUIS responses and our semi-structured interviews provide consistent support for our belief that our annotation of people (or, more specifically, of email addresses) is principally responsible for this improvement. Of the three ways we annotate people (for propensity, for person role, or for organization type), we have the strongest evidence for a claim that role and organization type annotation was believed by our participants to be useful; we do not have sufficient evidence to separately identify the effect of propensity annotation. Neither our present implementation of term highlighting nor the date graph was often commented on favorably by the participants.
From these observations, we conclude that our current AID system achieves its principal objective of helping to avoid inadvertent disclosure, that further study is needed to separately analyze the value of propensity annotation, that the value of term annotation has not yet been shown, and that further refinement of our approach to date annotation will not be among our highest near-term priorities. We base this last conclusion in part on the following comment by S1, who said "Date information could be helpful during responsiveness review. But for privilege review, it is less likely to be useful".

We were somewhat surprised by the magnitude and consistency of the drop in precision that accompanied the increases in recall that we observed from the use of our AID system. In privilege review, low precision could result in incorrectly withholding some families that should properly have been turned over to the requesting party. Perhaps such cases might be discovered and corrected in a second stage of privilege review, but a two-stage review process would naturally lead to higher costs. Future work aimed at understanding the reason for the reduction in precision will thus be a high priority. Moreover, trade-offs between recall and precision are natural, so it may be that similar results might be obtained in other ways (e.g., by providing financial incentives based on the number of privileged documents found). In future work it will therefore be important to develop task-tuned utility measures that account for the relative importance of recall and precision for the privilege review task, and to develop study designs in which recall at comparable levels of precision can be studied.

Our participants made some suggestions for improvements that might be made to our AID system. One useful suggestion was to consider highlighting multi-word expressions that are indicative of privilege, rather than only single terms as our present system does. Another useful suggestion was to consider augmenting our role annotations with an opportunity to drill down to learn more (e.g., the date assigned to that role, previous roles, or supervisory relationships). In future work we are interested in exploring the potential for viewing privilege review as a structured collaboration task, and when we asked about this, several of our participants (three of the five whom we asked) indicated that system support for collaboration might be of interest for privilege review.

Chapter 5: Predictive Coding With Manual Review

The key approach we discuss in this chapter is the adoption of predictive coding techniques to categorize each document in a collection as privileged or not, and to prioritize the documents based on expected risk before manual review. The party performing the review before production may incur costs of two types, namely annotation costs (deriving from the fact that human reviewers need to be paid for their work) and misclassification costs (deriving from the fact that failing to correctly determine the responsiveness or privilege of a document may adversely affect the interests of the parties in various ways). Relying exclusively on results from the predictive coding model would minimize manual annotation costs but could result in substantial misclassification costs, while relying exclusively on manual review could generate the opposite consequences. The principal focus of the work presented in this chapter is therefore on developing a semi-automated process.
The goal of the semi-automated system is to develop an efficient way of automatically ranking documents based on classifier decisions1 and of partially reviewing those ranked documents manually, so as to minimize the overall cost of the e-discovery process. Our approach is based on the realistic intuition that automation is imperfect. Thus attorneys will often perform partial or complete manual review depending on the classifier's results. If the manual review of a sample of the classifier's output reveals an unacceptably high error rate, then additional manual review would be needed. Additional training data might yield improved classifier accuracy, but ultimately some limit will be reached beyond which an alternative strategy is needed. If the best error rate that the automatic classifier can achieve remains worse than what human reviewers can achieve, then additional manual review can further decrease the overall error rate. This approach works because in e-discovery we are ultimately classifying some finite population of documents, and it is thus the accuracy of the classification decisions, and not of the classifier itself, that we care about. The main contributions of this piece of our work are (1) to quantify classifier errors as cost values, (2) to derive a cost function as the basis of our evaluation, and (3) to determine when and to what extent it is rational to adopt automation in this human-in-the-loop application domain. This chapter answers research questions RQ5a and RQ5b introduced in Chapter 1, Section 1.2.

1The work discussed in this chapter is currently under review and was done in collaboration with Douglas W. Oard and Fabrizio Sebastiani: Minimizing the Expected Costs of Review for Responsiveness and Privilege in E-Discovery [54].

5.1 Problem Design

We model our algorithm based on the assumption that all relevant costs can be quantified. These costs are of two types, namely annotation costs, resulting from the wages paid to human reviewers for their time and work, and misclassification costs. Misclassification costs result from the fact that failing to correctly determine the responsiveness or privilege of a document results in incorrect decisions, which have consequences that we model as costs. The notion of risk arises naturally in a cost-sensitive classification context because multiple outcomes are possible, and each outcome has its own cost (e.g., incurring a sanction for having entered on the privilege log a document that should instead have been produced). Minimizing this risk requires avoiding outcomes for which the combination of probability of occurrence and cost is high. Here, the notion of "risk" R(d) is the converse of the notion of utility: one usually speaks of "risk" when each of the possible events has an associated cost (i.e., the amount at risk due to an undesired consequence), whereas one usually speaks of "utility" when each possible event has an associated gain (i.e., a desired consequence). In any case, the two notions are interchangeable; we prefer speaking of "risk" here since the entire process involves costs, and not gains, for the producing party, and it is the expectation over these costs that we want to minimize.

Thus, we formalize the e-discovery process as a risk minimization framework (called MINECORE, for "MINimizing the Expected COsts of REview") that seeks to strike an optimal balance between the annotation and the misclassification costs.
MINECORE is defined as a semi-automated system whose goal is to identify the documents that need to be produced (responsive and nonprivileged documents) in response to an e-discovery request; documents that are responsive and privileged should be put on a privilege log, and nonresponsive documents should be withheld. In other words, we model the problem as that of generating a classifier h : D → C, where C = {c_P, c_L, c_W} is a set of three target classes:

• c_P is the class of the responsive nonprivileged documents, which should be Produced to the requesting party;
• c_L is the class of the responsive privileged documents, which should be entered on the privilege Log;
• c_W is the class of the nonresponsive documents, which should be Withheld by the producing party.

Since different classification errors bring about different costs, the problem defined above is quite sensitive to the values of the misclassification costs. For instance, producing a document that should have been on the privilege log typically brings about a higher cost than producing a document that should instead have been withheld. Hence we assume the existence of a cost matrix Λ^m = {λ^m_ij} (for i, j ∈ {P, L, W}) as an input to our algorithm. The structure of the cost matrix is illustrated in Table 5.1(b), where each entry λ^m_ij is a nonnegative unit cost representing the cost incurred when misclassifying an element of c_j into c_i (the m superscript stands for "misclassification"). In the next few sections, we explain the six baseline methods in detail and compare their performance against our MINECORE algorithm.

Table 5.1: Contingency table D (a) and cost matrix Λ^m (b) for our problem (rows: predicted class; columns: actual class).

  (a) Contingency table D               (b) Cost matrix Λ^m
            actual                                actual
         c_P    c_L    c_W                   c_P       c_L       c_W
  c_P   D_PP   D_PL   D_PW          c_P      0       λ^m_PL    λ^m_PW
  c_L   D_LP   D_LL   D_LW          c_L    λ^m_LP      0       λ^m_LW
  c_W   D_WP   D_WL   D_WW          c_W    λ^m_WP    λ^m_WL      0

5.2 Fully Automated baseline model

In the fully automated baseline model, we train two automated classifiers, h_r (a binary classifier for responsiveness) and h_p (a binary classifier for privilege), and we apply them to the collection D. The classifiers are generated independently of each other. In this chapter we make the simplifying assumption that training and running the automated classifiers has zero cost. For each document d ∈ D, h_r and h_p generate two posterior probabilities, Pr(c_r|d) and Pr(c_p|d), which represent the classifiers' confidence that d is responsive and that d is privileged, respectively. For Pr(c_r|d), a value of 1 represents total certainty that d ∈ c_r, a value of 0.5 represents total uncertainty, and a value of 0 represents total certainty that d ∉ c_r; the same holds for Pr(c_p|d). From Pr(c_r|d) and Pr(c_p|d), the posterior probabilities Pr(c_P|d), Pr(c_L|d), Pr(c_W|d) are obtained as

  \Pr(c_P \mid d) \equiv \Pr(c_r \mid d)\,\Pr(\bar{c}_p \mid d)    (5.1)
  \Pr(c_L \mid d) \equiv \Pr(c_r \mid d)\,\Pr(c_p \mid d)    (5.2)
  \Pr(c_W \mid d) \equiv \Pr(\bar{c}_r \mid d)    (5.3)

We next classify each document d into the class with the lowest expected cost using Equation 5.4,

  h(d) = \arg\min_{c_i} R(d, c_i)    (5.4)

where R(d, c_i) (the risk associated with assigning d to class c_i) is defined as

  R(d, c_i) = \sum_{j \in \{P,L,W\}} \lambda^{m}_{ij} \Pr(c_j \mid d)    (5.5)

As a result, the risk brought about by this classification is

  R(D) = \sum_{d \in D} R(d, h(d))    (5.6)

In other words, each document d is assigned the class (c_P or c_L or c_W) that brings about the minimum expected misclassification cost.
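The decision rule in Equations 5.1-5.6 is compact enough to show as code. The following Python sketch is a minimal illustration, assuming the cost matrix is supplied as a nested mapping cost[i][j] (predicted class i, true class j); the cost values shown are placeholders for exposition, not the values used in our experiments.

CLASSES = ('P', 'L', 'W')

def classify_document(pr_r, pr_p, cost):
    """Assign the class with minimum expected misclassification cost.

    pr_r, pr_p: posterior probabilities Pr(c_r|d) and Pr(c_p|d) from the two
    binary classifiers; cost[i][j] is the unit cost of deciding class i when
    the true class is j (zero on the diagonal).
    """
    post = {'P': pr_r * (1 - pr_p),   # responsive and nonprivileged -> Produce
            'L': pr_r * pr_p,         # responsive and privileged    -> Log
            'W': 1 - pr_r}            # nonresponsive                -> Withhold
    risk = {i: sum(cost[i][j] * post[j] for j in CLASSES) for i in CLASSES}
    return min(risk, key=risk.get)    # class with the lowest expected cost

# Illustrative (hypothetical) cost matrix: producing a privileged document is costly.
cost = {'P': {'P': 0, 'L': 100, 'W': 5},
        'L': {'P': 10, 'L': 0, 'W': 10},
        'W': {'P': 20, 'L': 20, 'W': 0}}
print(classify_document(0.9, 0.05, cost))   # -> 'P' under these illustrative costs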
Here the expected misclassification cost is computed as the sum of the misclassification costs of all possible events (i.e., classes to which d might truly belong), each multiplied by the probability of occurrence of the event (which is estimated by the classifier). For measuring misclassification cost, we use

  C^{m}(D) = \sum_{i,j \in \{P,L,W\}} \lambda^{m}_{ij} D_{ij}    (5.7)

where the m superscript stands for "misclassification". Note that C^m(D) is linear, i.e., it can alternatively be written as C^m(D) = \sum_{d \in D} C^m(d), where C^m(d) = λ^m_{h(d) y(d)} is the cost of predicting a document to be in class h(d) when its true class is y(d).

5.3 Fully Manual baseline model

In the Fully Manual baseline model, a reviewer (typically a junior lawyer) annotates all documents in D for responsiveness. All the documents in D that the reviewer deems responsive are forwarded to another reviewer (usually a senior lawyer) who annotates them for privilege, while all the others are withheld. All the documents that this latter reviewer deems nonprivileged are produced to the requesting party, while all the documents that the senior lawyer deems privileged are entered on the privilege log. The two reviewers usually work sequentially, rather than in parallel. This is justified by cost considerations: it is a waste of resources to annotate for privilege a document that has already been ruled out on responsiveness grounds, and the reviewers who deal with responsiveness usually work at cheaper hourly rates than the reviewers who deal with privilege. This suggests having a first pass carried out by the former before the latter intervene. We also assume, for ease of explanation, that there is only one reviewer for responsiveness and only one reviewer for privilege. In real applications there are often several reviewers of each type; however, what we describe straightforwardly applies to the case of more than one reviewer of each type. In this chapter we make the simplifying assumption that our reviewers are perfectly reliable (i.e., they do not make annotation errors); we defer the study of a model that relaxes this assumption to future work.

Let the pair Λ^a = (λ^a_r, λ^a_p) denote the costs of annotating a single document for responsiveness (λ^a_r) and for privilege (λ^a_p), where the a superscript stands for "annotation". As a function for measuring annotation cost (which derives from the intervention of human reviewers) we use

  C^{a}(D) = \lambda^{a}_{r}\,\tau_{r} + \lambda^{a}_{p}\,\tau_{p}    (5.8)

where τ_r and τ_p are the numbers of documents manually annotated for responsiveness and for privilege, respectively. Note that for the fully manual solution, τ_r is the number of documents in D, and τ_p is the number of responsive documents in D. As with the cost matrix Λ^m, we assume the unit costs in Λ^a to be input parameters, since they are not under the control of the experimenter.

5.4 Our MINECORE model

Both the baselines of Sections 5.2 and 5.3 have drawbacks. The fully automatic model has the advantage of zero annotation cost, but bears the drawback of a non-negligible classifier error rate. As a consequence, this model is susceptible to withholding documents that should have been produced and (more dangerously) producing documents that should have been withheld. The costs generated by too many such misclassifications might be severe. On the other hand, the fully manual model has
On the other end, the fully manual model has 76 Posterior probabilities Updated posterior Final posterior probabilities probabilities Training documents Phase 1 Phase 2 Phase 3 Finalclassification decisions Unlabelled documents Predic've  Coding     Relevance  Review   Privilege  Review   System   Provisional Updated provisional classification decisions classification decisions Figure 5.1: MINECORE Framework Overview the advantage of perfect accuracy (assuming manual review is perfect) but is expensive, since the costs involved in manual annotation are high, and is sometimes infeasible, since it might be impossible to manually annotate each document given the time constraints imposed by the lawsuit. Thus, we propose our MINECORE model where we try to strike a balance between the two. Figure 5.1 shows the overall architecture of our semi-automated MINECORE model. The execution of this model can be described in three phases; 1. All the documents in D are first assigned a class in {cP , cL, cW } by an automatic classifier that classifies according to Equation 5.4, following which 2. Junior annotators annotate a subset D′ of the documents in D for responsiveness which may cause some of the documents in D′ to be reassigned a class in {cP , cL, cW } different from the one assigned in Phase 1. 3. In the final phase, senior annotators annotate a subset D′′ of the documents in D for privilege, which may cause some of the documents in D′′ to be reassigned a class in {cP , cL, cW } different from the one assigned in Phase 1 and 2. 77 Of course, the right question here is how to strike an optimal balance, i.e., how to decide which documents should be annotated for responsiveness in Phase 2, and for privilege in Phase 3, and which others should instead be left unchecked. Our solution to arrive at such a balance makes use of • the posterior probabilities Pr(cr|d) and Pr(cp|d) generated by the automated classi- fiers hr and hp; • a cost matrix Λm and a pair Λa of unit annotation costs. From now on, by the term cost structure we indicate a pair Λ = (Λm,Λa), with Λm a cost matrix and Λa a pair (λar , λap) of unit annotation costs. The only constraints we impose on Λ are that (i) all unit misclassification costs in Λm and both unit annotation costs in Λa must be nonnegative; (ii) all λm mii ∈ Λ must be 0; and (iii) it must hold that λar ≤ λap. Thus, the overall cost of the process can be quantified as Co(D) = Cm(D) + Ca(D) (5.9) where the o superscript stands for “overall”, and where Cm(D) and Ca(D) are the costs defined in Equations 5.7 and 5.8. Co(D) is the evaluation function we adopt in this work for all systems we experimentally compare, and not just for MINECORE. Note that for the fully automated solution Co(D) coincides with Cm(D), since for this solution we have assumed the annotation cost to be zero, and for the fully manual solution Co(D) coincides with Ca(D), since for this solution we have assumed the misclassification cost to be zero. 5.4.1 Document Ranking MINECORE consists of an automatic classification phase (Phase 1), followed by two human annotation phases (Phase 2 and Phase 3) in which only the documents whose manual annotation is expected to reduce the overall cost are annotated. For each phase ϕ and for each document d, two posterior probabilities Prϕ(cr|d) and Prϕ(cp|d) are generated. 
Based on these probabilities, a class h_ϕ(d) is assigned in Phase ϕ to each document d as

h_ϕ(d) = \argmin_{c_i} R_ϕ(d, c_i) = \argmin_{c_i} \sum_{j \in \{P,L,W\}} \lambda^m_{ij} \Pr_ϕ(c_j|d)    (5.10)

where c_i ranges on {c_P, c_L, c_W}. Equation 5.10 is just Equation 5.4 where the phase ϕ in which the probabilities are computed and the class is assigned is made explicit.

Figure 5.2: Phase 1 of the MINECORE Framework

In Phase 1 of MINECORE, shown in Figure 5.2, we train two automated classifiers, h_r (which classifies for responsiveness) and h_p (which classifies for privilege), from training data that we assume available, and we apply them to D. As in the fully automated solution described in Section 5.2, the two classifiers generate two posterior probabilities Pr_1(c_r|d) and Pr_1(c_p|d) for each document d ∈ D. The two posterior probabilities represent the classifiers' confidence that d is responsive and that d is privileged, respectively. Using these posterior probabilities, we assign a class h_1(d) ∈ {c_P, c_L, c_W} to each document d ∈ D using Equation 5.10.
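As a concrete illustration of Equation 5.10, the sketch below maps the two binary posteriors onto the three classes and picks the risk-minimizing class. The decomposition Pr(c_P|d) = Pr(c_r|d)(1 − Pr(c_p|d)), Pr(c_L|d) = Pr(c_r|d)Pr(c_p|d), Pr(c_W|d) = 1 − Pr(c_r|d) is one natural way of obtaining the three-class probabilities from the two binary classifiers and is assumed here purely for illustration; the cost values and function names are likewise illustrative.

```python
# A minimal sketch of the risk minimizer of Equation 5.10 (illustrative only).
COST = {  # lambda^m_{ij}: cost of assigning class i when the true class is j
    "P": {"P": 0.0, "L": 600.0, "W": 5.0},
    "L": {"P": 150.0, "L": 0.0, "W": 3.0},
    "W": {"P": 15.0, "L": 15.0, "W": 0.0},
}


def class_posteriors(p_r, p_p):
    """Assumed mapping of Pr(c_r|d), Pr(c_p|d) onto Pr(c_P|d), Pr(c_L|d), Pr(c_W|d)."""
    return {"P": p_r * (1.0 - p_p), "L": p_r * p_p, "W": 1.0 - p_r}


def risk(posteriors, c_i):
    """R(d, c_i) = sum_j lambda^m_{ij} Pr(c_j|d)."""
    return sum(COST[c_i][c_j] * posteriors[c_j] for c_j in ("P", "L", "W"))


def h(p_r, p_p):
    """Equation 5.10: assign the class with the minimum expected misclassification cost."""
    posteriors = class_posteriors(p_r, p_p)
    return min(("P", "L", "W"), key=lambda c_i: risk(posteriors, c_i))


if __name__ == "__main__":
    # A document that is probably responsive and very probably not privileged:
    print(h(p_r=0.9, p_p=0.01))   # "P" under the illustrative costs above
```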
Figure 5.3: Phase 2 of the MINECORE Framework

In Phase 2 of MINECORE, shown in Figure 5.3, the documents in D are ranked, and the reviewer (typically: a junior lawyer) annotates the top-ranked τ_r documents for responsiveness. Annotating d has the effect of eliminating the uncertainty on the responsiveness of d. As a consequence, if d is annotated as responsive we set Pr_2(c_r|d) = 1, while if d is annotated as nonresponsive we set Pr_2(c_r|d) = 0; no annotation for privilege is performed in this phase, so Pr_1(c_p|d) = Pr_2(c_p|d). At this point, by using Equation 5.10, d is assigned a class h_2(d) ∈ {c_P, c_L, c_W}, which is possibly different from h_1(d). The documents d from the (τ_r + 1)-th position onwards are not manually annotated; everything remains unchanged for these documents, i.e., Pr_2(c_r|d) = Pr_1(c_r|d) and Pr_2(c_p|d) = Pr_1(c_p|d), which implies that h_2(d) = h_1(d).

In order to maximize the cost-effectiveness of this approach it is necessary to choose (i) an optimal ranking of the documents in D and (ii) an optimal threshold τ_r (which acts as the stopping condition for the annotation process).

Concerning point (i), similarly to the approach of [16] we adopt the principle that the documents in D are to be ranked in terms of the reduction in overall risk that annotating the document brings about; the documents whose manual annotation brings about the highest reduction are top-ranked. If by C^m_ϕ(d) we indicate the misclassification cost brought about by attributing class h_ϕ(d) to d, the difference Δ^o_r(d) in overall cost that annotating d for responsiveness brings about can be written (using Equation 5.9) as

Δ^o_r(d) = C^o_2(d) − C^o_1(d) = C^m_2(d) + C^a_2(d) − C^m_1(d) − C^a_1(d) = C^m_2(d) + λ^a_r − C^m_1(d)    (5.11)

However, at the time of ranking D the true class of d (noted as y(d)) is not known, so C^m_1(d) and C^m_2(d) are also unknown. Therefore, at the time of ranking D what we can actually compute, instead of Δ^o_r(d), is an expectation of Δ^o_r(d) over the y(d) random variable, i.e.,

E_y[Δ^o_r(d)] = E_y[C^m_2(d) + λ^a_r − C^m_1(d)] = E_y[C^m_2(d)] + λ^a_r − E_y[C^m_1(d)] = R_2(d, h_2(d)) + λ^a_r − R_1(d, h_1(d))    (5.12)

Actually, at the time of ranking D we also do not know the value of the y_r(d) variable (a binary variable that indicates whether, if the reviewer had to annotate d, s/he would deem it responsive or not). This means that the class h_2(d) that would be assigned as a result of annotating d is also not known. R_2(d, h_2(d)) is thus not known either, which means that Equation 5.12 cannot be used directly as a criterion for ranking D. At the time of ranking D we thus must compute an expectation of E_y[Δ^o_r(d)] over the y_r(d) random variable, i.e.,

E_{y_r y}[Δ^o_r(d)] = E_{y_r}[R_2(d, h_2(d)) + λ^a_r − R_1(d, h_1(d))] = E_{y_r}[R_2(d, h_2(d))] + λ^a_r − R_1(d, h_1(d))    (5.13)

where we have shortened E_{y_r}[E_y[·]] as E_{y_r y}[·], and where the last simplification is justified by the fact that R_1(d, h_1(d)) does not depend on y_r(d).

E_{y_r}[R_2(d, h_2(d))] is computed by assigning probabilities to the events c_r (i.e., "the reviewer annotates d as responsive") and \bar{c}_r ("the reviewer annotates d as nonresponsive"). To do this, the best we can do is to "trust" our classifiers and assume that d will be annotated as responsive with probability Pr_1(c_r|d) and nonresponsive with probability Pr_1(\bar{c}_r|d). Each of these probabilities is multiplied by the misclassification risk that the annotation would bring about, i.e.,

E_{y_r}[R_2(d, h_2(d))] = R_2(d, h_2(d)|c_r) · Pr_1(c_r|d) + R_2(d, h_2(d)|\bar{c}_r) · Pr_1(\bar{c}_r|d)    (5.14)

where by R_2(d, h_2(d)|c_r) we indicate the misclassification risk that would result from assuming that Pr_2(c_r|d) = 1 and Pr_2(c_p|d) = Pr_1(c_p|d), and by R_2(d, h_2(d)|\bar{c}_r) we indicate the misclassification risk that would result from assuming that Pr_2(c_r|d) = 0 and Pr_2(c_p|d) = Pr_1(c_p|d).

Equation 5.13 finally gives us a concrete method for ranking the automatically classified documents: for each d ∈ D compute E_{y_r y}[Δ^o_r(d)] (the expected increase in overall cost brought about by annotating d for responsiveness), and rank the documents in D according to their E_{y_r y}[Δ^o_r(d)] score, top-ranking those with the lowest scores. This guarantees that the reviewer will first annotate the documents characterized by the highest expected reduction in cost that manually annotating them would bring about. In turn this guarantees that, whatever the number τ_r of documents that the reviewers annotate, the expected cost-effectiveness of the annotation work will be maximized.

Equation 5.13 also gives us a concrete method for addressing point (ii) above, i.e., for setting the τ_r threshold. The overall cost C^o(d) is expected to decrease as a result of annotating d (i.e., E_{y_r y}[Δ^o_r(d)] < 0) when the cost λ^a_r of annotating d is more than offset by the expected reduction R_1(d, h_1(d)) − E_{y_r}[R_2(d, h_2(d))] in misclassification cost that annotating d brings about; conversely, if E_{y_r y}[Δ^o_r(d)] ≥ 0, then the expected reduction in misclassification cost is not worth the additional annotation effort. Therefore, the criterion we adopt in order to decide when to stop annotating is:

Stopping condition (responsiveness). Let d be the document at the k-th rank position. If E_{y_r y}[Δ^o_r(d)] < 0, then annotate d for responsiveness and move on to the document at the (k+1)-th rank position, else stop annotating.
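The ranking score of Equations 5.13 and 5.14 can be sketched as follows; the code reuses the illustrative cost matrix and the assumed three-class decomposition from the previous sketch, and is again not the experimental implementation.

```python
# A sketch of the Phase 2 ranking score E_{y_r y}[Delta^o_r(d)] (Equations 5.13-5.14).
COST = {
    "P": {"P": 0.0, "L": 600.0, "W": 5.0},
    "L": {"P": 150.0, "L": 0.0, "W": 3.0},
    "W": {"P": 15.0, "L": 15.0, "W": 0.0},
}
LAMBDA_A_R = 1.0   # unit cost lambda^a_r of annotating one document for responsiveness


def min_risk(p_r, p_p):
    """R(d, h(d)): the risk of the risk-minimizing class (Equation 5.10)."""
    post = {"P": p_r * (1.0 - p_p), "L": p_r * p_p, "W": 1.0 - p_r}
    return min(sum(COST[i][j] * post[j] for j in "PLW") for i in "PLW")


def expected_delta_r(p_r, p_p):
    """Expected change in overall cost from annotating d for responsiveness."""
    risk_phase1 = min_risk(p_r, p_p)                       # R_1(d, h_1(d))
    # Trust the classifier: d is judged responsive with probability Pr_1(c_r|d),
    # in which case Pr_2(c_r|d) = 1, and nonresponsive otherwise (Pr_2(c_r|d) = 0).
    expected_risk_phase2 = p_r * min_risk(1.0, p_p) + (1.0 - p_r) * min_risk(0.0, p_p)
    return expected_risk_phase2 + LAMBDA_A_R - risk_phase1


if __name__ == "__main__":
    # Documents whose responsiveness is uncertain tend to get the most negative
    # (i.e., most annotation-worthy) scores under these illustrative values.
    for p_r in (0.05, 0.5, 0.95):
        print(p_r, round(expected_delta_r(p_r, p_p=0.01), 2))
```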
The rationale for this criterion is that a reviewer will annotate a document only if this action is expected to diminish overall cost. Since the likelihood of diminishing overall cost decreases the more we go down the ranking, it follows that we should choose τ_r to be

τ_r = |{d | E_{y_r y}[Δ^o_r(d)] < 0}|    (5.15)

At this point, in Phase 2 the human reviewer has manually annotated the τ_r documents characterized by the lowest value of E_{y_r y}[Δ^o_r(d)].

Figure 5.4: Phase 3 of the MINECORE Framework

Phase 3 of MINECORE, shown in Figure 5.4, does for privilege essentially what Phase 2 did for responsiveness. In Phase 3 the documents in D are again ranked, and the reviewer (typically: a senior lawyer) annotates the top-ranked τ_p documents for privilege. If the reviewer annotates d as privileged we set Pr_3(c_p|d) = 1, while if the reviewer annotates d as nonprivileged we set Pr_3(c_p|d) = 0; no annotation for responsiveness is performed in this phase, so Pr_2(c_r|d) = Pr_3(c_r|d). At this point, by using Equation 5.10, d is assigned a class h_3(d) ∈ {c_P, c_L, c_W}, which is possibly different from h_2(d). The documents d from the (τ_p + 1)-th position onwards are not manually annotated for privilege; for these documents, Pr_3(c_r|d) = Pr_2(c_r|d) and Pr_3(c_p|d) = Pr_2(c_p|d), which implies that h_3(d) = h_2(d). Class h_3(d) ∈ {c_P, c_L, c_W} is the final class assigned to d by MINECORE, and the class that determines whether the document is produced to the requesting party (h_3(d) = c_P), entered on the privilege log (h_3(d) = c_L), or withheld (h_3(d) = c_W).

The difference Δ^o_p(d) in overall cost that annotating d for privilege brings about is

Δ^o_p(d) = C^o_3(d) − C^o_2(d) = C^m_3(d) + C^a_3(d) − C^m_2(d) − C^a_2(d) = C^m_3(d) + λ^a_p − C^m_2(d)    (5.16)

Similarly to Equation 5.11, and for the same reasons, Equation 5.16 cannot be used directly as a criterion for ranking D. At the time of ranking D we thus compute the expected difference in cost

E_y[Δ^o_p(d)] = E_y[C^m_3(d) + λ^a_p − C^m_2(d)] = E_y[C^m_3(d)] + λ^a_p − E_y[C^m_2(d)] = R_3(d, h_3(d)) + λ^a_p − R_2(d, h_2(d))    (5.17)

Due to the fact that the value of y_p(d) (a binary variable that indicates whether, if the reviewer had to annotate d, s/he would deem it privileged or not) is not known at the time of ranking, we must compute an expectation of E_y[Δ^o_p(d)] over the y_p(d) random variable, i.e.,

E_{y_p y}[Δ^o_p(d)] = E_{y_p}[R_3(d, h_3(d)) + λ^a_p − R_2(d, h_2(d))] = E_{y_p}[R_3(d, h_3(d))] + λ^a_p − R_2(d, h_2(d))    (5.18)

where we have shortened E_{y_p}[E_y[·]] as E_{y_p y}[·]. To compute E_{y_p}[R_3(d, h_3(d))], we assume that d will be annotated as privileged with probability Pr_1(c_p|d) and nonprivileged with probability Pr_1(\bar{c}_p|d), thus bringing about

E_{y_p}[R_3(d, h_3(d))] = R_3(d, h_3(d)|c_p) · Pr_1(c_p|d) + R_3(d, h_3(d)|\bar{c}_p) · Pr_1(\bar{c}_p|d)    (5.19)

Analogously to Equation 5.13, Equation 5.18 now gives us a concrete method for ranking the documents: rank the documents in D according to their E_{y_p y}[Δ^o_p(d)] score, top-ranking those with the lowest scores.
The same equation also gives us a concrete method for setting the τ_p threshold: along the same lines discussed for Phase 2, the criterion we adopt in order to decide when to stop annotating is:

Stopping condition (privilege). Let d be the document at the k-th rank position. If E_{y_p y}[Δ^o_p(d)] < 0, then manually annotate d for privilege and move on to the document at the (k + 1)-th rank position, else stop annotating.

and we should choose τ_p to be

τ_p = |{d | E_{y_p y}[Δ^o_p(d)] < 0}|    (5.20)

To summarize, Equations 5.13 and 5.18 give us a concrete method for ranking the automatically classified documents. The rank order guarantees that the assessor will first annotate the documents characterized by the highest reduction in expected cost that manually annotating them would bring about. In turn this guarantees that, whatever the total number of documents that the assessors annotate, the expected cost-effectiveness of the annotation work will be maximized in both Phase 2 and Phase 3. The same equations also give us a criterion for deciding when to stop the manual annotation: let d be the document at the k-th rank position, and suppose we are deciding whether to annotate d or to stop annotating. The criterion we adopt is: if the expected change in overall cost for d is negative, annotate d and move on to the document at the (k+1)-th rank position; otherwise stop annotating. In other words, an assessor will annotate a document only if the expected reduction in misclassification cost exceeds the unit annotation cost of that document, i.e., only if the cost of annotating the document is expected to be offset by a decrease in misclassification cost.
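The thresholding rule of Equations 5.15 and 5.20 amounts to ranking by the expected cost change and counting the documents with negative scores; a minimal sketch (with hypothetical document ids and scores) follows.

```python
# A sketch of Equations 5.15 and 5.20: rank by expected change in overall cost and
# annotate only the documents whose score is negative (illustrative only).
def select_for_annotation(expected_deltas):
    """Return (tau, ranking): the threshold and the ranked document ids."""
    ranking = sorted(expected_deltas, key=expected_deltas.get)      # most negative first
    tau = sum(1 for delta in expected_deltas.values() if delta < 0)
    return tau, ranking


if __name__ == "__main__":
    scores = {"d1": -4.2, "d2": 0.3, "d3": -0.7, "d4": 1.9}   # hypothetical E[Delta^o(d)]
    tau, ranking = select_for_annotation(scores)
    print(tau, ranking[:tau])   # 2 ['d1', 'd3']: only these are manually annotated
```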
5.4.2 Algorithm & Evaluation Plan Overview

The overall algorithm of our MINECORE model can be depicted as shown in Figure 5.1. In Phase 1 of our model we train two automated classifiers, h_r (which classifies according to responsiveness) and h_p (which classifies according to privilege), from the training data (which we assume is available at zero labeling cost), and we apply them to the set D. The documents in D are ranked using Equation 5.11 (Section 5.4.1). In Phase 2, the assessor (typically: a junior lawyer) annotates the top-ranked τ_r documents for responsiveness. As shown in Figure 5.5, at this point we have M_R documents manually annotated as responsive and M_NR documents manually annotated as not responsive. The documents that are not manually annotated in Phase 2 fall into two categories based on our automatic classifier results: A_R documents automatically classified as responsive and A_NR documents automatically classified as not responsive. At the end of Phase 2, we obtain a responsive document set D′ = (M_R + A_R) which is then passed on to Phase 3 for manual privilege annotation. In Phase 3, the documents in D′ are ranked using Equation 5.16 and the assessor (typically: a senior lawyer) annotates a total of τ_p documents for privilege. After the Phase 3 annotation step, similarly to Phase 2, we have M_P documents manually annotated as privileged and M_NP documents manually annotated as not privileged. We divide the documents that are not manually annotated in Phase 3 into two categories based on our automatic classifier results: A_P documents automatically classified as privileged and A_NP documents automatically classified as not privileged.

Figure 5.5: Model Parameters. M_R / M_P: number of documents manually annotated as relevant/privileged; M_NR / M_NP: number of documents manually annotated as not relevant/not privileged; A_R / A_P: number of documents automatically annotated as relevant/privileged; A_NR / A_NP: number of documents automatically annotated as not relevant/not privileged; τ_r / τ_p: thresholds for manual annotation during Phase 2/Phase 3; Γ_r / Γ_p: thresholds for automatic annotation during Phase 2/Phase 3; P: produce; L: log; W: withhold.

Input: a training set Tr_r of documents labeled for responsiveness; a training set Tr_p of documents labeled for privilege; documents D to be analysed for possible production to the requesting party; cost structure Λ = (Λ^m, Λ^a).
Output: a partition of D into three sets (the set D_P of documents to be produced to the receiving party, the set D_L of documents to be put on the privilege log, and the set D_W of documents to be withheld), plus the annotation cost C^a(D) incurred in the process.

/* Phase 1 */
Train classifiers h_r and h_p from Tr_r and Tr_p, respectively;
for d ∈ D do
    Compute Pr_1(c_r|d) by means of h_r and Pr_1(c_p|d) by means of h_p;
    Compute h_1(d) via Equation 5.10;
end

/* Phase 2 */
for d ∈ D do
    Pr_2(c_r|d) ← Pr_1(c_r|d); Pr_2(c_p|d) ← Pr_1(c_p|d);
    Compute E_{y_r y}[Δ^o_r(d)] using Equation 5.13;
end
Generate a ranking R^r_D of D in increasing order of E_{y_r y}[Δ^o_r(d)];
/* R^r_D(k) denotes the document at the k-th rank position in R^r_D */
k ← 1; τ_r ← 0;
while E_{y_r y}[Δ^o_r(R^r_D(k))] < 0 do
    Have the reviewer annotate document R^r_D(k) for responsiveness;
    if R^r_D(k) has been judged responsive by the reviewer then Pr_2(c_r|R^r_D(k)) ← 1
    else Pr_2(c_r|R^r_D(k)) ← 0;
    τ_r ← τ_r + 1; k ← k + 1;
end
for d ∈ D do Compute h_2(d) using Equation 5.10; end

/* Phase 3 */
for d ∈ D do
    Pr_3(c_r|d) ← Pr_2(c_r|d); Pr_3(c_p|d) ← Pr_2(c_p|d);
    Compute E_{y_p y}[Δ^o_p(d)] using Equation 5.18;
end
Generate a ranking R^p_D of D in increasing order of E_{y_p y}[Δ^o_p(d)];
/* R^p_D(k) denotes the document at the k-th rank position in R^p_D */
k ← 1; τ_p ← 0;
while E_{y_p y}[Δ^o_p(R^p_D(k))] < 0 do
    Have the reviewer annotate document R^p_D(k) for privilege;
    if R^p_D(k) has been judged privileged by the reviewer then Pr_3(c_p|R^p_D(k)) ← 1
    else Pr_3(c_p|R^p_D(k)) ← 0;
    τ_p ← τ_p + 1; k ← k + 1;
end
for d ∈ D do Compute h_3(d) using Equation 5.10; end

D_P ← {d | h_3(d) = c_P}; D_L ← {d | h_3(d) = c_L}; D_W ← {d | h_3(d) = c_W};
Compute C^a(D) using Equation 5.8.

Algorithm 1: MINECORE, a model for MINimizing the Expected COsts of REview for responsiveness and privilege.

Equation 5.9 is the primary evaluation function we will adopt. When running the experiment (in which we indeed know the labels of the test documents) we will compute, at the end of the process, the overall cost C^o(D) of the process for each of our 6 baseline models and MINECORE. We compute the cost of manual annotation C^a(D) using Equation 5.8, and the cost of misclassification C^m(D) using estimates as in Equation 5.7. Since, due to the involvement of an automatic classification component, we are in the presence of uncertainty, in developing our MINECORE method we use a risk minimization approach, where we try to minimize an expectation of the overall cost described in Equation 5.9; i.e., we want to minimize

E[C^o(D)] = E[C^m(D) + C^a(D)]    (5.21)

where E[·] stands for "expected value". Note that minimizing E[C^m(D) + C^a(D)] does not reduce to minimizing E[C^m(D)] and E[C^a(D)] separately, since C^m(D) and C^a(D) are not independent.
That is, we can easily bring C^m(D) down to zero by choosing to manually annotate all documents, which would however make C^a(D) very high; and we can easily bring C^a(D) down to zero by choosing to automatically classify all documents, which would however make C^m(D) very high. Our attempt is thus to jointly minimize E[C^m(D)] and E[C^a(D)]. The overall algorithm that implements MINECORE is summarized in Algorithm 1.

5.5 Other baselines

We now propose some baseline methods against which to compare MINECORE. Throughout this chapter we use the same vector representations for the documents, the same supervised learning algorithm, and the same classifier outputs for all the methods being compared. Each method (be it MINECORE or a baseline method) assigns, for each test document d, a class in C = {c_P, c_L, c_W}.

Our baseline methods are (aside from the fully automated and fully manual solutions) mixed-initiative, "human-in-the-loop" systems, i.e., their classification decisions are obtained via some combination of manual annotation work and automatic classification. Using the cost structures exemplified in Table 5.2 we can evaluate each system using the evaluation measure described in Equation 5.21; that is, for each system we compute the misclassification cost C^m(D), the annotation cost C^a(D), and the overall cost C^o(D) = C^m(D) + C^a(D) that they incur. The best system is the one with the lowest C^o(D).

5.5.1 Uncertainty Ranking

In Uncertainty Ranking, or UR, we first annotate for responsiveness the τ_r documents whose Pr(c_r|d) is closest to 0.5 (i.e., the ones whose responsiveness is most uncertain). A document is then deemed responsive if the reviewer has annotated it as such, or (for the documents which have not been manually annotated for responsiveness) if Pr(c_r|d) > 0.5. We then annotate for privilege, among the documents that have been deemed responsive, the τ_p documents whose Pr(c_p|d) is closest to 0.5. A document is then deemed privileged if the reviewer has annotated it as such, or (for the documents which have not been manually annotated for privilege) if Pr(c_p|d) > 0.5. This baseline is similar to MINECORE in that the class assigned to a test document may result from the reviewers' manual annotation work, from the automated classifiers, or from a combination of the two. However, neither annotation costs nor misclassification costs play a role in UR.

5.5.2 Relevance Ranking

In Relevance Ranking, or RR, we first annotate for responsiveness the τ_r documents with the highest Pr(c_r|d), and we then annotate for privilege, among the documents that the reviewers have deemed responsive in the previous phase, the τ_p documents with the lowest Pr(c_p|d). Unlike MINECORE and UR, RR assumes that only the documents that have been certified responsive and nonprivileged by the reviewers are going to be produced (documents certified responsive and privileged by the reviewers are entered on the privilege log, while all other documents are withheld); as a result, the two rankings (by Pr(c_r|d) and Pr(c_p|d)) attempt to top-rank the documents that have the highest chances of meeting the requirements (responsiveness and nonprivilege) for disclosure.
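The selection rules used by UR and RR can be sketched compactly as follows (illustrative code with hypothetical posterior values):

```python
# A sketch of the document-selection rules of the UR and RR baselines (illustrative only).
def uncertainty_ranking(posteriors, tau):
    """UR: pick the tau documents whose posterior is closest to 0.5."""
    return sorted(posteriors, key=lambda d: abs(posteriors[d] - 0.5))[:tau]


def relevance_ranking(posteriors, tau, highest_first=True):
    """RR: pick the tau documents with the highest posterior (or lowest, for privilege)."""
    return sorted(posteriors, key=posteriors.get, reverse=highest_first)[:tau]


if __name__ == "__main__":
    pr_r = {"d1": 0.92, "d2": 0.51, "d3": 0.07, "d4": 0.48}   # hypothetical Pr(c_r|d)
    print(uncertainty_ranking(pr_r, tau=2))   # ['d2', 'd4']
    print(relevance_ranking(pr_r, tau=2))     # ['d1', 'd2']
```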
5.5.3 Active Learning via Uncertainty Sampling

In the design of MINECORE our focus has been on cases in which we have already built the best classifier that we can, and in such cases we would not expect further gains from active learning. In our experiments, however, we have simply trained on a fixed set of documents, and it is possible that active learning might indeed give further gains. This motivates our choice to include ALvUS and ALvRS (see below) as additional baselines.

In ALvUS, an interactive process asks the reviewer to annotate for responsiveness the k documents in D for which Pr(c_r|d) is closest to 0.5; at this point, this set D′ of k documents is added to the training set, the posterior probabilities Pr(c_r|d) of the documents d annotated as responsive are set to 1, h_r is retrained, and D \ D′ is classified for responsiveness again; this process is repeated (using the newly computed Pr(c_r|d) values) until exactly τ_r documents have been annotated.² After this, an identical process is used for privilege, substituting h_p and τ_p for h_r and τ_r in the above. At the end, a document d ∈ D is assigned to c_P iff Pr(c_r|d) > 0.5 and Pr(c_p|d) ≤ 0.5; to c_L iff Pr(c_r|d) > 0.5 and Pr(c_p|d) > 0.5; and to c_W otherwise. ALvUS is similar to MINECORE and UR, in that the class assigned to a test document may result from the reviewers' manual annotation work, from the automated classifiers, or from a combination of the two. In the experiments reported in this chapter we use k = 1000, which was found to work well by [26]. Note that the comparison between MINECORE and ALvUS is only partially fair, since ALvUS is much more expensive computationally, given that it requires ⌈τ_r/k⌉ + ⌈τ_p/k⌉ retraining operations (unlike MINECORE, which requires none).

² To be more precise, in the last iteration fewer than k documents may be annotated, so as to make the total number of documents annotated equal to τ_r. For example, if τ_r = 3267 and k = 1000, 1000 documents will be annotated in each of the first three rounds, while in the final round only 267 documents will be annotated.

5.5.4 Active Learning via Relevance Sampling

A variant of the previous baseline is obtained if the active learning process asks the reviewer to annotate for responsiveness the k documents in D for which Pr(c_r|d) is highest (and the ones for which Pr(c_p|d) is lowest when the reviewer annotates for privilege). The rest of the process is as in ALvUS; in particular, here too we use k = 1000. At the end, a document d ∈ D is assigned to c_P iff it has been manually annotated as responsive and nonprivileged; it is assigned to c_L iff it has been manually annotated as responsive and privileged; it is assigned to c_W otherwise. Unlike ALvUS, ALvRS thus assumes that, unless a document has been under the scrutiny of both the junior reviewer (for responsiveness) and the senior reviewer (for privilege), it is withheld. Among e-discovery researchers and practitioners, ALvRS is known as "continuous active learning" (CAL) [26, 27, 30]; it was originally introduced in [46], where it was indeed called "Relevance Sampling".³ The latter paper is also the work in which ALvUS was first introduced, under the name of "Uncertainty Sampling".
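The batch-mode loop shared by ALvUS and ALvRS can be sketched as follows; train_and_score is a hypothetical stand-in for retraining the classifier on the enlarged training set and re-scoring the remaining pool, and the sketch is not the implementation used in our experiments.

```python
# A sketch of the ALvUS loop; ALvRS differs only in how each batch is selected.
def active_learning(pool, reviewer_labels, train_and_score, tau, k=1000, uncertainty=True):
    annotated = {}                                    # doc id -> reviewer's judgment
    pool = set(pool)
    while len(annotated) < tau and pool:
        probs = train_and_score(annotated, pool)      # one retraining per batch
        batch_size = min(k, tau - len(annotated))     # the last batch may be smaller
        if uncertainty:                               # ALvUS: most uncertain documents first
            batch = sorted(pool, key=lambda d: abs(probs[d] - 0.5))[:batch_size]
        else:                                         # ALvRS: highest-scoring documents first
            batch = sorted(pool, key=lambda d: -probs[d])[:batch_size]
        for d in batch:
            annotated[d] = reviewer_labels[d]         # simulate the reviewer's judgment
            pool.remove(d)
    return annotated
```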
Note that for every baseline system other than FA and FM we compute the cost C^o(D) that the baseline incurs when manually annotating exactly τ_r documents for responsiveness and, if possible,⁴ τ_p documents for privilege, where τ_r and τ_p are the values used in the MINECORE system. This policy may be biased in favour of MINECORE, since τ_r and τ_p are optimal settings for MINECORE whereas other systems might have yielded lower overall costs with either more or less manual reviewing. However, none of the baseline systems we test has an a priori way of analytically setting the optimal number of documents to manually review. This means that our comparisons are, if not to post-hoc optimal systems, at least to reasonable systems.

³ CAL, as described in [26, 27, 30], is actually a simpler variant of ALvRS since it deals with one classification task only (i.e., responsiveness), instead of the two cascaded tasks (i.e., responsiveness and privilege) that ALvRS deals with.

⁴ In some cases a baseline system might deem responsive fewer than τ_p documents, which means that fewer than τ_p documents (i.e., all the ones deemed responsive) would be annotated for privilege; in this case the comparison between this baseline system and all other systems (including MINECORE) is still fair, though, since this system will incur a smaller annotation cost (for privilege) than MINECORE.

5.6 Experiments

In this section we describe a number of experiments that we have conducted to test the cost-effectiveness of MINECORE.

5.6.1 Test Collection

One problem that hinders the evaluation of MINECORE is that, in the world of e-discovery, at present there is no publicly available test collection that is annotated for both responsiveness and privilege. The TREC 2010 Legal Track included a privilege topic and several responsiveness topics, but each topic was independently sampled, so there are very few privilege annotations on documents that were also annotated for relevance. Chapter 3 further discusses the issues with the TREC 2010 collection. One solution is to generate such an annotated collection ourselves; however, this would be a major feat in terms of annotation cost, since it takes real lawyers to do this annotation, and real lawyers (especially senior ones, whom we would need in order to annotate for privilege) can be extremely expensive. We bypass this problem by running "simulated" experiments on a collection unrelated to e-discovery in which documents can belong to more than one class, and by repeatedly picking two classes to play the role of c_r and c_p, respectively.

As a test collection we have chosen RCV1-v2, a standard, publicly available benchmark for text classification first presented in [47] and consisting of 804,414 news stories produced by Reuters from 20 Aug 1996 to 19 Aug 1997.⁵ RCV1-v2 ranks as one of the largest corpora currently used in text classification research. RCV1-v2 is multi-label, i.e., a document may belong to several classes at the same time, which makes it suitable for our purposes. In [47] the collection is partitioned into a training set of 23,149 documents and a test set of 781,265 documents, the latter being split into four chunks of 199,328, 199,339, 199,576, and 183,022 documents, respectively. In the experiments reported in this chapter we have used the 23,149 training documents as the training set Tr, and the first chunk of 199,328 test documents as the test set Te.

⁵ http://trec.nist.gov/data/reuters/reuters.html

In the topic hierarchy of RCV1-v2 there are 103 classes, of which 101 have at least one positive training example. Since we experiment with pairs of classes (representing c_r and c_p), we could in principle experiment with 101² = 10,201 different pairs. Aside from representing a substantial computational load, this would also mean experimenting with classes whose prevalence is, for many e-discovery scenarios, not realistic.
We have therefore limited our experiments to pairs (c_r, c_p) such that the test set prevalence of c_r (i.e., Pr(c_r|Tr)) is in [0.03, 0.07] and the prevalence of c_p in the responsive documents (i.e., Pr(c_p|c_r, Tr)) is in [0.01, 0.20]. These values are representative of some e-discovery settings, and they yield a sufficient number of positive labels for our experiments. For each of the 24 responsiveness classes that meet the responsiveness prevalence criterion we have randomly selected 5 privilege classes that meet the privilege prevalence criterion. This gives rise to 120 class pairs, which is the set we use for the experiments described in this chapter.

5.6.2 The learning algorithm

For all the experiments reported in this chapter we have used Support Vector Machines (SVMs) as the classifier, since they have consistently delivered strong performance in text classification. We have used the well-known SVMlight implementation, for which we have used the default parameter values [38, 39]. Concerning the vector representations fed to the SVM learner, we have used the ones made available during the creation of the Reuters collection [47]. We refer to that work for details on the preprocessing techniques that were used to generate them.

SVMs return confidence scores that are not posterior probabilities; these scores must thus be converted into posterior probabilities, since MINECORE essentially depends on the availability of posterior probabilities. Given that the returned scores are a monotonically increasing function of the classifier's confidence that the document belongs to the class, this conversion may be obtained by applying a logistic function to the scores, since such a function has a sigmoidal shape. We obtain well-calibrated posterior probabilities (defined as posterior probabilities Pr(c|d) such that, for a class c and a set s of documents, the average of Pr(c|d) over d ∈ s equals the class prevalence Pr(c|s)) by using a generalized logistic function and optimizing its slope parameter; for this optimization we follow exactly the same process as described in [16], to which we refer the reader for details.
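As a rough illustration of the general idea behind this calibration step (not the exact procedure of [16]), the sketch below chooses the slope of a logistic mapping from classifier scores to posteriors so that the average posterior matches a target prevalence; the score distribution and the grid of candidate slopes are hypothetical.

```python
import numpy as np

# A rough sketch of prevalence-matching calibration (not the exact procedure of [16]):
# map classifier scores to posteriors with a logistic function and choose the slope
# sigma that brings the mean posterior closest to the known class prevalence.
def calibrate_slope(scores, prevalence, slopes=np.logspace(-2, 1, 150)):
    scores = np.asarray(scores, dtype=float)
    best_sigma, best_gap = None, np.inf
    for sigma in slopes:
        posteriors = 1.0 / (1.0 + np.exp(-sigma * scores))
        gap = abs(posteriors.mean() - prevalence)
        if gap < best_gap:
            best_sigma, best_gap = sigma, gap
    return best_sigma


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    svm_scores = rng.normal(loc=-3.0, scale=1.0, size=1000)   # hypothetical SVM margins
    sigma = calibrate_slope(svm_scores, prevalence=0.05)
    calibrated = 1.0 / (1.0 + np.exp(-sigma * svm_scores))
    print(round(float(calibrated.mean()), 3))   # approximately 0.05
```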
5.6.3 Cost structures

In order to use realistic misclassification costs and annotation costs, we have chosen to elicit our cost structures from e-discovery experts. We were able to obtain the help of three senior members of the e-discovery community: two lawyers and a technical expert in technology-assisted review, each of whom has extensive experience with actual e-discovery cases. We asked the two lawyers to think of an actual case they were familiar with, and to articulate the cost structure that characterizes that case. To be sure we understood their cost values, we conducted a 60-minute call with each of the two lawyers, during which they explained the rationale behind their choices.

We took a different approach to gathering the cost structure inputs from the e-discovery professional who is an expert in TAR. We developed a questionnaire (for details see Appendix B) with a total of eight questions. Answers to all eight questions were made mandatory, since a partially filled out questionnaire would be less useful to us. Each question except the first has three possible answers; the task was to pick a single answer by ticking one of the three boxes, and then to fill in the requested relative cost value. The expert answering the questionnaire was allowed to make any assumption about the type of case and the amounts at stake in the case, but was required to make the same assumptions for every question.

Through this process we obtained 3 cost structures, which are detailed in Table 5.2. Each individual cost is expressed in US$. Note that the values indicated by the 3 experts are in some cases markedly different (e.g., there is a factor of 150 between the values of λ_LP indicated by two of the experts); this need not be taken as evidence of disagreement among the experts for decisions on the same task, since different experts were free to choose different legal cases to have in mind when arriving at these estimates. Rather than trying to reconcile these cost structures in any way, we have thus run 3 experiments, one for each of the cost structures, on the assumption that MINECORE should be able to cater to different application needs.

Table 5.2: Cost structure values in US$.

                 λ^a_r   λ^a_p    λ_PL    λ_PW    λ_LP    λ_LW    λ_WP    λ_WL
CostStructure1    1.00    5.00   600.00    5.00   150.00    3.00   15.00   15.00
CostStructure2    1.00    5.00   100.00    0.03    10.00    2.00    8.00    8.00
CostStructure3    1.00    5.00  1000.00    0.10     1.00    1.00    1.00    1.00

5.6.4 Experimental protocol

The experimental protocol we adopt is the following. As groundwork, we train our binary classifiers via the chosen binary learner using the 23,149 training documents, and apply them to the 199,328 test documents (the test set Te thus plays the role of our universe D). For each document d ∈ Te, the classifier for class c generates a confidence score, from which we obtain a posterior probability Pr(c|d) via probability calibration. At this point, we run each of the seven methods (MINECORE plus the six baseline methods) for each of the cost structures (see Table 5.2) we have elicited from the experts.

In particular, for the risk minimization method, we first simulate the manual annotation process for responsiveness: for all d ∈ D such that E_{y_r y}[Δ^o_r(d)] < 0 we set Pr_2(c_r|d) to 1 if d is responsive and to 0 if d is nonresponsive. We then do the same for privilege: for all d ∈ D such that E_{y_p y}[Δ^o_p(d)] < 0 we set Pr_3(c_p|d) to 1 if d is privileged and to 0 if d is nonprivileged. We then compute the total cost of the process via Equation 5.21.

5.7 Results

In this section we present the results of testing MINECORE against the 6 baseline methods presented in Section 5.5, on the 120 class pairs described at the end of Section 5.6.1; we have run each such experiment for each of the 3 cost structures discussed in Section 5.6.3. In Table 5.3 we exemplify, on a sample cost structure (CostStructure1), what the results look like. The table reports the class prevalences of c_r and c_p, the values of τ_r and τ_p that MINECORE returns, and the C^o(D) value (expressed in thousands of US$) resulting from each of the seven methods for 80 class pairs (due to space constraints).
For each of the 6 baseline methods, we also report the increase in Co(D) value with respect 95 Table 5.3: Results obtained from CostStructure1 Pr Pr | FA FM UR RR ALvUS ALvRS RMcr cp (cr) (cp cr) τp τr Co(D) ∆ Co(D) ∆ Co(D) ∆ Co(D) ∆ Co(D) ∆ Co(D) ∆ Co(D) 1 M12 M14 3% 1% 3257 1100 26 +13% 227 +865% 29 +22% 34 +45% 30 +28% 33 +41% 23 2 M12 CCAT 3% 5% 1738 1997 49 +36% 227 +533% 58 +63% 60 +68% 65 +82% 59 +65% 36 3 M12 M132 3% 7% 2889 1201 60 +38% 227 +424% 68 +57% 68 +57% 65 +51% 67 +54% 43 4 M12 E21 3% 11% 2048 2063 72 +44% 227 +353% 85 +71% 84 +68% 87 +73% 83 +66% 50 5 M12 M131 3% 18% 2726 1400 180 +30% 227 +64% 192 +39% 189 +36% 196 +41% 177 +29% 139 6 M132 GPOL 3% 1% 2254 1227 30 +25% 229 +859% 33 +39% 38 +59% 34 +44% 36 +54% 24 7 M132 CCAT 3% 2% 1794 2300 41 +26% 229 +596% 52 +58% 55 +66% 50 +51% 54 +66% 33 8 M132 M12 3% 6% 2360 1828 37 +12% 229 +588% 43 +30% 48 +45% 41 +25% 47 +42% 33 9 M132 M131 3% 7% 2506 1685 68 +29% 229 +332% 79 +49% 78 +48% 78 +47% 73 +38% 53 10 M132 GCAT 3% 15% 2258 1152 41 +25% 229 +592% 46 +40% 49 +48% 49 +47% 48 +46% 33 11 M131 CCAT 3% 1% 1141 2797 52 +34% 231 +490% 67 +71% 67 +72% 65 +65% 66 +70% 39 12 M131 M132 3% 6% 1709 1528 63 +27% 231 +365% 78 +56% 72 +44% 65 +30% 69 +40% 50 13 M131 E12 3% 7% 1309 2066 83 +36% 231 +280% 94 +55% 93 +52% 103 +69% 93 +53% 61 14 M131 ECAT 3% 9% 822 3334 95 +61% 231 +291% 111 +88% 112 +90% 112 +88% 108 +83% 59 15 M131 M12 3% 15% 1465 1823 75 +34% 231 +313% 80 +44% 82 +47% 91 +63% 82 +47% 56 16 E12 M11 3% 1% 8371 437 55 +32% 232 +458% 45 +7% 47 +14% 46 +10% 46 +12% 42 17 E12 GDIP 3% 3% 7135 1334 73 +22% 232 +290% 73 +22% 74 +25% 72 +21% 77 +29% 60 18 E12 E212 3% 4% 7135 1336 71 +30% 232 +323% 71 +29% 75 +36% 71 +29% 74 +35% 55 19 E12 M131 3% 7% 7639 1467 87 +35% 232 +261% 92 +42% 96 +49% 102 +58% 93 +45% 64 20 E12 E21 3% 13% 5589 1769 99 +33% 232 +210% 110 +47% 112 +49% 111 +48% 114 +52% 75 21 C21 C17 4% 1% 5862 1101 78 +18% 235 +254% 76 +14% 79 +19% 72 +9% 75 +13% 66 22 C21 C15 4% 3% 4610 1651 84 +11% 235 +211% 88 +16% 90 +19% 85 +13% 87 +15% 75 23 C21 ECAT 4% 5% 3084 2184 93 +10% 235 +180% 104 +24% 104 +24% 95 +13% 102 +23% 84 24 C21 C31 4% 18% 2037 2298 104 +15% 235 +159% 117 +29% 116 +27% 130 +43% 121 +32% 91 25 C21 M141 4% 20% 7052 389 103 +15% 235 +162% 99 +10% 101 +12% 98 +9% 99 +10% 90 26 E212 GPOL 4% 2% 2527 3592 47 +3% 236 +416% 62 +35% 67 +47% 58 +27% 66 +46% 46 27 E212 E12 4% 4% 2357 1410 40 +8% 236 +543% 45 +23% 48 +30% 46 +25% 49 +32% 37 28 E212 M12 4% 8% 2312 1805 70 +31% 236 +342% 78 +47% 78 +47% 85 +60% 80 +52% 53 29 E212 MCAT 4% 9% 2059 3171 73 +23% 236 +297% 90 +51% 91 +53% 95 +59% 89 +50% 59 30 E212 C17 4% 19% 1967 2574 61 +11% 236 +327% 74 +34% 74 +35% 82 +48% 75 +37% 55 31 GCRIM E212 4% 1% 6001 815 44 +46% 237 +693% 39 +31% 49 +65% 37 +23% 46 +54% 30 32 GCRIM C15 4% 2% 4533 3118 57 +18% 237 +390% 68 +41% 76 +56% 71 +47% 73 +53% 48 33 GCRIM C18 4% 3% 4909 2088 48 +24% 237 +513% 53 +37% 61 +58% 50 +29% 59 +52% 39 34 GCRIM GDIP 4% 6% 3891 2930 80 +40% 237 +316% 91 +59% 96 +69% 86 +52% 96 +69% 57 35 GCRIM GPOL 4% 20% 2352 4572 105 +42% 237 +221% 124 +68% 129 +74% 128 +74% 129 +74% 74 36 C24 GDIP 4% 1% 9416 1624 77 +27% 240 +294% 67 +11% 71 +17% 62 +1% 66 +9% 61 37 C24 C15 4% 2% 6552 2979 89 +18% 240 +218% 94 +24% 100 +32% 90 +20% 94 +25% 75 38 C24 C31 4% 5% 4318 3803 106 +21% 240 +172% 122 +39% 126 +43% 118 +34% 122 +39% 88 39 C24 MCAT 4% 10% 7329 2090 118 +26% 240 +156% 124 +32% 129 +38% 126 +34% 128 +37% 94 40 C24 C21 4% 20% 3390 4063 142 +39% 240 +136% 159 +56% 154 +51% 204 +100% 192 +88% 102 41 GVIO C21 
4% 1% 4604 4661 63 +2% 242 +291% 77 +25% 88 +42% 63 +2% 74 +20% 62 42 GVIO C24 4% 1% 6015 2405 63 +24% 242 +374% 66 +29% 75 +48% 65 +27% 65 +27% 51 43 GVIO CCAT 4% 6% 3490 3540 84 +23% 242 +253% 101 +48% 103 +51% 87 +28% 100 +47% 68 44 GVIO ECAT 4% 6% 3156 4464 92 +26% 242 +231% 120 +64% 116 +59% 119 +63% 119 +64% 73 45 GVIO GCRIM 4% 13% 4667 2560 94 +36% 242 +251% 106 +54% 107 +55% 110 +61% 110 +59% 69 46 C13 M12 5% 1% 18998 452 104 +49% 247 +252% 79 +12% 79 +13% 78 +11% 76 +9% 70 47 C13 C15 5% 4% 12039 2243 128 +25% 247 +141% 130 +27% 130 +27% 132 +29% 134 +31% 102 48 C13 GPOL 5% 6% 10068 3383 127 +18% 247 +130% 136 +26% 138 +29% 137 +27% 140 +30% 107 49 C13 M14 5% 7% 16228 1283 116 +27% 247 +170% 105 +15% 108 +18% 105 +15% 108 +18% 91 50 C13 MCAT 5% 14% 11256 2488 135 +20% 247 +118% 147 +30% 150 +32% 152 +34% 148 +31% 113 51 GDIP C31 5% 1% 5321 5393 94 +27% 249 +238% 112 +52% 123 +66% 100 +35% 115 +56% 74 52 GDIP E12 5% 2% 6244 4334 82 +19% 249 +261% 95 +38% 106 +53% 87 +25% 97 +41% 69 53 GDIP CCAT 5% 5% 4060 4562 110 +32% 249 +200% 130 +57% 135 +63% 118 +42% 132 +59% 83 54 GDIP ECAT 5% 17% 3049 6279 150 +50% 249 +148% 178 +77% 184 +83% 199 +98% 181 +81% 101 55 GDIP GPOL 5% 19% 3209 5248 182 +37% 249 +88% 202 +53% 198 +50% 228 +72% 214 +62% 133 56 C31 C151 5% 4% 13069 2079 124 +28% 252 +159% 118 +21% 121 +25% 114 +17% 116 +19% 97 57 C31 C15 5% 10% 10168 2824 142 +23% 252 +119% 153 +33% 154 +34% 150 +30% 154 +34% 115 58 C31 ECAT 5% 11% 6230 3660 158 +21% 252 +93% 175 +34% 178 +36% 187 +43% 181 +39% 131 59 C31 C21 5% 12% 6961 3970 164 +14% 252 +75% 203 +41% 196 +36% 239 +66% 218 +51% 144 60 C31 M14 5% 20% 13516 1873 138 +13% 252 +105% 145 +18% 152 +24% 152 +24% 155 +25% 123 61 C181 C151 5% 2% 8194 4348 86 +19% 253 +249% 95 +32% 109 +50% 92 +28% 104 +44% 72 62 C181 GCAT 5% 5% 6513 4458 105 +34% 253 +221% 118 +50% 130 +65% 114 +45% 122 +55% 79 63 C181 C152 5% 10% 5277 6378 137 +42% 253 +160% 159 +64% 172 +78% 171 +76% 172 +77% 97 64 C181 C15 5% 11% 4647 6369 152 +42% 253 +135% 175 +63% 187 +74% 186 +73% 188 +75% 107 65 C181 C17 5% 12% 6612 4805 120 +29% 253 +171% 137 +47% 145 +56% 159 +71% 149 +60% 93 66 M141 ECAT 5% 1% 1054 5162 54 +21% 253 +467% 81 +81% 81 +81% 80 +80% 80 +78% 45 67 M141 GCAT 5% 4% 1258 3819 68 +39% 253 +417% 87 +77% 89 +81% 96 +95% 86 +76% 49 68 M141 C24 5% 5% 1320 4978 80 +44% 253 +359% 102 +84% 98 +79% 102 +85% 104 +89% 55 69 M141 C31 5% 12% 809 6315 129 +87% 253 +268% 151 +119% 148 +115% 166 +141% 164 +139% 69 70 M141 C21 5% 13% 1047 4413 107 +46% 253 +246% 117 +60% 106 +46% 114 +56% 125 +72% 73 71 M11 ECAT 5% 2% 2790 2704 64 +26% 254 +396% 77 +51% 80 +57% 76 +49% 77 +51% 51 72 M11 C152 5% 4% 1797 5573 121 +45% 254 +205% 144 +74% 149 +79% 155 +87% 150 +80% 83 73 M11 M132 5% 5% 3613 2058 68 +43% 254 +438% 74 +58% 81 +71% 66 +40% 79 +67% 47 74 M11 M13 5% 5% 3349 2883 89 +38% 254 +295% 100 +57% 106 +65% 87 +35% 101 +58% 64 75 M11 CCAT 5% 10% 1561 5486 125 +51% 254 +205% 149 +80% 154 +86% 158 +90% 153 +84% 83 76 E21 C31 5% 1% 5196 4511 78 +23% 254 +302% 94 +49% 104 +65% 87 +38% 103 +63% 63 77 E21 M12 5% 5% 6477 2506 89 +29% 254 +265% 95 +36% 103 +48% 108 +56% 107 +54% 70 78 E21 MCAT 5% 7% 4821 4715 106 +24% 254 +195% 131 +52% 134 +56% 133 +55% 132 +54% 86 79 E21 E12 5% 8% 4539 3272 107 +33% 254 +214% 123 +52% 121 +49% 136 +68% 129 +60% 81 80 E21 GPOL 5% 15% 4176 5495 129 +30% 254 +156% 154 +55% 155 +56% 172 +73% 156 +58% 99 Median values 4598 3145 94 +29% 248 +235% 106 +47% 107 +52% 104 +47% 108 +52% 73 96 to MINECORE (a positive increase means that the baseline generates 
higher costs than MINECORE). Table 5.3 shows the results obtained by using a sample cost structure (here: CostStructure1); C^o(D) denotes the cost incurred by the method, while ∆ denotes the percentage increase in cost with respect to MINECORE (e.g., +30% means that the cost of the method is 30% higher than that of MINECORE). For readability we indicate costs in thousands of US$, rounding them to the closest unit; e.g., $272,456 would be indicated as 272. MINECORE is here shortened as "RM" (for "Risk Minimization"), Fully Manual is shortened as "FM", Fully Automatic as "FA", Uncertainty Ranking as "UR", Relevance Ranking as "RR", and Active Learning via Uncertainty Sampling and via Relevance Sampling as "ALvUS" and "ALvRS", respectively. The last row reports median values across the 120 class pairs.

The table reveals that for this cost structure (CostStructure1), MINECORE is the least expensive of the seven methods for all 120 class pairs. An overall view of the relative merits of the 7 methods can be obtained by looking at the bottom row of the table, which reports median values computed across the 120 class pairs (throughout this chapter we look at medians, rather than at averages, in order to reduce the impact of outliers). In terms of the median values, the 2nd best method is (surprisingly enough) the FA method, which is 29% more expensive than MINECORE. Other methods are even more expensive, up to 235% more than MINECORE; among these other methods one can note a slight advantage of the uncertainty-based methods (UR and ALvUS) over the relevance-based ones (RR and ALvRS), while there seems to be no substantial difference between the methods which are based on active learning (ALvUS and ALvRS) and the ones which are not (UR and RR).

The values of τ_r range in the [809, 18998] interval, corresponding to [0.41%, 9.53%] of the total set of 199,328 documents; those of τ_p range instead in the [389, 7942] interval, corresponding to [0.20%, 3.98%] of the total set. This shows two important facts. First, MINECORE prescribes that only a small minority of the documents (at most 9.53% of the total for responsiveness, at most 3.98% for privilege) should be manually reviewed; this is in line with what e-discovery practitioners expect. Second, MINECORE requires many fewer documents to be manually annotated for privilege than for responsiveness; this is a consequence (a) of the fact that many documents are ruled out from further consideration on responsiveness grounds alone, and are not further checked for privilege; and (b) of the fact that manually reviewing for privilege is more expensive, and thus more strongly discouraged by MINECORE, than manually reviewing for responsiveness.

Figure 5.6: Overall costs with CostStructure1 as input (x axis: class pairs; y axis: overall cost in US$).

Figure 5.6 shows the overall costs with CostStructure1 for the 7 methods across the 120 class pairs, with the x axis sorted by decreasing cost for MINECORE (here shortened as "RM"). First, the cost of the FM baseline is quite high, varying in a narrow range in a manner that strictly depends on the prevalence of the responsiveness class. Second, while all of the baselines other than FM are systematically better than FM, none of them is systematically better or systematically worse than all of the others, as shown by the fact that their curves keep intersecting each other.
Third, MINECORE systematically outperforms all others, often by a substantial margin.

Table 5.4 shows the results obtained on a sample class pair (category GPOL as c_r and category CCAT as c_p) using the different cost structures of Table 5.2, i.e., it compares the results obtained for the different cost structures on a representative class pair.⁶

⁶ In this example responsiveness is simulated by RCV1-v2 class GPOL ("DomesticPolitics") while privilege is simulated by class CCAT ("Commercial/Industrial"); this class pair was chosen as representative since it is the one for which the median increase in overall cost (+47%) between MINECORE and a high-performing baseline (ALvUS) is obtained.

Table 5.4: Results obtained on the GPOL (as R)-CCAT (as P) class pair (C^o(D) in thousands of US$).

                  τ_r    τ_p   FA C^o(D)    ∆    FM C^o(D)     ∆    UR C^o(D)    ∆    RR C^o(D)    ∆    ALvUS C^o(D)    ∆    ALvRS C^o(D)    ∆    RM C^o(D)
CostStructure1   6169   6885     177      +32%     273      +105%     207      +55%     215      +61%      196       +47%      212       +59%      93
CostStructure2    918   1189      57       +3%     273      +397%      63      +14%      64      +16%       57        +3%       63       +14%      55
CostStructure3      0      0      15       +0%     273     +1714%      15       +0%      15       +0%       15        +0%       15        +0%      15

It is immediately obvious that the cost structure has a lot of influence (i) on how many documents get manually reviewed, both for responsiveness and for privilege, (ii) on the total costs incurred by the various methods, and (iii) on the difference in cost between these methods and MINECORE. In general CostStructure2 results in much smaller numbers of manually reviewed documents than CostStructure1; this is because (see Table 5.2) its misclassification costs are much smaller than in CostStructure1, which makes manual annotation less cost-effective. CostStructure3 is also an interesting limiting case, in that it results in τ_r = τ_p = 0; that is, MINECORE decrees that no document is worth manually annotating and that the decisions of the automatic classifiers should be used, which means that in this case MINECORE coincides with FA. The reason for this behavior lies in the fact that the misclassification costs in Λ^m are (relative to the annotation costs in Λ^a) very low, too low to justify any amount of manual annotation.

In general, if the costs in Λ^m are low and the costs in Λ^a are high, low values of τ_r and τ_p (sometimes as low as 0) will result, since manual annotation is discouraged. Conversely, if the costs in Λ^m are high and the costs in Λ^a are low, high values of τ_r and τ_p (sometimes as high as |D|) will result, and MINECORE will suggest manual annotation for all documents in D. In general, the higher (resp., lower) the ratio between the costs in Λ^m and those in Λ^a, the closer to FM (resp., FA) MINECORE is going to be performance-wise. MINECORE is especially advantageous with respect to both baselines when the cost structure justifies the notion that some (but not all) of the documents in D are worth annotating manually.

Figure 5.7 shows the percentage increase (with respect to MINECORE) in the overall cost C^o(D) resulting from the 6 baseline methods for each of the 120 class pairs according to the 3 different cost structures. Pairs are listed on the x axis by decreasing cost brought about by MINECORE.
For ease of comparison, all plots in Figure 5.7 are displayed across the range [-15%, +145%] on the y axis; in the FM plot (top right) this places the CostStructure2 and CostStructure3 curves, and most of the CostStructure1 curve, above the ceiling. Figure 5.7 thus extends the comparison shown in Table 5.4 to the full set of class pairs. As can be seen, all of the baselines generally incur substantially higher costs than MINECORE with CostStructure1; this difference is far smaller for CostStructure2 (and, as noted above, there is no difference between MINECORE and the baselines other than FM for CostStructure3).

Finally, Table 5.5 shows the median (across the 120 class pairs) overall cost obtained by each method with each of the cost structures of Table 5.2; each value in a row is the median of the results obtained on the 120 tested class pairs. Boldface indicates the best method, while † indicates a statistically significant (p < 0.01) increase in overall cost with respect to MINECORE, as determined by the Wilcoxon test.

Table 5.5: Results from all cost structures (C^o(D) in thousands of US$).

                 FA C^o(D)     ∆     FM C^o(D)      ∆     UR C^o(D)     ∆     RR C^o(D)     ∆     ALvUS C^o(D)    ∆     ALvRS C^o(D)    ∆     RM C^o(D)
CostStructure1       94      +29%†      248      +235%†      106      +47%†      107      +52%†       104       +47%†       108       +52%†       73
CostStructure2       24       +2%†      248      +893%†       26      +10%†       26      +11%†        25        +4%†        25        +7%†       24
CostStructure3       10       +0%       248     +2416%        10       +0%        10       +0%         10        +0%         10        +0%        10

For CostStructure2, MINECORE does better by this median measure than all of the baseline methods, but by smaller margins than are achieved for CostStructure1. For both of those cost structures, the costs generated by each baseline method are statistically significantly higher according to a Wilcoxon signed rank test for paired samples over the 120 class pairs, at p < 0.01. Concerning CostStructure3, similarly to what happened for the pair showcased in Table 5.4, MINECORE sets both τ_r and τ_p to 0 for all class pairs, making MINECORE and all the other methods (aside from FM) coincide with FA.
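The significance test reported in Table 5.5 can be reproduced along the following lines; the cost values below are synthetic stand-ins for the per-pair C^o(D) scores, and scipy.stats.wilcoxon implements the signed-rank test for paired samples.

```python
import numpy as np
from scipy.stats import wilcoxon

# A sketch of the paired significance test behind Table 5.5: compare, over the class
# pairs, the overall cost of a baseline against that of MINECORE (synthetic data).
rng = np.random.default_rng(1)
minecore_costs = rng.uniform(30_000, 150_000, size=120)            # hypothetical Co(D) per pair
baseline_costs = minecore_costs * rng.uniform(1.1, 1.9, size=120)  # baseline is costlier here

stat, p_value = wilcoxon(baseline_costs, minecore_costs, alternative="greater")
print(p_value < 0.01)   # True on this synthetic data: the increase is significant
```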
Incidentally, one cannot help noticing that the FM fully manual baseline is, by a very wide margin and according to all three cost structures, the worst of all systems. This is a further confirmation of a fact first noted in [36], and reasserts that technology-assisted review is nowadays unavoidable in e-discovery.

A first thing to observe is that, in MINECORE, a document can end up being manually annotated only for responsiveness, only for privilege, for both responsiveness and privilege, or for neither responsiveness nor privilege.

A second thing to observe is that Phases 2 and 3 are structurally identical, since Phase 2 does for responsiveness what Phase 3 does for privilege. One might thus wonder if we could switch their order without negatively impacting (or perhaps even positively impacting) C^o(D). The answer is no, and the reason lies in the fact that, in typical e-discovery scenarios, λ^a_p is higher or much higher than λ^a_r (we indeed imposed the constraint that λ^a_r < λ^a_p). As a consequence, it makes sense to employ the expensive (as characterized by λ^a_p) senior reviewers for annotating documents that the cheaper (as characterized by λ^a_r) junior reviewers have "pre-filtered".

A third observation concerns ranking. During Phase 2 MINECORE clearly separates the set (let us call it D^man_2) of the τ_r documents that should be annotated from the set (let us call it D^aut_2) of the (|D| − τ_r) documents that should not be annotated (the same happens at the end of Phase 3). If the human reviewer annotates all and only the former, one might wonder why ranking is useful at all. While ranking is indeed unnecessary in theory, it is useful in practice, for two reasons:

• The choice of which documents to put in D^man_2 and which to put in D^aut_2 is far from perfect, since it relies on automatically generated posterior probabilities. As a result, the human reviewer might find, at the very moment s/he is invited to stop annotating, that s/he is still encountering many mislabeled documents, and s/he might thus want to annotate some more documents in order to be on the safe side;

• If, for some reason, the reviewer stops annotating before the stopping condition is reached, the fact that s/he has annotated by following the ranked list guarantees that the cost-effectiveness of her work has been maximized.

As a result, we indeed assume that rankings are generated (and followed by the human reviewers) in both Phase 2 and Phase 3.

5.8 Summary

During e-discovery, the party performing the review may incur costs of two types: annotation costs (deriving from the fact that human reviewers need to be paid for their work) and misclassification costs (deriving from the fact that failing to correctly determine the responsiveness or privilege of a document may adversely affect the interests of the parties in various ways). Relying exclusively on automatic classification would minimize annotation costs but could result in substantial misclassification costs, while relying exclusively on manual classification could generate the opposite consequences. Thus, we develop a risk minimization framework, called MINECORE, that seeks to strike an optimal balance between the two. In our MINECORE model the documents are first automatically classified for both responsiveness and privilege. In the next step, some of the automatically classified documents are annotated by human reviewers for responsiveness (typically by junior reviewers) and for privilege (typically by senior reviewers), with the overall goal of minimizing the expected cost (i.e., the risk) of the entire process. Risk minimization is achieved by optimizing, for both responsiveness and privilege, the choice of which documents to manually review. We present a simulation study in which categories from a standard text classification test collection (RCV1-v2) are used to mimic responsiveness and privilege topics. Our findings indicate that MINECORE can yield a substantially lower total cost than any of the strong baselines we propose.

In our work, we have assumed that lawyers will be able to conceptualize unit annotation costs and unit misclassification costs in comparable units. Although this has proven to be useful, one important insight from our experience is that people often find it difficult to quantify uncertain costs using the same units in which they would express costs that will certainly be incurred. We have assumed for the purposes of our work that some model of costs and risks exists and can be formalized, but in practice the process of designing such models may not be as simple as asking an attorney to assign values to the elements in one of our cost structures. We have also assumed that both costs and risks accumulate linearly. We are confident that our framework will give lawyers more to discuss, since adopting our approach would mean that they would ultimately need to agree on both the cost structure and the way in which error probabilities are estimated.
Figure 5.7: Percentage increase in the overall cost over RM. Each panel corresponds to one baseline method (FA, FM, UR, RR, ALvUS, ALvRS); the x axis lists the class pairs, the y axis reports the increase in overall cost over RM, and each curve corresponds to one cost structure (CostStructure1, CostStructure2, CostStructure3).

Chapter 6: Conclusions

E-discovery practice is well suited to an interplay between humans and computerized models, both to identify which documents in a collection are responsive to a production request and to identify the documents that should be withheld on the basis of privilege. This dissertation can help inform the legal community that the adoption of predictive coding techniques is a good option in some litigation cases. Our research provides multiple contributions to current e-discovery practice: it offers multiple proofs of concept that encourage affordable e-discovery procedures by isolating and studying its key components.

Research in e-discovery has been hampered by the lack of publicly available test collections. The only test collection that is publicly accessible for e-discovery privilege classification was created during the TREC 2010 Legal Track. Before we start to design a classifier, as a first step we ask: is it possible to perform an unbiased classifier evaluation? We start by answering this question because designing any system without an evaluation plan does not provide any discernible value. Hence we first build and release a useful set of documents to enable unbiased classifier evaluation. To create an unbiased labeled set for evaluation purposes, we remove the selection bias (introduced by the sampling process during TREC 2010) by re-sampling from the biased document categories. We keep the sampling probability of our re-sampling process approximately the same as the rate used during the creation of the test collection. This process resulted in a total of 252 documents as our held-out test set with senior assessors' judgments, making it the gold standard for evaluation.

We then ask whether the TREC 2010 Legal Track test collection is reliable and reusable. Chapter 3 explains the issues in the context of the TREC Legal Track 2010 test collection creation process. Since pooling was widely adopted, we identify two types of errors in the collection: sampling errors and measurement errors. One way of understanding the measurement errors is by studying the classifier's sensitivity to assessor errors.
To do so, we utilize a subset of document families, and also the entire set of document families that were selected for re-assessment by senior attorneys, as test sets. We focus separately on estimates of recall and precision. The recall and precision values derived are point estimates, and are subject to random variation. Hence we also provide an indication of the expected range of variability around a point estimate, and account for it when comparing the two scores. We compute 95% confidence intervals to identify the range within which the point estimates would lie in the entire population (a simplified sketch of such a computation is given below). We plot the point estimates and the confidence intervals using the judgments from senior assessors. As the senior assessors' judgments sample is less than 8% of the size of the full set of official judgments, our results yield fairly large confidence intervals, but the comparison does offer useful insights. A standard way of performing analyses to assess the samples is through a system ablation study. We remove the results from a system that participated in the stratification process, re-score all systems (including the ablated system), and then observe the effect on system comparisons. Comparing the post-ablation to pre-ablation results, we see that confidence intervals for precision increase for the ablated system, which could be attributed to the difference in sampling probabilities of the strata. We conclude that assessor errors do adversely affect absolute estimates of recall, and we have suggested future work on statistical correction for the effects of those errors. For the task of identifying privileged documents, it is known that the recall measure is more important. Thus, this initial result is promising.

Now that we have an evaluation plan for a classifier, as our next step we proceed to design a classifier to identify privileged documents. We build multiple binary classifiers utilizing email content and metadata features. We further investigate the extent to which the remaining privilege judgments in the TREC Legal Track 2010 test collection, obtained from the human reviewers, are useful for training. As differences in reviewer expertise adversely affect the absolute estimates of recall, our research questions RQ3a and RQ3b aim to analyze the influence of annotator expertise and sample selection bias on classifier training. To study the effect of training the classifier on different sets of judgments depending on the annotators' expertise, we develop two classifiers: one trained on judgments from junior-level annotators and the other trained on judgments from senior assessors. We then compare the classifiers' recall and precision values with 95% confidence intervals. We observe a significant increase in the recall of the classifier trained on the document set with senior assessors' judgments. The problem of selection bias exists in the TREC Legal Track 2010 collection because the participating teams were allowed to challenge the judgments of the junior annotator (for details, refer to Chapter 2), leading to a chosen sample of documents being reviewed by the senior assessor. To study the effect of the bias caused by that chosen sample, we again build two classifiers: one trained on documents that were not chosen for adjudication and the other trained on documents that were chosen for adjudication.
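The sketch below illustrates the kind of confidence-interval comparison described above, applied to two classifiers trained under different conditions. It is a simplified illustration under assumptions of our own: the dissertation does not specify its exact interval method, so a plain normal-approximation (Wald) interval is used here, the helper recall_with_ci is hypothetical, and all counts are invented.

import math

def recall_with_ci(true_positives, false_negatives, z=1.96):
    """Return (recall, lower, upper) for a 95% normal-approximation interval."""
    n = true_positives + false_negatives  # number of truly privileged documents
    r = true_positives / n
    half_width = z * math.sqrt(r * (1.0 - r) / n)
    return r, max(0.0, r - half_width), min(1.0, r + half_width)

# Invented counts for two hypothetical training conditions.
conditions = {
    "trained on adjudicated judgments": (18, 12),
    "trained on non-adjudicated judgments": (24, 6),
}

for name, (tp, fn) in conditions.items():
    r, lo, hi = recall_with_ci(tp, fn)
    print(f"{name}: recall = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# Wide or overlapping intervals signal that a difference in point estimates
# may not be reliable.

With so few documents the intervals are wide, which mirrors the large confidence intervals reported above when only the small senior-assessor sample is available.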
By comparing the classifier results, we conclude that training classifiers on documents that were not chosen for adjudication yields good results. We explain the findings above by collectively analyzing the classifiers' privilege predictions on the unbiased test set.

After completing the task of building a binary classifier for identifying privileged documents, we reached out to some lawyers to understand how to present the results from a system. We wanted to learn which features help them perform the privilege review. We ask the research questions outlined in Chapter 1, section 1.2 as RQ4a, RQ4b and RQ4c. Our aim was to gain an understanding of how best to present the results from the classifier. As our first step, we chose to highlight the actors in the email communication. We presented three types of metadata information to the lawyers doing the review: an actor privilege importance score, which we call propensity; the actor's organization; and the actor's role in that organization. We developed an algorithm to score the importance of specific email addresses, with the goal of determining their propensity to engage in privileged communication. Both recursive and heuristic techniques are used to estimate the propensity score, ultimately resulting in coverage of 94% of the email addresses. Since litigation is time-sensitive, we also provided a graphical display of temporal patterns in privileged communication. The last type of information from the automation process presented to the lawyers during review was the importance of each term for identifying privilege. Only the top 10% of the important terms were highlighted to avoid clutter.

We categorized the findings from RQ4a, RQ4b and RQ4c as quantitative and qualitative results. The results measuring accuracy and speed are quantitative. The qualitative results are from our interviews and from our usability questionnaire. To draw some conclusions about the accuracy of the privilege review process, we first select an informative set of judgments as a benchmark against which review accuracy can be measured. From our analysis, either senior attorney could reasonably be chosen as a benchmark against which the other participants' judgments could be measured for accuracy. Using one of the senior attorneys' judgments as the benchmark, we conclude that there is a statistically significant improvement in recall. This improvement was noticed across all users except one (who was a novice user). This is a promising result. However, when we measured our system's performance for speed, a paired t-test found no detectable difference in average review speed across the two conditions. We attribute this to lawyers being more careful in reviewing a document when more information was provided to them. During this study, we also evaluated our research prototype for its usability. Responses to our usability questionnaire assigned a high rating to the overall review experience. The person-highlighting feature of the system was reported to be useful (to at least some degree) by five of the six participants, whereas term highlighting and the date graph were each reported to be useful to some degree by only two of the six participants.

By performing a user study with the lawyers, we acquired some important pieces of information. One of the main lessons we learned was that the users were open to adopting predictive coding techniques that help them perform the privilege review.
The second conclusion we drew was that there was no measurable change in the review time. Since time is proportional to money during privilege review, the final questions we answer in our dissertation are about the overall costs incurred across the entire e-discovery process. As one answer to the questions we raise in RQ5a and RQ5b, we develop a risk-based minimization framework. This framework is based on utility theory and relies on cost-sensitive ranking. We formalize our problem on the basis that costs and risks exist and can be characterized. We further assume that misclassification costs do not exist in isolation (e.g., for privilege only), but rather at a two-stage level (i.e., responsiveness and privilege); hence the two stages are best addressed jointly. Our semi-automated system assumes that a document might be produced to the requesting party even if it has not been manually reviewed to confirm that it is responsive and nonprivileged. When deciding which documents should be manually reviewed, we use our ranking algorithm to determine which document is expected to bring about the smallest cost when produced. Manual annotation time and effort are used sparingly to reduce the number of documents to be reviewed for responsiveness and privilege. A threshold-based stopping criterion is used to indicate when the reviewers should stop annotating.

We gathered inputs for our cost structure from three e-discovery experts. We develop an algorithm that utilizes the classifier results and the cost structure to determine which documents need manual review. Then we ask our final question: will our risk-minimization framework help save money for a given litigation? Chapter 5 discusses the methodology, experimentation and results of our risk-minimization framework, and introduces a new evaluation measure. Our conclusions are supported by experimentation. For the experiments we need a collection that has judgments for two classes (a responsive class and a privileged class). We need labeled documents that are: (1) responsive and privileged, (2) responsive and not-privileged, (3) not-responsive and privileged, and (4) not-responsive and not-privileged. We overcome the lack of such a test collection by running simulated experiments on an extensively labeled collection, unrelated to e-discovery, in which documents can belong to more than one class. We build two binary classifiers and combine their posterior probabilities with the values from the cost structure to determine which documents need manual annotation. We propose multiple effective baseline methods for comparison. Some of our baseline methods are human-in-the-loop systems, i.e., their predictions are obtained via some combination of manual annotation work and automatic classification. We run our simulations by picking 120 pairs of classes to play the roles of the responsive class and the privileged class. We obtain results for seven different methods for each of the 120 pairs of classes. Our models were tested on a collection of nearly 200,000 documents with three different cost structures as inputs.

From our findings it is clear that the cost structure has considerable influence on (1) how many documents get manually reviewed, both for responsiveness and for privilege, (2) the total costs incurred by the various methods, and (3) the difference in cost between the baseline methods and our semi-automated system.
Our results show that all of the baselines generally incur substantially higher costs when compared to our model. The empirical evidence, supported by statistical significance tests, shows that our semi-automated process systematically achieves a reduction in the overall cost of the e-discovery process for two out of the three litigation cases.

6.1 Contributions

This dissertation shows a positive synergy between lawyers and machines. Although this research is specific to the domain of e-discovery, the contributions below could be applied to any domain where relevant content is intermixed with sensitive information (such as personal and organizational emails, medical records, government records, etc.). The work done in this dissertation can be divided into three categories of contributions: System contributions (S), Practical contributions (P) and Research contributions (R). In addition to the research questions and answers, the System contributions highlight the frameworks built with a view to enabling other researchers to replicate and continue the work we started. The Practical contributions highlight the value of this dissertation work to the e-discovery industry, and the Research contributions highlight the domain-specific research advances. The contributions of this thesis include:

6.1.1 System Contributions

• S1 - Release of 252 unbiased families (a family is an email message along with its attachments) from the TREC Legal Track 2010 collection with domain-expert annotations for privilege that can be used as a held-out test set for evaluation. (Chapter 3, section 3.2)

• S2 - Development of multiple binary classifiers for predicting which families have privileged content. (Chapter 3, section 3.3)

• S3 - Development of an algorithm to score the importance of people (in a privileged context) in email communications. (Chapter 4, section 4.2.1)

• S4 - Development of a methodology to compute term importance utilizing word entropy. (Chapter 4, section 4.2.4)

6.1.2 Practical Contributions

• P1 - Development of a research prototype to enable lawyers to perform privilege review. (Chapter 4, section 4.2.6)

• P2 - Release of the code for a review application to enable lawyers to quantify e-discovery outcome errors in terms of US Dollars.

6.1.3 Research Contributions

• R1 - Representing e-discovery outcomes as a ternary classification problem. (Chapter 5, section 5.1)

• R2 - Introducing the idea of quantifying the different kinds of erroneous e-discovery outcomes in terms of US Dollars. (Chapter 5, section 5.1 and section 5.6.3)

• R3 - Developing a semi-automated process with a risk-based ranking algorithm to determine which documents deserve to be reviewed by a human. (Chapter 5, section 5.4 and section 5.4.1)

6.2 Limitations

A number of important points should be kept in mind when interpreting the experiments and results reported in this dissertation. In particular, we would like to highlight the following limitations:

1. We develop a classifier to predict privileged documents. We utilize the test collection created during the TREC 2010 Legal Track to train and evaluate our classifier. During evaluation we use the senior annotator's judgments on the 252 families in the test set as the gold standard. These 252 labeled families were a result of our re-sampling procedure explained in Chapter 3, section 3.2. We were limited to a total of 252 families due to the lack of randomly chosen families that had been judged by the senior assessor.
2. The TREC Legal Track 2010 collection lacked positive training examples, especially examples labeled by the senior attorneys. Our classifier was trained on a limited number of positively labeled examples.

3. This work takes a first step toward understanding users' needs by building an interactive user interface and performing a user study. We recruited users who had a law degree, due to the nature of the task. In the user study explained in Chapter 4, we were limited to only six users who were willing to participate.

4. In our user study discussed in Chapter 4, we were limited to only 61 labeled families when evaluating each user's accuracy, because only 61 families were reviewed by all the participants.

5. Our work in Chapter 5 assumes that human reviewers do not make mistakes, i.e., the judgment of our human reviewers always coincides with the ground truth.

6. In Chapter 5, the evaluation metric used to measure the overall cost is assumed to be a linear function.

7. Experiments in Chapter 5 use a test collection that is not topically related to e-discovery. This is due to the lack of a publicly available test collection of documents annotated for both responsiveness and privilege.

8. In Chapter 5, we quantify classifier errors in terms of cost values in US Dollars. We represent the misclassification cost values as a 3-by-3 contingency matrix with six non-zero positive values. In our work, we limit the structure of the input cost matrix to a 3-by-3 dimension.

9. Experiments in Chapter 5 were limited to only three input cost matrices.

10. We limited our experiments to category pairs such that the test-set prevalence of responsive documents is between 3% and 7% and the prevalence of privileged documents in the responsive set is between 1% and 20%.

6.3 Future Work

Our experience working on this thesis also suggests several directions for future work.

1. Our initial efforts in this dissertation focused on building a binary classifier for privileged email communications. More experiments need to be conducted to improve the accuracy of our classifier. In our work we build a classifier with an acceptable recall measure, and we stress that privilege review is a recall problem. As part of future work, the first thing we suggest is to improve the overall accuracy of privilege classifiers. We also suggest the use of more sophisticated features to build the classifier, for example a neural network architecture that, given a sentence, outputs a host of language processing predictions, such as part-of-speech tags, chunks, named entity tags, semantic roles, semantically similar words, and the likelihood that the sentence makes sense (grammatically and semantically) according to a language model. These kinds of features, which exploit the language itself, could have high potential for predicting privileged communications, especially those between attorney and client.

2. Our privilege review platform was designed around the idea that modeling attributes of people, such as their roles and their propensity to engage in privileged communication, might be particularly important for the privilege review task, and our results provide support for that belief. Our results also indicate that dates, while important for relevance review, may be of less value for privilege review (at least in the way we are doing things now). We have noted that the increases in recall that we observed were often accompanied by substantial declines in precision.
A further study will be needed to better characterize this effect and to control for it in future experiments.

3. During our user study, we noted no evidence of improvements in review speed, although of course even our most expert participants were novice users of the particular interface that we presented them with. In future work we may therefore consider longitudinal studies that would allow us to see how the same users behave at different points in their personal learning curve.

4. Our experience after the user study suggests running small-scale studies to tune specific components. As examples, we could ask: What types of multi-word expressions should be considered for highlighting? How many terms or multi-word expressions should be highlighted? How many categories of term highlighting are useful? Studies along those lines might ultimately lead to test collections that could be used as a basis for tuning and evaluating specific system components; for that we will also need to give thought to intrinsic measures for evaluating the performance of individual components.

5. Another productive research direction would be to explore whether we might productively replace expensive attorneys in some early studies. Would utilizing law students be suitable? Law librarians? Crowd-sourcing services such as Mechanical Turk? Surely we can go some distance in this direction; the key question is how far we can productively go without compromising accuracy.

6. Our classifiers use SVMs as the learning algorithm. As part of future work we suggest that researchers extend this work to other types of learning algorithms, such as Logistic Regression, Transductive SVMs, Random Forests, Gradient Boosting Machines, and Neural Networks, in place of the standard SVMs.

7. We propose modeling the cost function as a nonlinear function in place of the unit costs we currently use.

8. Our risk-minimization work in Chapter 5 assumes that manual reviewers do not make mistakes, i.e., the judgment of our human reviewers always coincides with the ground truth. In future work, we suggest experiments that would study the effect of reviewer errors.

9. Finally, we should note that this work could be extended to other settings where search amidst sensitive content is needed.

6.4 Implications

Today in e-discovery, automation for relevance review has been a topic of discussion. Whether predictive coding is employed during production remains a choice. Attorneys owe it to their clients to become familiar with this newer technology and to consider whether it should be used. It is likely that predictive coding will become more widely used in the near future as parties gain confidence in its accuracy; we have shown some preliminary evidence that it truly reduces costs in at least some litigation cases. As technology-assisted review tools are deployed and adopted, it is natural to expect larger cases to be tackled. With an increase in the number of relevant documents in a collection, automation of privilege review is going to be one predictable consequence. It therefore seems timely to begin to think seriously about how, and to what extent, the use of predictive coding systems could help the e-discovery process. As the volume of digital information grows every year, the need to adopt automation becomes more and more urgent. The question of how the costs of manual review can be controlled has become commonplace.
The most promising alternative available today for collections whose high prevalence would otherwise require a large-scale manual review process is the use of predictive coding and other computerized categorization strategies that can rank electronic documents using an algorithm that determines which documents are responsive and/or privileged. Manual review is still required during production. Empirical research suggests that predictive coding is at least as accurate as humans performing traditional review. Additionally, there is evidence that a significant number of manual review hours could be saved, depending on the nature of the documents and other factors, which would make predictive coding one answer to the critical need to significantly reduce review costs. It is certainly not the sole answer, and any cost savings may be negligible unless litigants first take a holistic approach. But, assuming that best practices have been followed throughout the e-discovery life cycle, the new techniques presented here may be one practical approach.

Our conclusions about one way to reduce overall production expenditures are shaped by the topic prevalence, algorithms and cost structures we included in our analysis. Tasks involving pre-processing of the collection could present a greater cost burden for the producing parties when volumes of digital data are huge. Conversely, computer applications for conducting review are unlikely to be economically viable options when dealing with smaller document sets, in which any savings in attorney hours might be overshadowed by machine-training costs. Our attempt is thus to encourage the legal community to make the choice that is the best option for the litigation. Our hope is that the work in this dissertation will help inform the e-discovery community about how to adapt review practices to address concerns about the costs of production.

Appendices

Appendix A: IRB Approval Letter

1204 Marie Mount Hall
College Park, MD 20742-5125
TEL 301.405.4212
FAX 301.314.1475
irb@umd.edu
www.umresearch.umd.edu/IRB

INSTITUTIONAL REVIEW BOARD

DATE: August 4, 2015
TO: Douglas Oard
FROM: University of Maryland College Park (UMCP) IRB
PROJECT TITLE: [784030-1] Development and Evaluation of Search Technology for Discovery of Evidence in Civil Litigation
REFERENCE #:
SUBMISSION TYPE: New Project
ACTION: APPROVED
APPROVAL DATE: August 4, 2015
EXPIRATION DATE: August 3, 2016
REVIEW TYPE: Expedited Review
REVIEW CATEGORY: Expedited review category # 7

Thank you for your submission of New Project materials for this project. The University of Maryland College Park (UMCP) IRB has APPROVED your submission. This approval is based on an appropriate risk/benefit ratio and a project design wherein the risks have been minimized. All research must be conducted in accordance with this approved submission. Prior to submission to the IRB Office, this project received scientific review from the departmental IRB Liaison. This submission has received Expedited Review based on the applicable federal regulations. This project has been determined to be a Minimal Risk project. Based on the risks, this project requires continuing review by this committee on an annual basis. Please use the appropriate forms for this procedure. Your documentation for continuing review must be received with sufficient time for review and continued approval before the expiration date of August 3, 2016.
Please remember that informed consent is a process beginning with a description of the project and insurance of participant understanding followed by a signed consent form. Informed consent must continue throughout the project via a dialogue between the researcher and research participant. Unless a consent waiver or alteration has been approved, Federal regulations require that each participant receives a copy of the consent document.

Please note that any revision to previously approved materials must be approved by this committee prior to initiation. Please use the appropriate revision forms for this procedure.

All UNANTICIPATED PROBLEMS involving risks to subjects or others (UPIRSOs) and SERIOUS and UNEXPECTED adverse events must be reported promptly to this office. Please use the appropriate reporting forms for this procedure. All FDA and sponsor reporting requirements should also be followed.

All NON-COMPLIANCE issues or COMPLAINTS regarding this project must be reported promptly to this office.

Please note that all research records must be retained for a minimum of seven years after the completion of the project.

If you have any questions, please contact the IRB Office at 301-405-4212 or irb@umd.edu. Please include your project title and reference number in all correspondence with this committee.

This letter has been electronically signed in accordance with all applicable regulations, and a copy is retained within University of Maryland College Park (UMCP) IRB's records.

Appendix B: Questionnaire for Gathering Cost Values

Questionnaire on Annotation Costs and Misclassification Costs in e-Discovery
Douglas W. Oard, Fabrizio Sebastiani, Jyothi K. Vinjumur

Premises:
• The questionnaire contains 8 questions. You should answer them all, since partially filled questionnaires will be much less useful to us.
• Each question except the first has 3 possible answers. Pick your choice by ticking one (and only one) of the 3 tick boxes. If your answer is either the 1st or the 2nd, you should also fill the additional box with a number higher than 1.
• You may make any assumption about the type of case and the amounts at stake in the case, but make the same assumptions for every question.
• We heartily thank you for your effort; your contribution is of critical importance to our research in automating the e-discovery process.

Assumptions:
• Documents that are responsive and nonprivileged are produced (P) to the requesting party;
• Documents that are responsive and privileged are reported on the privilege log (L) and not produced;
• Documents that are nonresponsive are withheld (W) by the producing party (i.e., they are not produced);
• "Mistake X is Z times more serious than mistake Y" can be interpreted as "The overall cost that the producing party incurs by making many mistakes of type X is Z times higher than the cost the same party would incur by making as many mistakes of type Y".
Question # 1
Which of the following best describes your background:
Senior attorney who has supervised e-discovery reviews
Attorney who has participated in e-discovery reviews as a reviewer
Other highly knowledgeable e-discovery expert
Other attorney
Other (please describe):

Question # 2
Consider two types of mistakes:
LP
Situation: Document is responsive and nonprivileged (it should thus be produced)
Mistake: Document is erroneously reported on the privilege log and not produced
PL
Situation: Document is responsive and privileged (it should thus be reported on the privilege log and not produced)
Mistake: Document is erroneously produced
Is mistake LP more serious than mistake PL?
Yes, mistake LP is ___ times more serious than mistake PL.
No, mistake PL is ___ times more serious than mistake LP.
They are equally serious.

Question # 3
Consider two types of mistakes:
LW
Situation: Document is nonresponsive (it should thus be withheld)
Mistake: Document is erroneously reported on the privilege log (and not produced)
WL
Situation: Document is responsive and privileged (it should thus be reported on the privilege log and not produced)
Mistake: Document is erroneously deemed nonresponsive (and thus withheld)
Is mistake LW more serious than mistake WL?
Yes, mistake LW is ___ times more serious than mistake WL.
No, mistake WL is ___ times more serious than mistake LW.
They are equally serious.

Question # 4
Consider two types of mistakes:
WP
Situation: Document is responsive and nonprivileged (it should thus be produced)
Mistake: Document is erroneously considered nonresponsive (and thus withheld)
PW
Situation: Document is nonresponsive (it should thus be withheld)
Mistake: Document is erroneously produced
Is mistake WP more serious than mistake PW?
Yes, mistake WP is ___ times more serious than mistake PW.
No, mistake PW is ___ times more serious than mistake WP.
They are equally serious.

Question # 5
Consider two types of mistakes:
LP
Situation: Document is responsive and nonprivileged (it should thus be produced)
Mistake: Document is erroneously reported on the privilege log and not produced
LW
Situation: Document is nonresponsive (it should thus be withheld)
Mistake: Document is erroneously reported on the privilege log (and not produced)
Is mistake LP more serious than mistake LW?
Yes, mistake LP is ___ times more serious than mistake LW.
No, mistake LW is ___ times more serious than mistake LP.
They are equally serious.

Question # 6
Consider two types of mistakes:
LW
Situation: Document is nonresponsive (it should thus be withheld)
Mistake: Document is erroneously reported on the privilege log (and not produced)
WP
Situation: Document is responsive and nonprivileged (it should thus be produced)
Mistake: Document is erroneously considered nonresponsive (and thus withheld)
Is mistake LW more serious than mistake WP?
Yes, mistake LW is ___ times more serious than mistake WP.
No, mistake WP is ___ times more serious than mistake LW.
They are equally serious.

Question # 7
Consider the following type of mistake:
WP
Situation: Document is responsive and nonprivileged (it should thus be produced)
Mistake: Document is erroneously considered nonresponsive (and thus withheld)
Is the cost of annotating a document for responsiveness higher than the cost brought about by a mistake of type WP?
Yes, the cost of annotating a document for responsiveness is ___ times higher than the cost brought about by a mistake of type WP.
No, the cost brought about by a mistake of type WP is ___ times higher than the cost of annotating a document for responsiveness.
The two costs are equal.

Question # 8
Is the cost of annotating a document for responsiveness higher than the cost of annotating a document for privilege?
Yes, the cost of annotating a document for responsiveness is ___ times higher than the cost of annotating a document for privilege.
No, the cost of annotating a document for privilege is ___ times higher than the cost of annotating a document for responsiveness.
The two costs are equal.

If you wish you may add your name and contact (and possible additional comments) here:
Name:
Contact:
Comments: