ABSTRACT

Title of proposal: Gathering Language Data Using Experts
Denis Peskov, 2022

Dissertation directed by: Professor Jordan Boyd-Graber
Department of Computer Science
College of Information Studies
Language Science Center
Institute for Advanced Computer Studies

Natural language processing needs substantial data to make robust predictions. Automatic methods, unspecialized crowds, and domain experts can be used to collect conversational and question answering nlp datasets. A hybrid solution that combines domain experts with the crowd generates large-scale, free-form language data.

A low-cost, high-output approach to data creation is automation. We create and analyze a large-scale audio question answering dataset through text-to-speech technology. Additionally, we create synthetic data from templates to identify limitations in machine translation. We conclude that the cost savings and scalability of automation come at the cost of data quality and naturalness.

Human input can provide this degree of naturalness, but is limited in scale. Hence, large-scale data collection is frequently done through crowd-sourcing. A question-rewriting task, in which a long information-gathering conversation is used as source material for many stand-alone questions, shows the limitations of this methodology for generating data. If left unsupervised, certain users provide low-quality rewrites: removing words from the question, or copying and pasting the answer into the question. We automatically prevent unsatisfactory submissions with an interface, but the quality control process still requires manually reviewing 5,000 questions.

Therefore, we posit that using domain experts for data generation can create novel and reliable nlp datasets. First, we introduce computational adaptation, which adapts, rather than translates, entities across cultures. We work with native speakers in two countries to generate the data, since the gold label for this task is subjective and paramount. Furthermore, we hire professional translators to assess our data. Last, in a study on the game of Diplomacy, community members generate a corpus of 17,000 messages that are self-annotated while playing a game about trust and deception. The language is varied in length, tone, vocabulary, punctuation, and even emojis. Additionally, we create a real-time self-annotation system that annotates deception in a manner not possible through crowd-sourced or automatic methods. The extra effort in data collection will hopefully ensure the longevity of these datasets and galvanize other novel nlp ideas.

However, experts are expensive and limited in number. Hybrid solutions pair potentially unreliable and unverified users in the crowd with experts. We work with Amazon customer service agents to generate and annotate 81,000 goal-oriented conversations across six domains. Grounding the conversation with a reliable conversationalist (the Amazon agent) creates free-form conversations; using the crowd scales these to the size needed for neural networks.
Gathering Natural Language Processing Data Using Experts

by
Denis Peskov

Dissertation submitted to the Faculty of the Graduate School of the
University of Maryland, College Park in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
2022

Advisory Committee:
Professor Jordan Boyd-Graber, Chair
Professor Philip Resnik, Dean's Representative
Professor Michelle Mazurek
Professor Katie Shilton
Professor John Dickerson

© Copyright by Denis Peskov 2022

Dedication

The spirit is willing, but the flesh is weak. (Russian: . . . )
The vodka is strong, but the meat is rotten. -Georgetown-IBM, 1950
The spirit is willing, but the flesh is weak. -Google Translate, 2021
The spirit desires, but the flesh is weak. -Yandex Translate, 2021
The spirit is willing, but the flesh is weak. -DeepL Translate, 2021

If as one people speaking the same language they have begun to do this, then nothing they plan to do will be impossible for them.

Acknowledgments

Thank you to my advisor Jordan Boyd-Graber; my committee members Philip Resnik, Michelle Mazurek, Katie Shilton, and John Dickerson; Joe Barrow and peers from the University of Maryland clip lab; Alex Fraser, Hinrich Schütze, and others from Ludwig-Maximilians-Universität München; Benny Cheng, Sander Schulhoff, and other students I've supervised; arlis, the daad, Amazon Research, 3M, Raytheon bbn, and other sources of funding or collaboration; Tom Hurst and the Computer Science Department; the Language Science Center; Tim Beach, Terrence Reynolds, and other undergraduate professors that inspired me; my College Park roommate Kodjo Aflagah; my family; and my friends. I could not have done this without you.

Table of Contents

Dedication
Acknowledgements
List of Tables
List of Figures

1 The Case for Upfront Investment in Data
  1.1 Defining Data: Annotation and Generation
  1.2 Quantity over Quality as a Paradigm
  1.3 The Nuance of Using Text as Data
  1.4 Data Quality as a New Paradigm

2 Natural Language Processing Depends on Data
  2.1 Tasks
    2.1.1 What is a Task?
    2.1.2 Question Answering
    2.1.3 Dialog
  2.2 Data Collection Type
    2.2.1 Finding
    2.2.2 Automation
    2.2.3 Crowd-Sourcing
    2.2.4 Expert
    2.2.5 Hybrid
  2.3 Models & Metrics
    2.3.1 Logistic Regression
    2.3.2 Neural Models
    2.3.3 Deep Averaging Network
    2.3.4 Sequence Models
    2.3.5 Evaluation
3 Automatic Data Generation from a Found Source
  3.1 Automated Data Creation for Question Answering
  3.2 Automatically Generating a Speech Dataset
    3.2.1 Why Question Answering is challenging for asr
  3.3 Mitigating Noise
    3.3.1 ir Baseline
    3.3.2 Forced Decoding
    3.3.3 Confidence Augmented dan
  3.4 Results
    3.4.1 Qualitative Analysis & Human Data
  3.5 Confidence in Data Quality
    3.5.1 Can Question Answering Audio be Automated?
  3.6 Implications of Automation

4 Automatic Data Generation without a Source
  4.1 Evaluating Data
  4.2 Meaningful Model Evaluation in Machine Translation
  4.3 Why is Coreference Resolution Relevant?
  4.4 Do Androids Dream of Coreference Translation Pipelines?
  4.5 Model
  4.6 ContraPro: Adversarial Attacks on an Adversarial Dataset
    4.6.1 About ContraPro
    4.6.2 Adversarial Attack Generation
      4.6.2.1 Phrase Addition
      4.6.2.2 Possessive Extension
      4.6.2.3 Synonym Replacement
    4.6.3 Quality Assessment of the Automatic Attacks by an Expert
    4.6.4 Evaluating Adversarial Attacks
  4.7 Contracat: A Fine-Grained Adversarial Dataset
    4.7.1 Template Generation
    4.7.2 Priors
    4.7.3 Markable Detection with a Humanness Filter
    4.7.4 Coreference Resolution
    4.7.5 Translation to German
    4.7.6 Results
  4.8 Augmentation
    4.8.1 Augmentation Improves Coreference Accuracy
    4.8.2 ContraPro Results
    4.8.3 Contracat Results
  4.9 Our Dataset in Context
  4.10 Implications for Machine Translation and Automation

5 Crowd-Sourced Generation
  5.1 Dataset Construction
  5.2 Dataset Analysis
    5.2.1 Anaphora Resolution and Coreference
  5.3 Conclusion

6 Expert Annotation and Evaluation
  6.1 When Translation Misses the Mark
  6.2 Wer ist Bill Gates?
    6.2.1 . . . and why Bill Gates?
  6.3 Adaptation from a Knowledge Base
  6.4 An Alternate Embedding Approach
  6.5 Comparing Automation to Human Judgment
    6.5.1 Adaptation by Locals
    6.5.2 Are the Adaptations Plausible?
    6.5.3 Why Adaptation is Difficult
    6.5.4 Qualitative Analysis
  6.6 Generating New Questions
    6.6.1 Adaptation is not Trivial
  6.7 A New Computational Task

7 Expert Generation
  7.1 Where Does One Find Long-Term Deception?
  7.2 Diplomacy
    7.2.1 A game walk-through
    7.2.2 Defining a lie
    7.2.3 Annotating truthfulness
  7.3 Broader Applicability
  7.4 Engaging a Community of Liars
    7.4.1 Seamless Diplomacy Data Generation
    7.4.2 Building a player base
    7.4.3 Data overview
    7.4.4 Demographics and self-assessment
    7.4.5 An ontology of deception
  7.5 Detecting Lies
    7.5.1 Metric and data splits
    7.5.2 Logistic regression
    7.5.3 Neural
  7.6 Qualitative Analysis
  7.7 Related Work

8 Quantity and (Mostly) Quality Through Hybridization
  8.1 The Goal of Creating Goal-Oriented Dialog
  8.2 Existing Dialog Datasets
  8.3 MultiDoGO Dataset Generation
    8.3.1 Defining Dialog
    8.3.2 Data Collection Procedure
  8.4 Data Annotation
    8.4.1 Annotated Dialog Tasks
    8.4.2 Annotation Design Decisions
    8.4.3 Quality Control
    8.4.4 Dataset Characterization and Statistics
  8.5 Dialog Classification Baselines
    8.5.1 Results
  8.6 Conclusion

9 Conclusions on Natural Language Processing Data
  9.1 Hybridization of Diplomacy: Diplomacy2.0
    9.1.1 Data for Communication
    9.1.2 Data for Action
    9.1.3 Evaluation Through Human Studies
  9.2 Understanding Organizations with Economic and Legal Experts
    9.2.1 The World Trade Organization
    9.2.2 The Federal Reserve Board
  9.3 Creating Timeless Natural Language Processing Datasets

A Adaptation
  A.1 Wikipedia Q&A
  A.2 Data

B Diplomacy
  B.1 Further Details
  B.2 A Full Game Example

C MultiDoGO
  C.1 Conversational Biases
  C.2 Agent Dialogue Acts Schema
  C.3 Customer Intent Classes Schema
  C.4 Slot Labels

List of Tables

1.1 A tabular summary of our projects. Our thesis is organized in increasing order of data source complexity.
2.1 Three questions from trec 2000 data that are believably varied. The test questions were carefully crafted by experts.
2.2 The paper examples from squad. Unlike Table 2.1, these questions are generated through crowd-sourcing and Wikipedia and are not carefully planned.
2.3 A tabular summary of question answering datasets. The datasets described as hybrid all scrape or use naturally-occurring language and then supplement it with crowd-sourced annotation.
2.4 A tabular summary of key dialog datasets.
2.5 In contrast to the previous conversations involving crowd workers, conversations involving experts generate creative, and even humorous, language. Additionally, the annotation of truthfulness is not possible with crowd-sourcing, since it requires the generator's real-time knowledge. This conversation snippet is from the Diplomacy project (Chapter 7).
3.1 As original data are translated through asr, it degrades in quality. One-best output captures per-word confidence. Full lattices provide additional words and phone data captures the raw asr sounds. Our confidence model and forced decoding approach could be used for such data in future work.
3.2 Both forced decoding (fd) and the best confidence model improve accuracy. Jeopardy only has an At-End-of-Sentence metric, as questions are one sentence in length. Combining the two methods leads to a further joint improvement in certain cases. ir and dan models trained and evaluated on clean data are provided as a reference point for the asr data.
3.3 Variation in different speakers causes different transcriptions of a question on Oxford. The omission or corruption of certain named entities leads to different answer predictions, which are indicated with an arrow.
4.1 A hypothetical cr pipeline that sequentially resolves and translates a pronoun.
4.2 Template examples targeting different cr steps and substeps. For German, we create three versions with er, sie, or es as different translations of it.
4.3 Examples of training data augmentations. The source side of the augmented examples remains the same.
5.1 Manual inspection of 50 rewritten context-independent questions from canard suggests that the new questions have enough context to be independently understandable.
5.2 Not all rewrites correctly encode the context required to answer a question. We take two failures to provide examples of the two common issues: Changed Meaning (top) and Needs Context (middle). We provide an example with no issues (bottom) for comparison.
5.3 An example that had over ten flagged proper nouns in the history. Rewriting requires resolving challenging coreferences.
6.1 WikiData and unsupervised embeddings (3CosAdd) generate adaptations of an entity, such as Bill Gates. Human adaptations are gathered for evaluation. American and German entities are color coded.
6.2 If we consider human adaptations as correct, where do they land in the ranking of automatic adaptation candidates? In this recall-oriented approach, learned mappings (which use a small number of training pairs) rate highest.
6.3 A hypothetical qa pipeline that adapts a question.
7.1 An annotated conversation between Italy (white) and Germany (gray) at a moment when their relationship breaks down. Each message is annotated by the sender (and receiver) with its intended or perceived truthfulness; Italy is lying about . . . lying.
7.2 Summary statistics for our train data (nine of twelve games). Messages are long and only five percent are lies, creating a class imbalance.
7.3 Examples of messages that were intended to be truthful or deceptive by the sender or receiver. Most messages occur in the top left quadrant (Straightforward). Figure 7.4 shows the full distribution. Both the intended and perceived properties of lies are of interest in our study.
7.4 An example of an actual lie detected (or not) by both players and our best computational model (Context lstm + Power) from each quadrant. Both the model and the human recipient are mostly correct overall (Both Correct), but they are both mostly wrong when it comes to specifically predicting lies (Both Wrong).
7.5 Conditioning on only lies, most messages are now identified incorrectly by both our best model (Context lstm + Power) and players.
8.1 A segment of a dialog from the airline domain annotated at the turn level. This data is annotated with agent dialog acts (DA), customer intent classes (IC), and slot labels (SL). Roles C and A stand for "Customer" and "Agent".
8.2 Inter Source Annotation Agreement (isaa) scores quantifying the agreement of crowd sourced and professional annotations.
8.3 Total number of conversations per domain: raw conversations Elicited; Good/Excellent is the total number of conversations rated as such by the agent annotators; (IC/SL) is the number of conversations annotated for Intent Classes and Slot Labels only; (DA/IC/SL) is the total number of conversations annotated for Dialog Acts, Intent Classes, and Slot Labels.
8.4 Number of conversations per domain collected with specific biases. Fast Food had the maximum number of biases. MultiIntent and SlotChange are the most used biases.
8.5 MultiDoGO is several times larger in nearly every dimension than the pertinent datasets as selected by Budzianowski et al. (2018). We provide counts for the training data, except for frames, which does not have splits. Our number of unique tokens and slots can be attributed to us not relying on carrier phrases.
8.6 Data statistics by domain. Conversation length is in average (median) number of turns per conversation. Inter-annotator agreement (iaa) is measured with Fleiss' kappa for the three annotation tasks: Agent da (da), Customer ic (ic), and Slot Labeling (sl).
8.7 Inter-annotator agreement (iaa) is measured with Fleiss' kappa for the three annotation tasks: Agent DA (DA), Customer IC (IC), and Slot Labeling (SL).
8.8 Dialog act (da), intent class (ic), and slot labeling (sl) F1 scores by domain for the majority class, lstm, and elmo baselines on data annotated at the sentence (S) and turn (T) level. Bold text denotes the model architecture with the best performance for a given annotation granularity, i.e., sentence or turn level. Red highlight denotes the model with the best performance on a given task across annotation granularities.
8.9 Joint training of ELMo on all agent DA data leads to a slight increase in test performance. However, we expect stronger joint models that use transfer learning should see a larger improvement. Bold text denotes the training strategy, i.e., single domain (Base) or multi-domain (Joint), with the best performance for a given annotation granularity. Red highlight denotes the strategy with the highest DA F1 score across annotation granularities.
A.1 Veale NOC German-to-American adaptations.
A.2 Veale NOC American-to-German adaptations.
A.3 Top Wikipedia German-to-American adaptations.
A.4 Top Wikipedia American-to-German adaptations.
A.5 We show top-5 predictions out of the top-100 for American-to-German adaptations on the Veale NOC subset using WikiData. These are compared to our human annotations in our results.
A.6 We show top-5 predictions out of the top-100 for American-to-German adaptations on the Veale NOC subset using 3CosAdd. These are compared to our human annotations in our results.
A.7 We show top-5 predictions out of the top-100 for American-to-German adaptations on the Veale NOC subset with our Learned Adaptation approach. These are compared to our human annotations in our results.
B.1 Users optionally provide free response descriptions of the game. This can be used for qualitative analysis or potentially for algorithmic summarization.
B.2 Examples of persuasion from the games annotated with tactics from Cialdini and Goldstein (2004).
B.3 The word lists used for our Harbingers (Niculae et al., 2015) logistic regression models.
B.4 This is a full game transcript of a game between Germany and Italy. Occasional messages that did not receive a Suspected Lie annotation by the receiver are annotated as None.
C.1 Conversational biases
C.2 Agent dialogue act schema
C.3 Customer intent class schema, by domain
C.4 Customer slot label schema, by domain

List of Figures

2.1 Deng et al. (2009) popularizes Mechanical Turk use for Computer Science. Simple annotation tasks can be completed reliably with crowd-sourcing since selecting if an image belongs to a WordNet category (e.g., car, bicycle, delta) is a relatively objective and straightforward task, despite the occasional gray area (i.e., the image contains both a car and a bicycle or the bicycle is in a cubist painting). However, many nlp tasks are not so clear-cut, as a different contemporary study showed (Snow et al., 2008).
2.2 Crowd-sourcing can also be used to generate large-scale nlp data. However, generation creates a quality issue not present in annotation. In this particular example, Choi et al. (2018) highlight that the teacher does not provide quality responses. However, the student's conversation is quite unnatural and has grammatical issues.
2.3 Hybrid approaches try to control the quality of language generated by the crowd. MultiWoz (Budzianowski et al., 2018) creates a rigid template for the user conversation, avoiding the worst quality issues at the expense of user creativity.
3.1 asr errors on qa data: original spoken words (top of box) are garbled (bottom). While many words turn into "noise" (frequent words or the unknown token), consistent errors (e.g., "clarendon" to "clarintin") can help downstream systems. Additionally, words reduced to the unknown token (e.g., "kermit") can be useful through forced decoding into the closest incorrect word (e.g., "hermit" or even "car").
4.1 The concat model predicts a lower percentage of coreferences correctly when faced with our three adversarial ContraPro attacks. "Attacks concat" shows the drop that our adversarial templates have on "ContraPro concat". Phrase: prepending "it is true: . . . ". Possessive: replacing original antecedent A with "Maria's A". Synonym: replacing the original antecedent with different-gender synonyms.
4.2 Results comparing the sentence-level baseline to concat on Contracat. Pronoun translation pertaining to World Knowledge and language-specific Gender Knowledge benefits the most from additional context.
4.3 Results comparing unaugmented and augmented concat on ContraPro and the same 3 attacks as in Figure 4.1. Results with non-augmented concat are the same as Figure 4.1.
4.4 ContraCAT results with unaugmented and augmented concat. We speculate that readjusting the prior over genders in augmented concat explains the improvements on Markables and Overlap.
5.1 Our question-in-context rewriting task. The input to each step is a question to rewrite given the dialog history, which consists of the dialog utterances (questions and answers) produced before the given question is asked. The output is an equivalent, context-independent paraphrase of the input question. Crowd-workers are needed to provide these missing details as the omissions are non-formulaic.
5.2 The interface for our task guides workers in real-time.
5.3 Human rewrites are longer, have fewer pronouns, and have more proper nouns than the original quac questions. Rewrites are longer and contain more proper nouns than our Pronoun Sub baseline and trained Seq2Seq model.
6.1 Our interface provides users with information about the entity and asks them to select an option from possible Wikipedia pages.
6.2 Our Qualtrics survey.
6.3 We validate adaptation strategies with expert translators on a five-point Likert scale. The human-generated adaptations are rated best, between "related" (3) and "similar" (4). These human adaptations become the reference for evaluation in Table 6.2.
7.1 Counts from one game featuring an Italy (green) adept at lying but who does not fall for others' lies. The player's successful lies allow them to gain an advantage in points over the duration of the game. In 1906, Italy lies to England before breaking their relationship. In 1907, Italy lies to everybody else about wanting to agree to a draw, leading to the large spike in successful lies.
7.2 Every time they send a message, players say whether the message is truthful or intended to deceive. The receiver then labels whether incoming messages are a lie or not. Here Italy indicates they believe a message from England is truthful but that their reply is not.
7.3 Individual messages can be quite long, wrapping deception in pleasantries and obfuscation.
7.4 Most messages are truthful messages identified as the truth. Lies are often not caught. Table 7.3 provides an example from each quadrant.
7.5 Test set results for both our actual lie and suspected lie tasks. We provide baseline (Random, Majority Class), logistic (language features, bag of words), and neural (combinations of a lstm with bert) models. The neural model that integrates past messages and power dynamics approaches human F1 for actual lie (top). For actual lie, the human baseline is how often the receiver correctly detects senders' lies.
The suspected lie lacks such a baseline.
8.1 Crowd sourced annotators select an intent and choose a slot in our custom-built Mechanical Turk interface. Entire conversations are provided for reference. Detailed instructions are provided to users, but are not included in this figure. Options are unique per domain.
8.2 Agents are provided with explicit fulfillment instructions. These are quick-reference instructions for the Finance domain. Agents serve as one level of quality control by evaluating a conversation between Excellent and Unusable.
B.1 The board game as implemented by Backstabbr. Players place moves on the board and the interface is scraped.

Chapter 1: The Case for Upfront Investment in Data

Computation can solve tasks across multiple areas of scientific inquiry: natural language processing, computer vision, biology. Solving tasks for each of these domains (translating a sentence between languages, distinguishing a cat from a dog, classifying a mutation) has two abstract and intertwined dependencies: model building and data collection.[1] The relationship is intertwined since today's models are optimized to draw statistical conclusions from significant amounts of data through machine learning. But even the most cutting-edge modeling techniques are heavily dependent on having realistic and accurate data for solving a task. These large datasets are primarily gathered from online repositories or created through low-cost crowd-sourcing (Deng et al., 2009; Rajpurkar et al., 2016; Budzianowski et al., 2018), and are often artificial or inaccurate. We argue that high-quality, expert-reliant data collection can lead to long-term improvements in Natural Language Processing (nlp) and enable complex, novel tasks.

[1] Mitchell (1997) defines a machine learning model as: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E". This E depends on data collection.

1.1 Defining Data: Annotation and Generation

In the overview, we discuss the two tasks necessary for data collection and explain the importance of data quality for computer science as a field.

Data creation can be broadly divided into two categories: generation and annotation. We define generation as the creation of a data item that is not previously found elsewhere (Section 2.2.1): sequencing a genome, creating a new image, gathering a new sentence from a user, or automatically creating a sentence (Atkins et al., 1992; Goodfellow et al., 2014; Zhu et al., 2018). We define annotation as the application of a label to an existing data item (e.g., classifying a part of the genome, labeling an image as a cat, or describing the sentiment of a sentence) (Deng et al., 2009; Finin et al., 2010; Kozomara and Griffiths-Jones, 2014). In many fields, data must be both generated to be representative of the task and then accurately annotated to be effective.

1.2 Quantity over Quality as a Paradigm

The demand of neural models for quantity has caused models to be trained on large, noisy data (Brown et al., 2020). The building blocks of other research areas, gene sequences in biology and individual pixels in computer vision, are not readily human-interpretable by default. Even in more human-intuitive fields, like natural language processing, data have reached the scale where their veracity (the certainty and completeness of the data) cannot be assumed (Qiu et al., 2016), despite the early assertions by Atkins et al. (1992). They posit that "there is in fact little danger of obfuscation for the major parameters that characterize a corpus: its size (in numbers of running words), and gross characterizations of its content."[2] However, the objectivity of size is questionable; a corpus consisting of the same word repeated a million times clearly differs from one with a million unique words. Yet size remains a primary consideration.

[2] Additionally, they crucially comment that the evaluation of corpora has not been standardized.
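The contrast between raw size and content can be made concrete with a toy calculation. The sketch below is a hypothetical illustration (the corpora and counts are invented for this example, not drawn from any dataset in this thesis): two "corpora" with identical token counts can have wildly different vocabularies.

```python
from collections import Counter

def corpus_stats(tokens):
    """Return token count, type (unique word) count, and type-token ratio."""
    types = Counter(tokens)
    return len(tokens), len(types), len(types) / len(tokens)

# Two toy corpora with the same "size" in running words.
repetitive = ["data"] * 1_000_000                # one word, repeated a million times
varied = [f"word{i}" for i in range(1_000_000)]  # every token unique

print(corpus_stats(repetitive))  # (1000000, 1, 1e-06)
print(corpus_stats(varied))      # (1000000, 1000000, 1.0)
```

Both toy corpora contain a million running words, yet only the second could plausibly support a language model; size alone says little about content.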
This focus on quantitative evaluation metrics has shaped nlp data creation during the past decade (Rodriguez et al., 2021). A dataset paper will comment on the number of words, sentences, questions, etc., but with no assessment of their quality. But the sheer quantity of data masks biases and artifacts, as they are no longer obvious to the naked human eye (Pruim et al., 2015; Gururangan et al., 2018; Gor et al., 2021a). Since current approaches to machine learning often obscure how decisions are made by a model, the quality of the data is not immediately questioned as a culprit when a false prediction is made.

The current paradigm of crowd-sourcing, "the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call" (Howe et al., 2006), for dataset creation has been the main source of unreliability in data. Specifically, natural language processing has generally depended on low-cost crowds following an exploration of crowd-sourcing for five nlp tasks (Snow et al., 2008) and ImageNet (Deng et al., 2009). However, ImageNet's entirely crowd-sourced annotations still have notable problems after a decade of updates (Yang et al., 2020) and should serve as a cautionary tale. Incorrect annotations of a cat in place of a dog may be a trivial mistake in a stand-alone context, but other unchecked errors, intentional or even hostile, can percolate into significant real-world problems: racist suggestions and hate speech chatbots (Mac, 2021; Hunt, 2016). Hence, some tasks benefit from large-scale data, some require high-quality data, and some require both quality and quantity. A re-prioritization toward working with users that have a reputation incentive to generate realistic and reliable data would benefit tasks in the latter two categories.

1.3 The Nuance of Using Text as Data

We introduce the Natural Language Processing tasks covered in our work, challenges faced in nlp due to trade-offs of annotation speed and quality, and the distinction between generation and annotation.

A large focus of nlp is on building models that exploit patterns in language data to solve a variety of tasks: question answering, conversational agents, machine translation, information extraction, etc. Our research builds three models (logistic regression, Section 2.3.1; deep averaging networks, Section 2.3.3; and long short-term memory networks, Section 2.3.4) for two tasks: question answering (Section 2.1.2) and dialog (Section 2.1.3). However, in the current paradigm of machine learning, models answer questions or predict dialog acts based on existing training data. This makes realistic data a prerequisite for any model that aims to realistically solve a language task.

But the prevalence of neural models in nlp has prioritized data size over realism. Chapter 2 describes the relationship between data and natural language processing tasks. At the extreme end, gpt-3 is trained on 499 billion tokens, de facto training a neural model based on the entire Internet (Brown et al., 2020). However, not everything on the Internet is relevant or accurate! This is significant since training data containing low-quality examples unsurprisingly leads to models learning controversial or false conclusions, with high levels of confidence (Wolf et al., 2017; Wallace et al., 2019a). Therefore, missing or false data in the data generation process undermines the ability of nlp to realistically solve language tasks.

Furthermore, many tasks in nlp depend on accurate annotation of the raw data. As a thought experiment, if all verbs are labeled as nouns and all nouns are labeled as verbs in the training data, a perfectly designed language model would be confidently wrong in its predictions. Crowd-sourcing with generalists (Buhrmester et al., 2011) assumes that enough unspecialized workers will answer a question correctly. This is a valid assumption for unambiguous, multiple-choice annotation with a large pool of annotators. However, many annotation tasks, such as span annotation or candidate selection, have so many parameters that they are akin to language generation, which cannot be easily verified through iaa (Karpinska et al., 2021). The annotation can be error-prone due to the number of items that need annotation: discourse may require annotating ellipses or co-referential links that could be easily missed by different users (Orasan, 2003). Or each item requiring annotation may have thousands of options, such as an open-ended question (i.e., answering "Who was a famous mathematician?" from a potential candidate pool of all named entities in Wikipedia). Therefore, nlp annotation needs to be accurate, at least in aggregate.

1.4 Data Quality as a New Paradigm

Investing in reliable data (as defined by its generation and annotation dimensions) upfront has two benefits. First, this improvement in the quality and diversity of data is a prudent long-term investment as high-quality datasets can have shelf-lives of decades (Marcus et al., 1993; Miller, 1995a) while model architectures are frequently supplanted (Vaswani et al., 2017; Peters et al., 2018; Devlin et al., 2019a). Second, using experts for data generation can enable tasks not otherwise possible; generalists cannot annotate medical images nor generate sentences in a language which they do not speak.

We use experts in three experiments to collect nlp corpora and contrast them with past automated and crowd-sourced ones. First, we show the limitations of using automated methods of data collection (Chapters 3 and 4). Second, we show that crowd-sourcing can generate data flexibly but inaccurately (Chapter 5). Third, we show the merits of using experts as annotators for data evaluation for a subjective and novel named entity adaptation task (Chapter 6). Fourth, we describe an experiment that uses experts for both generation and annotation to study deception through the medium of a board game (Chapter 7). Last, we discuss a hybrid approach, using verified experts paired with external, low-cost data sources (Vukovic and Bartolini, 2010) (Chapter 8), that can mitigate some of the accuracy issues while scaling in size and cost. We classify our work by data source, data type, and applicable task (Table 1.1).

  Chapter   Data Source     Data Type                Task
  3         Automation      Generation               Question Answering
  4         Automation      Generation               Dialog
  5         Crowd-Sourced   Generation               Question Answering
  6         Expert          Annotation               Question Answering
  7         Expert          Generation, Annotation   Dialog
  8         Hybrid          Generation, Annotation   Dialog

Table 1.1: A tabular summary of our projects. Our thesis is organized in increasing order of data source complexity.

Chapter 2: Natural Language Processing Depends on Data

This chapter discusses several nlp tasks that require data and the types of data collection discussed in this thesis. Each nlp task requires its own bespoke training data, such as text in multiple languages for machine translation. Certain tasks within the subfields of question answering and dialog (Section 2.1) are unable to be solved with naturally-found data and require dataset creation. Different types of users can generate and annotate the data needed for these language models. Unspecialized users can be asked to solve tasks through crowd-sourcing and automated methods can generate data at scale (Section 2.2.3). Experts can gather and annotate data (Section 2.2.4). Last, hybrid approaches combine anonymous crowd users with experts that verify the results (Section 2.2.5). We provide the necessary background and past work relevant to these three user types (Section 2.2). We explain the models and metrics that are used in solving these tasks (Section 2.3).

2.1 Tasks

Language models can be created for different nlp tasks, but each requires a different type of training data. We focus on two nlp tasks in our research: Question Answering and Dialog.

2.1.1 What is a Task?

According to Resnik (2022), a task is an abstraction connected with a real-world problem that enables us to compare potential solutions in order to assess progress and consists of:

1. the real-world problem you care about
2. the dataset that's going to be used
3. the definitions of input/output that methods will need to use
4. the measurements that will be used to quantify performance
5. the criteria for what constitutes "better", i.e., progress

2.1.2 Question Answering

Question answering (qa) is one task heavily dependent on training data. The five-fold task tuple (Section 2.1.1) for qa is:

1. real-world problem: information retrieval
2. data: datasets such as those found in Table 2.3
3. input/output: input of free-form text and an output of a selected span from existing text or a free-form answer
4. evaluation: answer accuracy, perhaps weighted by question difficulty
5. standard for progress: higher accuracy and the ability to answer more complicated types of questions

  What is the English meaning of caliente?
  What is the meaning of caliente (in English)?
  What is the English translation for the word "caliente"?

Table 2.1: Three questions from trec 2000 data that are believably varied. The test questions were carefully crafted by experts.

In the current machine learning paradigm, qa generally answers a question with a previously seen answer. Therefore, the coverage of questions and answers is important as models trained on trivia questions are unlikely to answer inquiries about medical symptoms, and vice versa.
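The evaluation entry of this tuple ("answer accuracy") is commonly operationalized as exact match between a predicted and a gold answer after light normalization, in the spirit of squad-style scoring. The sketch below is a minimal illustration of that idea with hypothetical predictions, not the official evaluation script of any benchmark.

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match_accuracy(predictions, references):
    """Fraction of questions whose predicted answer matches the gold answer."""
    correct = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return correct / len(references)

# Hypothetical predictions for the two squad examples shown in Table 2.2.
preds = ["the later laws", "Bald Eagle Protection Act."]
golds = ["later laws", "Bald Eagle Protection Act"]
print(exact_match_accuracy(preds, golds))  # 1.0 after normalization
```

Weighting by question difficulty, as the tuple suggests, would replace the uniform average with per-question weights.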
We discuss the relevant history of question answering and review the most relevant datasets. The Text Retrieval Conference established qa as an annual, formalized task (Voorhees et al., 1999). The questions were carefully curated every year and modifications to the question answering task were made. Table 2.1 shows examples of questions that are intended to fool systems reliant on literal information extraction. Machine reading comprehension, "a task introduced to test the degree to which a machine can understand natural languages by asking the machine to answer questions based on a given context" (Li et al., 2019), ushered in larger, more diverse qa datasets, with squad (Rajpurkar et al., 2016, 2018) being the most popular leaderboard for models. The number of questions went from being measured in the hundreds to being measured in the hundreds of thousands. Example questions are provided in Table 2.2.

  Question                                        Answer
  "Which laws faced significant opposition?"      later laws
  "What was the name of the 1937 treaty?"         Bald Eagle Protection Act

Table 2.2: The paper examples from squad. Unlike Table 2.1, these questions are generated through crowd-sourcing and Wikipedia and are not carefully planned.

Large influential question answering datasets include squad 1.0 (Rajpurkar et al., 2016), squad 2.0 (Rajpurkar et al., 2018), MS Marco (Bajaj et al., 2016), TriviaQA (Joshi et al., 2017), quac (Choi et al., 2018), Quizbowl (Rodriguez et al., 2019), and Natural Questions (Kwiatkowski et al., 2019). We summarize the size of these datasets and their user pools in Table 2.3.

  Dataset             # of Questions   Data Source
  CoQA                8,000            Crowd
  squad 1.0           100k             Crowd
  squad 2.0           50k              Crowd
  quac                100k             Crowd
  TriviaQA            95k              Hybrid
  Quizbowl            100k             Hybrid
  Natural Questions   300k             Hybrid
  MS Marco            1000k            Found
  trec-8              200              Expert
  Trick Me            651              Expert

Table 2.3: A tabular summary of question answering datasets. The datasets described as hybrid all scrape or use naturally-occurring language and then supplement it with crowd-sourced annotation.

Computers can read a question and select the answer from a passage of text. This format of qa is called machine reading comprehension (Rajpurkar et al., 2016, mrc), and has been a popular choice for dataset design. However, qa models struggle to generalize when questions do not look like the standalone questions in training data: e.g., new genres, languages, or closely-related tasks (Yogatama et al., 2019). Unlike mrc, conversational question answering requires models to link questions together to resolve the conversational dependencies between them: each question needs to be understood in the conversation context. For example, the question "What was he like in that episode?" cannot be understood without knowing what "he" and "that episode" refer to, which can be resolved using the conversation context. CoQA is a conversational question answering dataset addressing different domains (Wikipedia, children's stories, news articles, Reddit, literature, and science articles) created by pairing Mechanical Turk crowd-sourced workers together (Reddy et al., 2019).

Recent work acknowledges that certain community practices around crowd-sourcing may not be optimal for qa. Boyd-Graber (2020) questions the paradigm of using crowd-sourced workers as the measure for human baselines, rather than evaluating through a play test. As one alternative to the crowd, Wallace et al. (2019b) work with the Quizbowl community to rewrite questions to be adversarial (and evaluate with a play test). At the intersection of question answering and machine translation, Clark et al. (2020a) emphasize that natural speakers of a language must be used to write authentic questions in languages outside of English.[1]

[1] Although the source of these speakers is still crowd-sourced unverified users, as they do not have other scalable access to speakers of typologically diverse languages.

2.1.3 Dialog

The five-fold task tuple (Section 2.1.1) for dialog is:

1. real-world problem: automating conversation
2. data: datasets such as those found in Table 2.4
3. input/output: input and output of free-form text
4. evaluation: naturalness of response or completion of end goal
5. standard for progress: comparisons to human dialogs, such as the Turing Test (Turing, 1950), in which a computer tries to fool a human into thinking it is a human through textual communication

Existing found conversational data has been repurposed as nlp datasets. Ubuntu threads provide millions of conversations of technical support (Lowe et al., 2015). Reddit, a collection of threaded comments about diverse subjects, and OpenSubtitles, collections of movie and television subtitles, provide millions of sentences as training data (Henderson et al., 2019).

However, found datasets cannot cover all domains and languages. For example, the audio data needed to automatically generate subtitles are unlikely to exist in low-resource languages, customer service data for training a chat bot is proprietary, and defendants are unlikely to carefully annotate sentences where they are lying in a court deposition (nor is the court likely to release the court deposition in a machine-readable format). Therefore, generating conversational datasets becomes an nlp need. The Dialog State Tracking Challenge (Henderson et al., 2014) creates several relatively small, crowd-sourced datasets focusing on different conversational areas on an annual basis. MultiWOZ proposes a framework for simulated conversations, which is necessary for domains containing sensitive data that cannot be released (Budzianowski et al., 2018).

  Dataset         # of Dialogs    Data Source
  DSTC2           1,612           Found
  Ubuntu Dialog   930,000         Found
  Reddit          256,000,000     Found
  OpenSubtitles   316,000,000     Found
  DSTC2           1,612           Crowd
  CoQA            8,000           Crowd
  MultiWOZ        8,438           Crowd

Table 2.4: A tabular summary of key dialog datasets.

2.2 Data Collection Type

Data for machine learning can come from one of five sources: finding data, automation, crowd-sourcing, experts, and a hybrid combination thereof. We discuss representative works for each of these data pools.

2.2.1 Finding

Reusing existing text through scraping websites or forums and re-purposing historical documents can create datasets with little effort. We define this type of data as found. The Internet contains troves of data, but this data is noisy due to having a low barrier to entry for contributors. Amazon reviews (McAuley et al., 2015), Twitter (Banda et al., 2021), and Wikipedia (Vrandecic and Krötzsch, 2014) provide language from aliased and often anonymous users.

In contrast, organizations that have an incentive to control or report their data release accurate, or at least authentic, datasets. EuroParl is collected from professionally translated official parliamentary proceedings (Koehn, 2005). Literature comes from a verified author (Iyyer et al., 2016), as do Reuters news articles (Lewis et al., 2004). The United Nations maintains detailed datasets about global populations.
The World Trade Organization releases a comprehensive collection of legal disputes. Enron released authentic emails sent by verifiable employees when its problems spilled out into the public domain (Klimt and Yang, 2004). What is the common denominator of these datasets? This data is sourced from experts (e.g., World Trade Organization lawyers and translators) or unverified online users (e.g., Reddit users). However, since this data was not originally created for nlp, further data processing and annotation are often required.

2.2.2 Automation

Data generation is necessary because the data needed for nlp cannot always be found. Synthetic data can be created according to fixed rules or templates, which we refer to as automation. Augmentation is a frequent phrasing of this way of creating data (Kafle et al., 2017). This method can create datasets of any scale, but it does not guarantee their authenticity.

Templates can create datasets unlimited in scale, but dubious in realism. Filatova et al. (2006) generate questions using specific verbs for various domains: airplane crashes, earthquakes, presidential elections, terrorist attacks. In their own words, their automatically created templates are "not easily readable by human annotators" and the evaluation requires a lengthy discussion. Examples of questions generated through templates include the following ambiguous questions about specific earthquakes (a sketch of this kind of template filling follows the list):

- Is it near a fault line?
- Is it near volcanoes?
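The sketch below is a minimal, hypothetical illustration of template-based question generation (the event records and templates are invented for this example and are not those of Filatova et al.). It shows both why the approach scales and why the output can be formulaic or ambiguous.

```python
# Hypothetical event records and question templates, for illustration only.
events = [
    {"type": "earthquake", "location": "Lisbon", "year": 1755},
    {"type": "earthquake", "location": "San Francisco", "year": 1906},
]

templates = [
    "Is the {type} in {location} near a fault line?",
    "Is the {type} in {location} near volcanoes?",
    "How many people were affected by the {year} {location} {type}?",
]

# Every event receives every template, so output grows as |events| x |templates|.
questions = [template.format(**event) for event in events for template in templates]
for question in questions:
    print(question)
```

The generation is cheap and unbounded, but nothing checks whether a question is sensible or answerable for a given event, which is exactly the realism problem discussed above.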
Visual classification tasks are maximally simple in nature since annotators are asked to decide if an image contains a Burmese cat. Figure 2.1 shows their interface. Despite their effort to simplify and explain the task, disagreement is a major problem and a minimum of 10 users are used to guarantee a level of confidence. Even with constant updates, the dataset still has limitations a decade later from the initial scaling methodology used to create it (Yang et al., 2020). Would training and rewarding the annotators upfront have saved time and money in the long-run? Crowd-sourcing spread to disciplines other than machine vision as a source 17 for research data. Mechanical Turk, the platform used for ImageNet, became the largest crowd-sourcing marketplace by making it easier for individuals and businesses to outsource their processes and jobs to a distributed workforce who can complete these tasks virtually (Amazon, 2021). Buhrmester et al. (2011) claim that Amazon Mechanical Turk gathers ?high-quality data inexpensively and rapidly? for psychol- ogy. The average psychology experiment is conducted using university students that require hourly compensation and usually come from a concentrated geographic area and socio-economic background. However, the evidence for this claim stems from having participants fill out a survey and is primarily evaluated on the time required, rather than the quality of the final result. In their survey, users report that their motivation for using Mechanical Turk is higher on a Likert scale for enjoyment than for payment. Given that nearly every nlp task requires that users complete a large amount of previous tasks (1000+) and with a nearly perfect accuracy (90%+), this claim seems unlikely to hold for the average producer of nlp data. As a note of caution, Mason and Suri (2012) claim that spammers are likely to target surveys on Mechanical Turk. Crowd Flower, renamed as Figure Eight, is a platform similar to Mechanical Turk, but with a focus on quality control. While Mechanical Turk keeps track of Human Intelligence Tasks (hit)?the name for each individual task?accuracy rates, this metric depends on task providers to manually evaluate the data and provide feedback about the worker. This level of oversight is unlikely to occur for thousands of tasks. Crowd Flower?s innovation is to include a test set with each task which monitors that users? responses correspond to gold labels. As early 18 adopters of crowd-sourcing, Finin et al. (2010) use Crowd Flower for annotating named entities in Twitter. However, most annotations are completed by a few prolific workers, which opens up the dataset to potential biases. Furthermore, creating a crowd-sourced dataset with Crowd Flower is possible for annotation but not for generation. The success of computer vision annotations led researches to use crowd-sourcing for collecting annotations in natural language processing tasks such as word sense disambiguation and machine translation (Callison-Burch et al., 2015). Snow et al. (2008) demonstrate that (on average) four non-expert workers can emulate an expert for five nlp tasks: affect recognition, word similarity, textual entailment, temporal event recognition, and word sense disambiguation. Using a nonprofessional user pool is the default manner for collecting large datasets for nlp as it can generated and annotated quickly and cheaply. 
As one example, large question answering datasets involving Wikipedia and search engines, such as squad and SearchQA, use crowd-sourcing to generate questions (Rajpurkar et al., 2016; Dunn et al., 2017).

The two main benefits of this data source are the cost and the rapid rate of data collection. The cost is unquestionably lower for an employer or researcher using the crowd rather than internal employees. Crowd workers are paid a fraction of what full-time employees would receive for the same task and do not receive any benefits (Whiting et al., 2019).2 Largely due to the variations in cost-of-living around the world and the flexibility of the work, the pay is appealing to some workers. The demographics of the platform more accurately model the United States than the average college student, at least for psychology experiments (Buhrmester et al., 2011; Difallah et al., 2018). For perspective, Amazon Mechanical Turk has over a hundred thousand workers, thousands of whom are available at any moment (Difallah et al., 2018). Modular tasks can be completed in hours with crowd-sourcing, as thousands of temporary workers complete tasks faster than a handful of employees.

2 This clearly is not a pro from the worker's perspective.

Figure 2.2: Crowd-sourcing can also be used to generate large-scale nlp data. However, generation creates a quality issue not present in annotation. In this particular example, Choi et al. (2018) highlight that the teacher does not provide quality responses. However, the student's conversation is quite unnatural and has grammatical issues.

The con of crowd-sourcing is that quality control becomes the central challenge for crowd-sourcing nlp data. Zaidan and Callison-Burch (2011) show that data gathered from crowd-sourcing for machine translation nets a bleu score nearly half that of professional translators, and only one point higher than an automatic machine translation approach. Other studies have shown that users tend to voluntarily provide inaccurate data (Suri et al., 2011) and misrepresent their background (Chandler and Paolacci, 2017; Sharpe Wessling et al., 2017).3 Last, there is an upper bound to the complexity of crowd-sourced tasks. Crowd workers become less reliable and efficient for tasks that are not straightforward (Finnerty et al., 2013). Figure 2.2 shows that more complicated nlp task instructions are not followed in good faith. For classification tasks, average accuracy needs to exceed 50% for reliable annotators to overcome their noisy peers (Kumar and Lease, 2011). This is not a threshold that is always achievable, since answers for certain tasks are sparse.

3 As a tangential consideration, legal regulation may ultimately limit the effectiveness of this technique, since it is completely unregulated by current employment practices (Wolfson and Lease, 2011).

Chapter 5 reveals quality issues in this technique through a project that crowd-sources questions. We use the crowd to rewrite sequential questions into a standalone format. However, extensive manual review is necessary to remove the low-quality contributions from the data pool. Experts are accountable in ways the crowd-user is not and do not require the same level of post-collection quality control.

Chapter 6 uses a crowd-sourced project, WikiData, for its modeling. WikiData (Vrandečić and Krötzsch, 2014) is a structured, human-annotated representation of Wikipedia entities that is actively developed. The method proves less accurate than expected due to the entities being unevenly populated.
Message | Sender's intention | Receiver's perception
If I were lying to you, I'd smile and say "that sounds great." I'm honest with you because I sincerely thought of us as partners. | Lie | Truth
You agreed to warn me of unexpected moves, then didn't ... You've revealed things to England without my permission, and then made up a story about it after the fact! ... | Truth | Truth
I have a reputation in this hobby for being sincere. Not being duplicitous. It has always served me well. ... If you don't want to work with me, then I can understand that ... | Lie | Truth
(Germany attacks Italy) Well this game just got less fun | Truth | Truth
For you, maybe | Truth | Truth

Table 2.5: In contrast to the previous conversations involving crowd workers, conversations involving experts generate creative, and even humorous, language. Additionally, the annotation of truthfulness is not possible with crowd-sourcing, since it requires the generator's real-time knowledge. This conversation snippet is from the Diplomacy project (Chapter 7).

2.2.4 Expert

We define "experts", provide a brief summary of relevant datasets, and introduce a dataset generated and annotated by domain experts.

We use the definition from Weinstein (1993): "An individual is an expert in the 'performative' sense if and only if he or she is able to perform a skill well." Defining expertise is a tricky and subjective goal; for example, "well" is highly subjective in this definition. Bourne et al. (2014) conclude that psychology is the appropriate framework for evaluating expertise, which "results from practice and experience, built on a foundation of talent, or innate ability". For nlp, we require that the person has both the incentive and skill to accurately, as opposed to quickly, complete their task. A degree of accountability, rather than full anonymity, is important as it prevents intentional fraud (Teitcher et al., 2015). Therefore, we require that experts be identifiable, in at least some capacity, during the data collection process. Such experts can be trained or they can be found in specialized communities of interest.

The social sciences have traditionally relied on a limited number of trained experts for establishing quantitative support for a theory. Grounded theory is "the discovery of theory from data" (Glaser and Strauss, 2017). Annotation is called coding and is used to systematically categorize and then analyze content (Neuendorf, 2017; Krippendorff, 2018). Baumer et al. (2017) find that the "grounded theory analysis took two researchers several hours of work per week over roughly two and a half months. ... With grounded theory, every single response was read and reread multiple times." This level of commitment does not easily scale.

Unsurprisingly, the number of datasets created with this attention to detail for nlp is limited due to the high cost associated with hiring experts and quality assurance. As an alternative, skilled citizen scientists may generate high-quality language in the pursuit of a hobby such as journalism, writing, or debate (Silvertown, 2009; Rymes and Leone, 2014). Given the increasing investment and interest in the field, this route for data collection will be the best long-term investment. We discuss existing sources of this kind of data, methods for generating language data, and methods for annotating language data.

Language recorded naturally for other purposes has led to datasets that have withstood the test of time.
The United Nations, New York City, and the World Trade Organization are all organizations that release reliable large-scale data, as discussed in Section 2.2.1. These organizations hire professionals such as translators and lawyers to generate language.

However, existing, or found, data sources do not cover all nlp tasks and domains. Therefore, generation by experts is necessary. Examples of this include adversarial questions written by trivia players (Wallace et al., 2019b), Document Understanding Conference summaries (Over, 2003), code summaries written by developers (Badihi and Heydarnoori, 2017), and story generation (Akoury et al., 2020).

Annotations are possible to collect from non-experts, but often at the expense of their accuracy. Programmers can self-annotate their code for easier future accessibility (Shira and Lease, 2011). Hate speech annotation is more accurate with expert annotators than amateur ones (Waseem, 2016). In the security field, privacy policies are complicated to understand (and to annotate) for the lay user (Jensen and Potts, 2004; Audich et al., 2018). In the medical field, the lack of expert annotation poses a barrier to large-scale nlp clinical solutions (Chapman et al., 2011). Unsurprisingly, doctor annotation is more accurate than online generalist annotation for medical diagnoses (Cheng et al., 2015).

The quality of crowd-sourced work relative to expert work has been disputed in multiple studies. Mollick and Nanda (2016) compare expert to crowd judgment for the funding of theater productions. They conclude that most decisions are aligned between the two pools, but that crowds are more swayed by superficial presentation than underlying quality. Leroy and Endicott (2012) compare annotations of text difficulty between a medical librarian and a non-expert user and do not see a large difference on a small sample size.

Figure 2.3: Hybrid approaches try to control the quality of language generated by the crowd. MultiWoz (Budzianowski et al., 2018) creates a rigid template for the user conversation, avoiding the worst quality issues at the expense of user creativity.

Chapter 7 presents a project that works with the community around Diplomacy, a popular board game, to generate and annotate a natural conversational dataset for the task of deception. The language in this dataset is realistic and impossible to generate with unspecialized crowd users. An example conversation is provided in Table 7.1.

2.2.5 Hybrid

Hybrid approaches aim to enhance crowd-sourcing by overseeing unspecialized labor or automatic methods with expert knowledge. This combination lowers cost and allows for data scaling, while maintaining a certain level of quality control. We define hybrid user pools and discuss past projects.

We define hybrid data collection sources as any that combine a cost-saving pool, such as crowd-sourcing or automation, with expert supervision. This is a natural extension of crowd-sourcing and does not require as detailed a historical overview: once quality issues were noted, attempts were made to remedy them. For generation, crowd-sourced workers can be combined with trained agents to create data for a given nlp task. For annotation, crowd-sourced workers can be supervised by trained experts.4 As an illustrative example, Zaidan and Callison-Burch (2011) propose an oracle-based approach to identify the high-quality crowd-sourced workers and rely on their judgments.

4 Automation can replace the crowd for simple tasks in this hybrid approach.
The paper claims that crowd-sourcing can lead to a notable reduction in cost without a complete loss in quality. Their approach crucially depends on having expert (professional) translations as a reference point.

Hybrid approaches improve quantity and quality for other nlp tasks. Kochhar et al. (2010) use a hierarchical system for database slot filling, specifically for Freebase. First, an item is populated by automatic methods, then issues are escalated to volunteer users, and any remaining issues are escalated to trained experts. Ade-Ibijola et al. (2012) design a system for essay grading that allows for teacher oversight and compare their results to area experts. Hong et al. (2018) optimize the productivity of medical field experts by providing additional reference resources and standardizing databases. fever (Thorne et al., 2018a) relies on super-annotators on one percent of the data as a comparison point for all other annotations. Errors made by crowd-sourced workers on Named Entity Recognition can be clustered and identified, which in turn can be escalated to a skilled arbitrator to improve task guidance (Nguyen et al., 2019). Having an expert-written template that crowd workers must follow eliminates the worst-quality submissions (Budzianowski et al., 2018). This example is provided in Figure 2.3. A combination of trained and untrained workers can be used for generating Wizard-of-Oz personal assistant dialog (Byrne et al., 2019).

Furthermore, some crowd-sourcing platforms rely on this hybrid approach. Crowd Flower, mentioned in Section 2.2.3, attempts to bolster the reliability of the crowd by requiring the task master to create gold-standard test questions, which are interspersed among the data being collected (Vakharia and Lease, 2015). While not necessarily using experts, this provides an automatic quality filter that down-weights the reliability of annotations made by the least accurate annotators, as determined by the gold-standard test set. Crucially, this approach can only work for annotation, as generation quality cannot be quickly assessed. oDesk is a crowd-sourcing platform that provides a hybrid approach, as it relies on crowd-sourcing from the Internet, but vets the participants to have a matching skill set for the task (Vakharia and Lease, 2015). Prolific and Upwork are two other platforms that place additional emphasis on vetting reliable users.5,6

5 http://www.prolific.co
6 http://www.upwork.com

2.3 Models & Metrics

Data is a prerequisite for machine learning. We summarize popular models that can be trained from data. Additionally, we discuss the metrics used to evaluate these models. This emphasis on evaluating the model, rather than the underlying data, is a limitation in nlp.

2.3.1 Logistic Regression

According to Ng and Jordan (2002), logistic regression is a basic discriminative model, meaning that it can classify items into one of several classes. It relies on using features x to predict class y by learning a vector of weights w and a bias term b according to:

z = w · x + b    (2.1)

The variable z is then passed through a sigmoid function to transform the value into a probability:

σ(z) = 1 / (1 + e^(-z))    (2.2)

Additionally, the loss function tells the logistic regression how quantitatively wrong a prediction is. Popular loss functions include cross-entropy loss, often used for logistic regression and classification tasks, and mean squared error (Sammut and Webb, 2010).
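As a concrete illustration (not from the original text), a minimal NumPy sketch of Equations 2.1 and 2.2 together with a binary cross-entropy loss; the feature values and weights below are made up.

```python
import numpy as np

def sigmoid(z):
    # Equation 2.2: squash the raw score into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    # Equation 2.1: z = w . x + b, then the sigmoid
    return sigmoid(np.dot(w, x) + b)

def cross_entropy(y_true, y_prob):
    # Binary cross-entropy: how "quantitatively wrong" the prediction is
    eps = 1e-12
    return -(y_true * np.log(y_prob + eps) + (1 - y_true) * np.log(1 - y_prob + eps))

# Toy example with two features (e.g., counts of two indicative words)
w = np.array([0.8, -1.2])   # learned feature weights
b = 0.1                     # bias term
x = np.array([3.0, 1.0])    # feature vector for one example
p = predict_proba(w, b, x)  # probability of the positive class
loss = cross_entropy(1, p)  # loss if the true label is the positive class
```

Training then amounts to adjusting w and b to lower this loss, which is the subject of the next paragraphs.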
There are two phases to logistic regression: training and testing. During training, stochastic gradient descent and cross-entropy loss learn the optimal weights w and bias b. Cross-entropy loss calculates the difference between the predicted ŷ and the true y. The gradient descent algorithm (Bottou, 2010; Ruder, 2016) finds the minimum loss. At test time, for each example the label with the highest probability is predicted as ŷ. Multinomial logistic regression allows for the prediction of more than two classes.

The logistic regression model is interpretable since the weight of each feature is transparent in the final prediction. Certain features have higher weights than others. A feature weight close to zero indicates that the feature is not essential for the model; conversely, the highest-weighted feature is important for the task. This has made logistic regression a popular baseline model for machine learning, as its straightforward interpretability contrasts with the current state-of-the-art model: neural networks.

2.3.2 Neural Models

Neural networks are a more powerful classifier than logistic regression and can be shown to approximate any function thanks to their hidden layers. A hidden layer is a layer that applies a (usually nonlinear) transformation to an input to generate a new output. As a result, neural networks often avoid dependence on carefully crafted features and learn their own representations for the task (Jurafsky and Martin, 2000).

All neural networks depend on backpropagation. The hidden layer(s) allow nonlinear transformations but need to be trained to produce a desirable output. This is done through backpropagation, which percolates weight adjustments with the chain rule throughout the entire network. The gradient of the loss function is calculated one layer at a time, iterating backwards from the last layer (hence backpropagation).

We focus on neural architectures applicable to nlp: Deep Averaging Networks (Section 2.3.3) and Recurrent Neural Networks (Section 2.3.4).

2.3.3 Deep Averaging Network

The Deep Averaging Network, or dan, classifier proposes a simple architecture with comparable results to more complicated neural models. Unlike logistic regression, the dan adapts to linguistic versatility by using embeddings in lieu of specific word features. It has three sections: a "neural bag-of-words" (nbow) encoder, which composes all the words in the document into a single vector by averaging the word vectors; a series of hidden transformations, which give the network depth and allow it to amplify small distinctions between composed documents; and a softmax predictor that outputs a class.

The encoded representation r is the average of the embeddings of the input words. The word vectors exist in an embedding matrix E, from which we can look up a specific word w with E[w]. The length of the document is N. To compute the composed representation r, the dan averages all of the word embeddings:

r = (1/N) Σ_{i=1}^{N} E[w_i]    (2.3)

The network weights W consist of a weight-bias pair (W^(h_i), b^(h_i)) for each layer i in the list of layers L. To compute the hidden representations for each layer, the dan linearly transforms the input and then applies a nonlinearity: h_0 = σ(W^(h_0) r + b^(h_0)). Successive hidden representations h_i are: h_i = σ(W^(h_i) h_{i-1} + b^(h_i)). The final layer in the dan is a softmax output: o = softmax(W^(o) h_L + b^(o)). This model is used and modified in Chapter 3.
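A minimal PyTorch sketch of the dan described above; the vocabulary size, embedding dimensionality, and number of layers are placeholders rather than the values used in later chapters.

```python
import torch
import torch.nn as nn

class DAN(nn.Module):
    """Neural bag-of-words encoder, hidden transformations, and a class predictor."""
    def __init__(self, vocab_size, emb_dim, hidden_dim, n_classes, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # embedding matrix E
        layers, in_dim = [], emb_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden_dim), nn.ReLU()]
            in_dim = hidden_dim
        self.hidden = nn.Sequential(*layers)
        self.out = nn.Linear(in_dim, n_classes)  # softmax is applied inside the loss

    def forward(self, word_ids):
        r = self.embed(word_ids).mean(dim=1)  # Equation 2.3: average the word vectors
        return self.out(self.hidden(r))       # logits o for the softmax predictor

# Toy usage: a batch of two "documents" of five word ids each
model = DAN(vocab_size=10_000, emb_dim=300, hidden_dim=300, n_classes=5)
logits = model(torch.randint(0, 10_000, (2, 5)))
```

The averaging step implements Equation 2.3; the hidden stack and output layer correspond to the transformations h_i and o.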
2.3.4 Sequence Models

Unlike the dan, Recurrent Neural Networks (Elman, 1990, rnn) take into account the sequence of the input, which is important given the ordered nature of language. The long short-term memory (Gers et al., 2000, lstm) modifies the rnn by allowing it to discard past information.

According to Goldberg (2017), Sequence to Sequence refers to a model that ingests a sequence of text and then generates a sequence of text, rather than a single classification, as an output. The architecture necessary for this is called Encoder-Decoder: the text input is first encoded, meaning a sequence of text is transformed into a numerical representation, and then decoded, meaning this representation is transformed back into text. Machine translation is a clear example where Sequence to Sequence applies. If a sentence in German needs to be transformed into English, then the German sentence is first encoded into a numerical representation and then decoded into an English sentence.

Attention (Bahdanau et al., 2015) is a modification of the lstm that looks at different parts of the encoded sequence at each stage in the decoding process. Visualizing attention provides a mild level of interpretability, as the model looks at a specific part of the input. We use these models in Chapters 7 and 8, as the current state of the art for nlp.

Additionally, rather than relying on n-gram language models, neural language models reference prior context as embeddings that represent the word(s). This means that the neural network can understand that "cat" and "dog" are similar, and can be treated similarly, whereas an n-gram model assumes independence. word2vec (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014) embeddings are commonly used pre-trained embeddings. This powerful innovation precipitated the current state-of-the-art dependence on Transformers, which are used in Chapters 7 and 8.

The Transformer model simplifies the architecture and dispenses with recurrences and convolutions (Vaswani et al., 2017), relying instead entirely on attention. elmo (Peters et al., 2018), used in Chapter 8, improves on GloVe embeddings (Pennington et al., 2014) by allowing a word's embedding to adjust to the context, rather than being committed to a single word sense. bert improves the embeddings further by looking at context bidirectionally, meaning that words that follow a word influence its embedding. These pre-trained embeddings can be further fine-tuned to accommodate a specific domain's context.

2.3.5 Evaluation

But how does one evaluate a model, or the underlying quality of data? Model evaluation is specific to a task: classifying images correctly for ImageNet or answering a question for squad. The goal is achieving the highest quantitative accuracy on a particular task (Wang et al., 2019a); qualitative analysis of what was answered correctly in contrast to another model is an afterthought (Linzen, 2020).

Data evaluation is necessary for crowd-sourcing. For annotation, one can compare the annotations of users to one another using Inter-Annotator Agreement (iaa) (Artstein and Poesio, 2008), the most popular measure of which is Cohen's Kappa (Cohen, 1960). Nowak and Rüger (2010) show that for simple image classification tasks, the majority vote of unspecialized users is comparable to expert annotation. Passonneau and Carpenter (2014) confirm these results comparing trained undergraduates with the crowd for word sense annotation.
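As a concrete illustration of iaa, a minimal computation of Cohen's Kappa for two annotators; the labels below are invented.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Two annotators labeling six items as deceptive ("lie") or not ("truth")
annotator_1 = ["lie", "truth", "truth", "lie", "truth", "truth"]
annotator_2 = ["lie", "truth", "lie",   "lie", "truth", "truth"]
kappa = cohens_kappa(annotator_1, annotator_2)  # 1.0 = perfect, 0.0 = chance-level
```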
Additionally, having a large number of annotators makes it possible to establish confidence in the label accuracy of each individual word. However, there is no obvious metric to compute iaa for generation. Machine translation uses metrics such as bleu (Papineni et al., 2002), meteor (Banerjee and Lavie, 2005), and TERp (Snover et al., 2009) as an automatic approximation of target quality; however, the quality of the source data, which must be generated by human users, is never evaluated. In question answering, one may limit the possible answers to existing pages in Wikipedia, or some other finite source, to avoid string matching problems. But language is complex, and multiple users could write equally valid questions that do not appear similar at the character level. Table 2.1 is one such example. Deng et al. (2021) propose a unified set of metrics for compression, transduction, and creation tasks as a first step in systematically assessing language generation quality.

The pivot to language models, and later the neural revolution in natural language processing, precipitated an ever-increasing race for data; the largest dataset, not the best model architecture, may be the key differentiating factor for solving an nlp task. But how to evaluate the influence of data rather than architecture is an open research question. Since this is a broad question, we focus on two areas of nlp that are data dependent: question answering and dialog. Four possible sources of data are presented and compared: found/automatic (Chapters 3 and 4), crowd-sourced (Chapter 5), expert-sourced (Chapters 6 and 7), and hybrid (Chapter 8). A large-scale data project explores the limitations of relying on model accuracy without data verification (Chapter 3).

Chapter 3: Automatic Data Generation from a Found Source1

The fastest method of creating large neural-scale datasets is through automatic generation of synthetic data. This chapter discusses a large audio dataset created with text-to-speech for the task of question answering (Section 2.1.2), and its limitations (Section 3.1). The dataset, while large, is not realistic and would be supplanted by a similar human-generated dataset (Chapter 5). Furthermore, both datasets ultimately depend on experts for validation (Chapters 6 and 7).

1 Denis Peskov, Joe Barrow, Pedro Rodriguez, Graham Neubig, and Jordan Boyd-Graber. 2019. Mitigating noisy inputs for question answering. In Conference of the International Speech Communication Association. Peskov is responsible for the data creation, the gathering of recordings from users, running the neural models, figure and table design, and paper writing.

3.1 Automated Data Creation for Question Answering

Progress on question answering (qa) has claimed human-level accuracy. However, most factoid qa models are trained and evaluated on clean text input, which becomes noisy when questions are spoken due to automatic speech recognition (asr) errors. This consideration is disregarded in trivia match-ups between machines and humans: ibm Watson (Ferrucci, 2010) on Jeopardy! and Quizbowl (qb) matches between machines and trivia masters (Boyd-Graber et al., 2018) provide text data for machines while humans listen. An Artificial Intelligence needs to process speech input, akin to how a typical human would process sound, to play the trivia game without this outside assistance.2 Unlike a typical human, the computer needs a model to decode the audio into text and answer the question.
Unfortunately, there are no large spoken corpora of factoid questions with which to train models; text-to-speech software can be used as a method for generating training data at scale for question answering models (Section 3.2). Although synthetic data is less realistic than true human-spoken questions, it is easier and cheaper to collect at scale, which is important for training. These synthetic data are still useful; models trained on synthetic data are applied to human-spoken data from qb tournaments and Jeopardy! (Section 3.4.1).

Noisy asr is particularly challenging for qa systems (Figure 3.1). While humans and computers might know the title of a "revenge novel centering on Edmund Dantes by Alexandre Dumas", transcription errors may mean deciphering "novel centering on edmond dance by alexander <unk>" instead. Dantes and Dumas are low-frequency words in the English language and hence likely to be misinterpreted by a generic asr model; however, they are particularly important for answering the question. Additionally, the introduction of distracting words (e.g., "dance") causes qa models to make errors (Jia and Liang, 2017). Key terms like named entities are often missing, which is detrimental for qa (Section 3.2.1).

2 An audibly impaired person would be delivered questions in a non-audio medium, but would still experience a perceptual delay, unlike a machine. The end goal of a trivia qa robot should be considered: a trivia robot that wins every match by buzzing at superhuman speeds does not provide an optimal trivia game for humans. Therefore, robots should be tested on their knowledge, not their response time, unless the goal is to test different robots against one another.

Figure 3.1: asr errors on qa data: original spoken words (top of box) are garbled (bottom). While many words turn into "noise" (frequent words or the unknown token), consistent errors (e.g., "clarendon" to "clarintin") can help downstream systems. Additionally, words reduced to <unk> (e.g., "kermit") can be useful through forced decoding into the closest incorrect word (e.g., "hermit" or even "car").

Previous approaches to mitigate asr noise for answering mobile queries (Mishra and Bangalore, 2010) or building bots (Leuski et al., 2009) typically use unsupervised methods, such as term-based information retrieval. Our datasets for training and evaluation can produce supervised systems that directly answer spoken questions. Machine translation for speech (Sperber et al., 2017) also uses asr confidences; we evaluate similar methods on qa.

Specifically, some accuracy loss from noisy inputs can be mitigated through a combination of forcing unknown words to be decoded as the closest option (Section 3.3.2) and incorporating the uncertainties of the asr model directly in neural models (Section 3.3.3). The forced decoding method reconstructs missing terms by using terms audibly similar to the transcribed input. Word-level confidence scores incorporate uncertainty from the asr system into a Deep Averaging Network (Section 2.3.3). These methods are compared against baseline methods on our synthetic and human speech datasets for Jeopardy! and qb (Section 3.4).

3.2 Automatically Generating a Speech Dataset

Neural networks require a large training corpus, but recording hundreds of thousands of questions is not feasible. Methods for collecting large-scale audio data include Generative Adversarial Networks (Donahue et al., 2018) and manual recording (Lee et al., 2018).
For manual recording, crowd-sourcing with the required quality control (speakers who say "cyclohexane" correctly) is prohibitively expensive. As an alternative, we generate a dataset with Google Text-to-Speech on 96,000 factoid questions from a trivia game called Quizbowl (Boyd-Graber et al., 2018), each with four to six sentences, for a total of over 500,000 sentences.3 We then decode these utterances using the Kaldi chain model (Peddinti et al., 2015), trained on the Fischer-English dataset (Cieri et al., 2004) for consistency with past results on mitigating asr errors in mt (Sperber et al., 2017). This model decodes enough noise into our data to test mitigation strategies.4

3 http://cloud.google.com/text-to-speech
4 This model has a Word Error Rate (wer) of 15.60% on the eval2000 test set (Jurafsky et al., 1997). The wer increases to 51.76% on our qb data, which contains out-of-domain vocabulary. Since there is no past work in question answering, we use machine translation as a proxy for determining an appropriate Word Error Rate, as intentional noise has been added to this subdomain (Michel and Neubig, 2018; Belinkov and Bisk, 2018). The most bleu improvement in machine translation under noisy conditions could be found in this middle wer range, rather than in values below 20% or above 80% (Sperber et al., 2017). Retraining the model on the qb domain would mitigate this noise; however, in practice one is often at the mercy of a pre-trained recognition model due to changes in vocabularies or speakers.

3.2.1 Why Question Answering is Challenging for asr

Question answering (qa) requires the system to provide a correct answer out of many candidates based on the question's wording. asr changes the features of the recognized text in several important ways: the overall vocabulary is quite different and important words are corrupted. First, it reduces the overall vocabulary. In our dataset, the number of unique words drops from 263,271 in the original data to a mere 33,333. This is expected, as our asr system only has 42,000 words in its vocabulary, so the long tail of the Zipf curve is lost. Second, unique words, which may be central to answering the question, are lost or misinterpreted; over 100,000 of the words in the original data occur only once. Finally, asr systems tend to unintentionally delete words, which makes the sentences shorter. In our qb data, the average number of words decreases from 21.62 to 18.85 per sentence.

The decoding system expresses uncertainty by predicting <unk>. These tokens account for slightly less than 10% of all our word tokens, but <unk> is a top-2 prediction for 30% of the 263,271 words in our dataset. For qa, words with a high tf-idf measure are valuable. While some words are lost, others can likely be recovered: "hellblazer" becoming "blazer", "clarendon" becoming "claritin". We evaluate this by fitting a tf-idf model on the Wikipedia dataset and then comparing the average tf-idf per sentence between the original and the asr data. The average tf-idf score drops from 3.52 to 2.77 per sentence, meaning that on average the number of unique words has decreased. Examples of this change can be seen in Figure 3.1.

For generalization, we test the effect of noise on two types of distinct questions. qb questions, which are generally four to six sentences long, test a user's depth of knowledge; early clues are challenging and obscure, but they progressively become easy and well-known. Competitors can answer these types of questions at any point.
Computer qa is competitive with the top players (Yamada et al., 2018). Jeopardy! questions are single sentences and can only be answered after the question ends. To test this alternate syntax, we use the same method of data generation on a dataset of over 200,000 Jeopardy! questions (Dunn et al., 2017).

3.3 Mitigating Noise

This section discusses two approaches to mitigating the effects of missing and corrupted information caused by asr systems. The first approach, forced decoding, exploits systematic errors to arrive at the correct answer. The second uses confidence information from the asr system to down-weight the influence of low-confidence terms. Both approaches improve accuracy over a baseline dan model and show promise for short single-sentence questions. However, a third ir approach, specifically using an inverted search index, is more effective on long questions since noisy words are completely avoided during the answer selection process.

3.3.1 ir Baseline

The ir baseline reframes the Jeopardy! and qb qa tasks as document retrieval tasks with an inverted search index. We create one document per distinct answer; each document has a text field formed by concatenating all questions with that answer together. At test time, new, unseen questions are treated as queries, and documents are scored using bm25 (Ramos, 2003; Robertson et al., 2009). We implement this baseline with Elastic Search and Apache Lucene.

3.3.2 Forced Decoding

We have systematically lost information due to asr decoding. We could predict the answer if we had access to certain words in the original question, and we further postulate that wrong guesses are better than knowing that a word is unknown. For example, "Language is a process of recreation [free creation]" is possible to decipher, while "Language is a process of <unk>" is not.5

As a first step, we explored commercial solutions (Bing, Google, ibm, Wit) with low transcription errors. However, their apis ensure that an end-user often cannot extract anything more than one-best transcriptions, along with an aggregate confidence for the sentence. Additionally, the proprietary systems are moving targets, harming reproducibility.

Therefore, we use Kaldi (Povey et al., 2011) for all experiments. Kaldi is a commonly-used, open-source tool for asr; its maximal transparency enables approaches that incorporate uncertainty into downstream models. Kaldi provides not only top-1 predictions, but also confidences of words, entire lattices, and phones (Table 3.1). Each item in the sequence represents a word and has a corresponding confidence in the range [0, 1].

5 Providing the full lattice, as in Table 3.1, would grant even more information to the model. However, we did not see an improvement from using the full lattice, likely due to the increased complexity of the data.

Clean    For 10 points, name this revenge novel centering on Edmond Dantes, written by Alexandre Dumas
1-Best   for (0.935) ten (0.935) points (0.871) same (0.617) this (1) ... revenge novel centering on <unk> written by alexander <unk> ...
Lattice  for (0.935) [eps] (0.064) pretend (0.001) ten (0.935) ... pretend point points point name same named name names this revenge novel ...
Phones   f_B (0.935) er_E (0.935) t_B (0.935) eh_I (1) n_E (0.935) ... p_B oy_I n_I t_I s_E sil s_B ey_I m_E dh_B ih_I s_E r_B iy_I v_I eh_I n_I jh_E n_B aa_I v_I ah_I l_I ...

Table 3.1: As original data are translated through asr, they degrade in quality. One-best output captures per-word confidence. Full lattices provide additional words, and phone data captures the raw asr sounds. Our confidence model and forced decoding approach could be used for such data in future work.
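To make the confidence information concrete, a minimal sketch of how the one-best output of Table 3.1 can be represented for downstream models; the list-of-tuples representation is an illustrative simplification, not Kaldi's actual output format.

```python
# One decoded phrase from the 1-best row of Table 3.1, as (word, confidence) pairs
one_best = [("for", 0.935), ("ten", 0.935), ("points", 0.871), ("same", 0.617)]

# Sentence-level view: a single averaged confidence feature
avg_confidence = sum(conf for _, conf in one_best) / len(one_best)

# Word-level view: flag the words the recognizer was least sure about
uncertain_words = [word for word, conf in one_best if conf < 0.7]  # ["same"]
```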
The typical end-user of an asr system wants to know when a word is not recognized. Under the hood, the asr system is a graph of possible phrases. In addition to tokens for decoded sounds (e.g., "oy", "ah", ...), the graph has a token that represents an unknown; in Kaldi, this becomes <unk>. At a human level, one would want to know that an out-of-context word happened. However, when the end-user is a downstream model, a systematically wrong prediction may be better than a generic statement of uncertainty. So by removing all reference to <unk> in the model, we force the system to decode "Louis Vampas" as "Louisiana" rather than <unk>.6 The risk we run with this method is introducing words not present in the original data. For example, "count" and "mount" are similar in sound but not in context embeddings. Hence, we need a method to down-weight incorrect decodings.

6 More specifically, <unk> is removed from the Finite State Transducer, which sets the input/output for the asr system.

3.3.3 Confidence Augmented dan

We modify the original dan model (Section 2.3.3) to use word-level confidences from the asr system as a feature and to be robust to corrupted phrases stemming from these incorrect decodings. In increasing order of complexity, the variations are: a Confidence Informed Softmax dan, a Confidence Weighted Average dan, and a Word-Level Confidence dan. We represent the confidences as a vector c, where each cell c_i contains the asr confidence of word w_i.

The simplest model averages the confidence across the whole sentence and adds it as a feature to the final output classifier. For example, in Table 3.1, "for ten points" averages to 0.914. We introduce an additional weight in the output, W^(c), which adjusts our prediction based on the average confidence of each word in the question. This phrase will not affect the question answering system. But the following words "revenge novel" have high enough confidences to be decoded, while "Dumas" drops enough to become <unk>. However, most words have high confidence, and thus the average confidence at the sentence or question level is high.

To focus on which words are uncertain, we weight the word embeddings by their confidence, attenuating uncertain words before calculating the dan average. In the previous example, "for ten points", "for" and "ten" are frequently occurring words and have a confidence of .935, while "points" has a lower confidence of .871. The next word, "same", should be "name", and hence the embedding referenced is incorrect. But the lower confidence of .617 for this prediction decreases the overall weight of the embedding in the model.

Weighting by the confidence directly removes uncertain words, but this is too blunt an instrument and could end up erasing useful information contained in low-confidence words, so we instead learn a function based on the raw confidence from our asr system. Thus, we recalibrate the confidence through a learned function f:

f(c) = W^(c) c + b^(c)    (3.1)

and then use that scalar in the weighted mean of the dan representation layer:

r** = (1/N) Σ_{i=1}^{N} E[w_i] · f(c_i)    (3.2)

In this model, we replace the original encoder r with the new version r** to learn a transformation of the asr confidence that down-weights uncertain words and up-weights certain words. This final model is called our "Confidence Model".
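A minimal PyTorch sketch of the recalibrated, confidence-weighted encoder of Equations 3.1 and 3.2; dimensions and the toy inputs are illustrative.

```python
import torch
import torch.nn as nn

class ConfidenceWeightedEncoder(nn.Module):
    """Replaces the dan's plain average with the confidence-weighted average r**."""
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.recalibrate = nn.Linear(1, 1)  # Equation 3.1: f(c) = W^(c) c + b^(c)

    def forward(self, word_ids, confidences):
        emb = self.embed(word_ids)                         # (batch, N, emb_dim)
        f_c = self.recalibrate(confidences.unsqueeze(-1))  # learned per-word weight f(c_i)
        return (emb * f_c).mean(dim=1)                     # Equation 3.2: average of E[w_i] * f(c_i)

encoder = ConfidenceWeightedEncoder(vocab_size=10_000, emb_dim=300)
word_ids = torch.randint(0, 10_000, (1, 4))            # e.g., "for ten points same"
confs = torch.tensor([[0.935, 0.935, 0.871, 0.617]])   # asr confidences from Table 3.1
r_star = encoder(word_ids, confs)                      # feeds into the dan's hidden layers
```

The encoder output r** then passes through the dan's hidden layers and softmax predictor as before.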
Architectural decisions are determined by hyperparameter sweeps. They include: a single hidden layer of dimensionality 1,000 for the dan, multiple drop-out and batch-norm layers, and a scheduled Adam optimizer. Our dan models train until convergence, as determined by early stopping. Code is implemented in PyTorch (Paszke et al., 2017), with TorchText for batching.7

7 Code, data, and additional analysis available at https://github.com/DenisPeskov/QBASR

3.4 Results

Achieving 100% accuracy on this dataset is not a realistic goal, as not all test questions are answerable (specifically, some answers do not occur in the training data and hence cannot be learned by an ir-like system). Baselines for the dan (Table 3.2) establish realistic goals: a dan trained and evaluated on the same train and dev set, only in the original non-asr form, correctly predicts 54% of the answers. Noise drops this to 44% with the best ir model and down to roughly 30% with neural approaches.

Since the noisy data quality makes full recovery unlikely, we view any improvement over the neural model baselines as recovering valuable information. At the question level, strong ir outperforms the dan by around 10%. There is additional motivation to investigate this task at the sentence level. Computers can beat humans at the game by knowing certain questions immediately; the first sentence of the qb question serves as a proxy for this threshold. Our proposed combination of forced decoding with a neural model leads to the highest test accuracy results and outperforms the ir one at the sentence level.

A strong tf-idf ir model can top the best neural model at the multi-sentence question level in qb; multiple sentences are important because they progressively become easier to answer in competitions. However, our models improve accuracy on the shorter first-sentence level of the question. This behavior is expected since ir methods are explicitly designed to disregard noise and can pinpoint the handful of unique words in a long paragraph; conversely, they are less accurate when they extract words from a single sentence.

Model        qb Synth Start   qb Synth End   qb Human Start   qb Human End   Jeopardy! Synth   Jeopardy! Human
Methods Tested on Clean Data
ir           0.064            0.544          0.400            1.000          0.190             0.050
dan          0.080            0.540          0.200            1.000          0.236             0.033
Methods Tested on Corrupted Data
ir base      0.021            0.442          0.180            0.560          0.079             0.050
dan          0.035            0.335          0.120            0.440          0.097             0.017
fd           0.032            0.354          0.120            0.440          0.102             0.033
Confidence   0.036            0.374          0.120            0.460          0.095             0.033
fd+Conf      0.041            0.371          0.160            0.440          0.109             0.033

Table 3.2: Both forced decoding (fd) and the best confidence model improve accuracy. Jeopardy! only has an at-end-of-sentence metric, as questions are one sentence in length. Combining the two methods leads to a further joint improvement in certain cases. ir and dan models trained and evaluated on clean data are provided as a reference point for the asr data.

Speaker  Text
Base     John Deydras, an insane man who claimed to be Edward II, stirred up trouble when he seized this city's Beaumont Palace.
S1       <unk> an insane man who claimed to be the second <unk> trouble when he sees <unk> beaumont → Richard_I_of_England
S2       john dangerous insane man who claims to be the second stirring up trouble when he sees the city's beaumont → London
S3       <unk> dangerous insane man who claim to be <unk> second third of trouble when he sees the city's <unk> palace → Baghdad

Table 3.3: Variation in different speakers causes different transcriptions of a question on Oxford. The omission or corruption of certain named entities leads to different answer predictions, which are indicated with an arrow.
3.4.1 Qualitative Analysis & Human Data

While the synthetic dataset facilitates large-scale machine learning, we ultimately care about performance on human data. For qb, we record questions read by domain experts at a competition. To account for variation in speech, we record five questions across ten different speakers, varying in gender and age; this set of fifty questions is used as the human test data. Table 3.3 provides examples of the variations. For Jeopardy!, we manually parsed a complete episode.

The predictions of the regular dan and the confidence version can differ. As one example, for input about The House on Mango Street, which contains words like "novel", "character", and "childhood" alongside a corrupted name of the author, the regular dan predicts The Prime of Miss Jean Brodie, while our version predicts the correct answer. As another example, the model in Table 3.3 predicts "London" if "beaumont" and "john" are preserved, but "Baghdad" if the proper nouns, but not "palace" and "city", are lost.

3.5 Confidence in Data Quality

Confidences are a readily human-interpretable concept that may help build trust in the output of a system. Transparency in the quality of upstream content can lead to downstream improvements in a plethora of nlp tasks.

Exploring sequence models or alternate data representations may lead to further improvement. Including full lattices may mirror past results for machine translation (Sperber et al., 2017) for the task of question answering. Using unsupervised approaches for asr (Wessel and Ney, 2004; Lee et al., 2009) and training asr models for decoding qb or Jeopardy! words are avenues for further exploration.

3.5.1 Can Question Answering Audio be Automated?

Question answering, like many nlp tasks, is impaired by noisy inputs. Introducing asr into a qa pipeline corrupts the data. A neural model that uses the asr system's confidence outputs and systematic forced decoding of words rather than unknowns improves qa accuracy on qb and Jeopardy! questions. Our methods are task agnostic and can be applied to other supervised nlp tasks. Larger human-recorded question datasets and alternate model approaches would ensure spoken questions are answered accurately, allowing human and computer trivia players to compete on an equal playing field. Text-to-Speech technology can create a large dataset, but the unvarying pronunciation, speed, and voice (every single tts voice is female) ultimately inhibits this approach from being a gold standard.

We focus on question answering, but speech data can be used for other purposes. As one example, speech-to-speech machine translation will require sizable amounts of training data (Zhang et al., 2004). These techniques can ultimately be used in areas such as call centers (Zweig et al., 2006). But in both translation and in a call center dialog, the quality of the speech is important.

3.6 Implications of Automation

The advantages of this method are cost and scalability, which are demanded by the current paradigm of neural models. This, however, comes at the expense of quality. A limitation of our past work in automation is generalization: text-to-speech centers around a handful of primarily female voices that are consistently decoded, while the voices of real humans are decoded with large variations. Unseen data points are likely to confound a model trained on unnatural data. Furthermore, emotionally realistic speech requires appropriate scope, naturalness, and context (Douglas-Cowie et al., 2003).
Automatic methods, such as text-to-speech, will not be able to address these criteria. Additionally, automated data creation still depends on having quality source data, which often has to come from expert users. In this project, we record found questions that were already written by Quizbowl experts. Writing hundreds of thousands of our own questions would not have been tractable. Hence, expert design is necessary for automation, as implemented in our other automatically-created dataset, which evaluates coreference (Chapter 4).

Chapter 4: Automatic Data Generation without a Source1

Chapter 3 introduces automation for generating data. However, in that project there was a source of found data. How can one automate data generation without an existing source? This chapter uses an expert to design a series of rules to automatically generate a dataset. Specifically, we create an evaluation dataset for coreference resolution, a sub-task of dialog (Section 2.1.3). The limitations of the automatic data will motivate using crowd-sourcing in Chapter 5. The merit of the expert in the process will motivate using them directly in Chapter 6.

1 Equal effort between Benno Krojer, Dario Stojanovski, Denis Peskov, and supervised by Alex Fraser. 2020. In International Conference on Computational Linguistics. Peskov is responsible for part of the template design, selecting concrete nouns for the templates, paper writing, and the video.

4.1 Evaluating Data

Genuinely varied, realistic data is necessary to create models that are robust to minor variations (Neumann et al., 2019). However, equally robust evaluation methodologies are important in ascertaining the quality of the data (Jones, 1994). Current methods, like iaa, focus on quantitative assessments that may inadvertently assess the annotation, but not the generation, quality of a dataset. Since most datasets are evaluated on the same types of data (squad test data is comparable to the training data), the linguistic variation of a dataset is not readily captured by standard quantitative metrics like accuracy or F1. Furthermore, a model that has memorized several key answers upon which it is then tested is not necessarily learning; raw analysis of data overlap confirms this risk (Lewis et al., 2020). Datasets meant to effectively and robustly evaluate trained models can determine how much of a problem this poses ex post facto.

As one solution to this limitation, Checklist (Ribeiro et al., 2020) creates a task-agnostic methodology for testing nlp models. The check is done by replacing words with their synonyms and seeing if task accuracy decreases. We extend this work to a specific task in machine translation. There does not exist a dataset that can serve as a found source, unlike our past automation work (Chapter 3). The dataset we create is designed by experts, specifically native German and native English speakers, and scaled through automation. While a similar dataset of the same size could be created without knowledge of either language, the templates used as test data would prove to be nonsensical or unnatural.

4.2 Meaningful Model Evaluation in Machine Translation

Machine translation is a classic and complex nlp task that requires diverse linguistic knowledge and data in multiple languages (Section 2.3). Classic datasets were often gathered through extensive collaboration with experts. However, recent ones are often created through crowd-sourcing or automatic methods. Therefore, this is an area well-suited to our evaluation techniques.
We focus on German-English coreference resolution as a representative task. The seemingly straightforward translation of the English pronoun it into German requires knowledge at the syntactic, discourse, and world knowledge levels for proper pronoun coreference resolution (cr). A German pronoun can have three genders, determined by its antecedent: masculine (er), feminine (sie), and neuter (es). The nuance of this work requires native knowledge of both English and German.

Accuracy in machine translation is at an all-time high with the rise of neural architectures (Wu et al., 2016), but this metric alone is insufficient for evaluation. Previous work (Hardmeier and Federico, 2010; Miculicich Werlen and Popescu-Belis, 2017; Müller et al., 2018) proposes evaluation methods specifically for pronoun translation. Context-aware neural machine translation (nmt) models are capable of using discourse-level information and are prime candidates for this evaluation.

We ask: Are transformers (Vaswani et al., 2017) truly learning this task, or are they exploiting simple heuristics to make a coreference prediction? To empirically answer this question, we propose extending a contrastive challenge set for automatic English-German pronoun translation evaluation, ContraPro (Müller et al., 2018) (Section 4.6.1), by making small adversarial changes in the contextual sentences.

Adversarial attacks are inputs that are almost indistinguishable from natural data and yet classified incorrectly by the network (Madry et al., 2017). Our attacks on ContraPro show that context-aware Transformer nmt models can easily be misled by simple and unimportant changes to the input. However, interpreting the results obtained from adversarial attacks can be difficult. In our case, trivial changes in language cause incorrect predictions, but both the changes and the prediction would not be noticed by somebody without a mastery of German. nmt uses brittle heuristics to solve cr if trivial changes in pronouns and nouns fool a coreference corpus like ContraPro. However, this will not identify which heuristics these are.

For this reason, we propose a new dataset, created from templates (Section 4.7.1), to systematically evaluate which heuristics are being used in coreferential pronoun translation. Inspired by previous work on cr (Raghunathan et al., 2010; Lee et al., 2011) and language model probing (Ettinger et al., 2016), we create templates tailored to evaluating the specific steps of an idealized cr pipeline. We call this collection Contracat, Contrastive Coreference Analytical Templates. The construction of templates is controlled, enabling us to easily create a large number of coherent test examples and provide unambiguous conclusions about the cr capabilities of nmt. While this methodology depends on automation, a technique shown to be unrealistic for speech in Chapter 3, the templates are written in collaboration between a native German speaker and native English speakers. Since automation is subject to quality control issues, this level of expertise is necessary if the adversarial dataset is to be reflective of actual language used by English and German speakers. The procedure used can be adapted to many language pairs with little effort. We also propose a simple data augmentation approach using fine-tuning. This methodology should not change the way cr is handled by nmt, which would support the hypothesis that automated data techniques have limited applicability.
We release a new dataset, Contracat, and the adversarial modifications to ContraPro. Contracat applies only to coreference, but the investigation of heuristics is an important research direction in nlp that can measure the issues noted with automatic (Chapter 3) and crowd-sourced (Chapter 5) datasets. Heuristics are accurate if there are underlying data limitations; this implies that the training data and the evaluation data resemble one another in superficial ways. Therefore, exposing the brittleness in current datasets motivates the need for higher-quality evaluation data, to observe limitations, and varied training data, to overcome them.

We introduce coreference resolution as a task in Section 4.3, the idealized coreference pipeline in Section 4.4, and the transformer model in Section 4.5. We discuss ContraPro in Section 4.6.1, and explain our proposed templates in Section 4.6.2.

4.3 Why is Coreference Resolution Relevant?

Evaluating discourse phenomena is an important first step in evaluating mt. Apart from document-level coherence and cohesion, anaphoric pronoun translation has proven to be an important testing ground for the ability of context-aware nmt to model discourse. Anaphoric pronoun translation is the focus of several works in context-aware nmt (Bawden et al., 2018; Voita et al., 2018; Stojanovski and Fraser, 2018; Miculicich et al., 2018; Voita et al., 2019; Maruf et al., 2019).

The choice of an evaluation metric for cr is nontrivial. bleu (Papineni et al., 2002) is the standard metric for machine translation that compares similarity between two sentences at the word level. bleu-based evaluation is insufficient for measuring improvement in cr (Hardmeier, 2012) without carefully selecting or modifying test sentences for pronoun translation (Voita et al., 2018; Stojanovski and Fraser, 2018). Alternatives to bleu include F1, partial credit, and oracle-guided approaches (Hardmeier and Federico, 2010; Guillou and Hardmeier, 2016; Miculicich Werlen and Popescu-Belis, 2017). However, Guillou and Hardmeier (2018) show that these metrics can miss important cases and propose semi-automatic evaluation. In contrast, our evaluation will be completely automatic.

We focus on scoring-based evaluation (Sennrich, 2017), which works by creating contrasting pairs and comparing model scores. Accuracy is calculated as how often the model chooses the correct translation from a pool of alternative minimal-edit-distance incorrect translations.2 We are able to scale the size of our adversarial evaluation because the metric is automatic.

2 Specifically, these alternatives are the two other possible pronouns in German.

Our work is related to adversarial datasets for testing robustness used in natural language processing tasks such as studying gender bias (Zhao et al., 2018; Rudinger et al., 2018; Stanovsky et al., 2019), natural language inference (Glockner et al., 2018), and classification (Wang et al., 2019b).

Start (Original Sentence):        The cat and the actor were hungry. It (?) was hungrier.
Step 1 (Markable Detection):      The cat and the actor were hungry. It (?) was hungrier.
Step 2 (Coreference Resolution):  The cat and the actor were hungry. It was hungrier.
Step 3 (Language Translation):    Der Schauspieler und die Katze waren hungrig. Er / Sie / Es war hungriger.

Table 4.1: A hypothetical cr pipeline that sequentially resolves and translates a pronoun.

4.4 Do Androids Dream of Coreference Translation Pipelines?
Imagine a hypothetical coreference pipeline that generates a pronoun in a target language, as illustrated in Table 4.1. First, tag markables, entities that can be referred to by pronouns, in the source sentence, since semantics affect binding (Bach and Partee, 1980).3 Then, detect the subset of animate entities, and separate human entities from other animate ones, since it usually cannot refer to a human entity. Second, resolve coreferences in the source language. This entails addressing phenomena such as world knowledge, pleonastic it, and event references. Third, translate the pronoun into the target language. This requires selecting the correct gender given the referent (if there is one), and selecting the correct grammatical case for the target context (e.g., accusative, if the pronoun is the grammatical object in the target language sentence).

This idealized pipeline would produce the correct pronoun in the target language and allow a human to understand why the pronoun decision was made. These coreference steps resemble the rule-based approach implemented in Stanford Corenlp's CorefAnnotator (Raghunathan et al., 2010; Lee et al., 2011) and superficially resemble the three-pronged formulation of Discourse Prominence Theory (Gordon and Hendrick, 1998). However, nmt models are unable to decouple the individual steps of this pipeline, even if they are able to produce the correct pronoun. We propose to isolate each of these steps through targeted examples to understand where the nmt model made its decision.

3 We restrict ourselves to concrete entities, as concepts are incompatible with many verbs.

4.5 Model

We use a transformer model (Section 2.3.4) for all experiments. The context-aware model in our experimental setup is a concatenation model (Tiedemann and Scherrer, 2017) (concat), which is trained on a concatenation of consecutive sentences. concat is a standard transformer model, and it differs from the sentence-level model only in the way that the training data is supplied to it. Previously, attention-based models discarded information outside of sentence boundaries. Tiedemann and Scherrer (2017) do not modify the model architecture but concatenate preceding and subsequent sentences to the sentence being translated. We train a sentence-level model without any additional concatenation as a baseline.4

4 The training examples for this model are modified by prepending the previous source and target sentence to the main source and target sentence. The previous sentence is separated from the main sentence with a special token on both the source and target side. This also applies to how we prepare the ContraPro and Contracat data. We train the concatenation model on OpenSubtitles2018 data prepared in this way. We remove documents overlapping with ContraPro.

4.6 ContraPro: Adversarial Attacks on an Adversarial Dataset

ContraPro (Müller et al., 2018), a contrastive challenge set (Section 4.2), has limitations that our new dataset, Contracat, will address.

4.6.1 About ContraPro

ContraPro is a contrastive challenge set for English-German pronoun translation evaluation. The set consists of English sentences containing an anaphoric pronoun it and the corresponding German translations (e.g., "Give me your hand, ah, it's soft and hot, and it feels pleasant" translated as "Gib deine Hand, ah, sie ist weich und warm, und wohlig fühlt sie sich an."). It contains three contrastive translations, differing based on the gender of the translation of it: er, sie, or es.
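To make the contrastive evaluation concrete, a minimal sketch of how such an example can be scored; the score function stands in for a trained nmt model's probability of producing a translation and is hypothetical, and the example is a shortened version of the one above.

```python
def contrastive_accuracy(examples, score):
    """score(source, target) -> the model's (log-)probability of producing target.
    An example counts as correct when the gold translation outscores both
    contrastive variants, i.e., the two other German pronouns."""
    correct = 0
    for source, gold, alternatives in examples:
        gold_score = score(source, gold)
        if all(gold_score > score(source, alt) for alt in alternatives):
            correct += 1
    return correct / len(examples)

# One ContraPro-style example: "it" must become the feminine "sie" (antecedent: die Hand)
examples = [(
    "Give me your hand, ah, it's soft and hot.",
    "Gib deine Hand, ah, sie ist weich und warm.",
    ["Gib deine Hand, ah, er ist weich und warm.",
     "Gib deine Hand, ah, es ist weich und warm."],
)]

# A stand-in scorer (purely illustrative); a real one would query the trained concat model
dummy_score = lambda src, tgt: -len(tgt)
accuracy = contrastive_accuracy(examples, dummy_score)
```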
4.6 ContraPro: Adversarial Attacks on an Adversarial Dataset

ContraPro (Müller et al., 2018), a contrastive challenge set (Section 4.2), has limitations that our new dataset, ContraCAT, will address.

4.6.1 About ContraPro

ContraPro is a contrastive challenge set for English-German pronoun translation evaluation. The set consists of English sentences containing an anaphoric pronoun it and the corresponding German translations (e.g., "Give me your hand, ah, it's soft and hot, and it feels pleasant!" → "Gib deine Hand, ah, sie ist weich und warm, und wohlig fühlt sie sich an."). It contains three contrastive translations, differing based on the gender of the translation of it: er, sie, or es.

The challenge set artificially balances the number of sentences where it is translated to each of these three German pronouns. The appropriate antecedent may be in the main sentence or in a previous sentence. For evaluation, a model needs to produce scores for all three possible translations, which are compared against ContraPro's gold labels.

There may be an inherent skew in the data, since it is drawn from movie dialogues rather than generated specifically for testing neural coreference resolution. To ferret out any bias the model learned from this training set, we generate automatic adversarial attacks on ContraPro that modify the theoretically inconsequential parts of the context sentence before the occurrence of it. Coreference accuracy degrades under these attacks, suggesting that our transformer model is affected by inconsequential priming and that the original dataset did not have an equal distribution across the three pronouns.

4.6.2 Adversarial Attack Generation

Our three modifications, summarized here, are explained in detail in the following sections:

1. Phrase Addition: appending and prepending phrases containing implausible antecedents: "The Church is merciful but that's not the point. It always welcomes the misguided lamb."

2. Possessive Extension: extending the original antecedent with a possessive noun phrase: "I hear her [→ the doctor's] voice! It resounds to me from heights and chasms a thousand times!"

3. Synonym Replacement: replacing the original German antecedent with a synonym of a different gender: "The curtain rises. It rises." → "Der Vorhang [→ Die Gardine] geht hoch. Er [→ Sie] geht hoch."5

Phrase Addition can be applied to all 12,000 ContraPro examples. The second and third attacks can only be applied to 3,838 and 1,531 examples, respectively, because of the sentence properties each attack requires.

5 der Vorhang (masc.) and die Gardine (fem.) are synonyms meaning curtain.

4.6.2.1 Phrase Addition

This attack modifies the previous sentence by appending phrases such as ". . . but he wasn't sure" and prepending phrases such as "it is true: . . .". A range of other simple phrases can be used, which we leave out for simplicity. All phrases we tried lowered scores. These attacks introduce either a human entity or an event-reference it (e.g., "it is true"), neither of which is a plausible antecedent for the anaphoric it.

4.6.2.2 Possessive Extension

This attack introduces a new human entity by extending the original antecedent A with a possessive noun phrase, e.g., "the woman's A". Only two-thirds of the 12,000 ContraPro sentences are linked to an antecedent phrase. Grammar and misannotated antecedents exclude half of the remaining phrases. We put pos-tag constraints on the antecedent phrases before extending them. This filters our subset to 3,838 modified examples. Our possessive extensions can be humans (e.g., the woman's), organizations (e.g., the company's), and names (e.g., Maria's), and each is applied to the pertinent examples.

4.6.2.3 Synonym Replacement

The Synonym Replacement attack gets to the core of whether nmt uses cr heuristics, as understanding the pronoun-noun relationship is paramount to predicting the correct pronoun. This attack modifies the original German antecedent by replacing it with a German synonym of a different gender. For this, we first identify the English antecedent and its most frequent synset in WordNet (Miller, 1995b). We obtain a German synonym by mapping this WordNet synset to GermaNet (Hamp and Feldweg, 1997) synsets.
Finally, we modify the correct German pronoun translation to correspond to the gender of the antecedent synonym. Approximately one quarter of the nouns in our ContraPro examples are found in GermaNet; in 1,531 of these cases, a synonym of a different gender could be identified.

Figure 4.1: The concat model predicts a lower percentage of coreferences correctly when faced with our three adversarial ContraPro attacks. "Attacks concat" shows the drop that our adversarial attacks cause relative to "ContraPro concat". Phrase: prepending "it is true: . . .". Possessive: replacing the original antecedent A with "Maria's A". Synonym: replacing the original antecedent with different-gender synonyms.6

6 The adversarial attacks modify the context, therefore the baseline model's results on the attacks are unchanged and we omit them. Results for Phrase Addition are computed based on all 12,000 ContraPro examples, while for Possessive Extension and Synonym Replacement we only use the suitable subsets of 3,838 and 1,531 ContraPro examples.

4.6.3 Quality Assessment of the Automatic Attacks by an Expert

We evaluate a random sample of 100 auto-modified examples as a quality control metric. There are 11 issues with semantically inappropriate synonyms. The model switches from correct to incorrect predictions because of synonym replacement in 10 of the remaining 89 appropriate examples.7 A correct synonym replacement example is:

Es gab einen Brief. Und er war von Sergis Bauer. → Es gab ein Schreiben. Und es war von Sergis Bauer.

which both mean "There was a letter. It was from Sergis Bauer." One incorrect synonym replacement that German expert evaluation uncovered is:

Mein Tisch war so schön gedeckt. Oh, er war hübsch. → Meine Tabelle war so schön gedeckt. Oh, sie war hübsch.

which means "My table was neatly decorated. It was pretty." Both Tisch and Tabelle translate to table, but one is furniture while the other is a matrix. This does not undermine the coreference evaluation itself, since the antecedent is correctly referenced, but it does unintentionally create a semantically implausible sentence.8

7 Four switches occur from semantically inappropriate synonyms.
8 Our templates do not allow for antecedent disagreement to be created in the first place, so there are no direct coreference issues.

4.6.4 Evaluating Adversarial Attacks

Intuitively, the adversarial attacks should not contribute to large drops in scores, since no meaningful changes are being made. If the model accuracy drops somewhat, but not all the way to the original sentence-level baseline (Section 4.5), we can conclude that the concatenation model handles cr, but likely with brittle heuristics. If the model accuracy drops all the way to the baseline, then the model is memorizing the inputs. The changes in accuracy suggest issues, but do not ascertain what they are. This problem in pronoun translation evaluation cannot be addressed with simple adversarial attacks on existing general-purpose challenge sets.
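Before turning to ContraCAT, the following sketch illustrates how a Synonym Replacement candidate (Section 4.6.2.3) could be generated. The English side uses NLTK's WordNet interface; `german_synonyms_with_gender` is a hypothetical helper standing in for GermaNet access plus a gender lexicon, neither of which ships with NLTK.

```python
from nltk.corpus import wordnet as wn


def synonym_replacement(en_antecedent, de_antecedent, de_gender,
                        german_synonyms_with_gender):
    """Return a German synonym with a *different* gender, or None."""
    synsets = wn.synsets(en_antecedent, pos=wn.NOUN)
    if not synsets:
        return None
    most_frequent = synsets[0]  # WordNet lists synsets in frequency order
    for de_word, gender in german_synonyms_with_gender(most_frequent):
        if de_word != de_antecedent and gender != de_gender:
            return de_word, gender
    return None

# synonym_replacement("curtain", "Vorhang", "masc", germanet_lookup) could
# return ("Gardine", "fem"), which then forces the gold pronoun er -> sie.
```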
4.7 ContraCAT: A Fine-Grained Adversarial Dataset

We propose ContraCAT, a more systematic approach that targets each of the previously outlined cr pipeline steps with data synthetically generated from corresponding templates. Automatic adversarial attacks offer less freedom than templates, as many systematic modifications cannot be applied to the average sentence. Thus, our ContraCAT templates are built on the hypothetical coreference pipeline in Section 4.4 and target each of its three steps: 1) Markable Detection, 2) Coreference Resolution, and 3) Language Translation. Our minimalistic templates draw entities from sets of animals, human professions (McCoy et al., 2019), foods, and drinks, along with associated verbs and attributes. We use these sets to fill slots in our templates. Animals and foods are natural choices for subject and object slots referenced by it. Restricting our sets to interrelated concepts with generically applicable verbs (all animals eat and drink) ensures semantic plausibility. Other object sets, such as buildings, would cause semantic implausibility with certain verbs.

4.7.1 Template Generation

Our templates consist of a previous sentence that introduces at least one entity and a main sentence containing the pronoun it. We use contrastive evaluation to judge anaphoric pronoun translation accuracy for each template; we create three translated versions, one for each German gender, corresponding to an English sentence, e.g., "The cat ate the egg. It rained." and the corresponding "Die Katze hat das Ei gegessen. Er/Sie/Es regnete." To fill a template, we only draw pairs of entities with two different genders, i.e., for animal a and food f: gender(a) ≠ gender(f). This way we can determine whether the model has picked the right antecedent.

First, we create templates that analyze the priors of the model for choosing a pronoun when no correct translation is obvious. Then, we create templates with correct translations, guided by the three broad coreference steps. Table 4.2 provides examples of our templates.

Priors
  Grammatical Role: The cat ate the egg. It (cat/egg) was big.
  Order: I stood in front of the cat and the dog. It (cat/dog) was big.
  Verb: Wow! She unlocked it.
Markable Detection
  Filter Humans: The cat and the actress were happy. However it (cat) was happier.
Coreference Resolution
  Lexical Overlap: The cat ate the apple and the owl drank the water. It (cat/owl) ate the apple quickly.
  World Knowledge: The cat ate the cookie. It (cat) was hungry.
  Pleonastic it: The cat ate the sausage. It was raining.
  Event Reference: The cat ate the carrot. It came as a surprise.
Language Translation
  Antecedent Gender: I saw a cat. It (cat) was big. → Ich habe eine Katze gesehen. Sie (cat) war groß.

Table 4.2: Template examples targeting different cr steps and substeps. For German, we create three versions with er, sie, or es as different translations of it.

4.7.2 Priors

Our templates that test prior biases do not have a correct answer but reveal the model's biases. We expose three priors with our templates: 1) a grammatical-role prior (e.g., subject), 2) a position prior (e.g., first antecedent), and 3) a general prior if no antecedent and only a verb is present. For the first prior, we create a Grammatical Role template where both subject and object are valid antecedents. For the second prior, we create a Position template where two objects are enumerated, as shown in Table 4.2. We create an additional example where the entities' order is reversed and test whether there are priors for specific nouns or, alternatively, positions in the sentence. For the third prior, we create a Verb template, expecting that certain transitive verbs trigger a particular object gender choice. We use 100 frequent transitive verbs and create sentences such as the example in Table 4.2.
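A minimal sketch of template filling follows. The entity lists, genders, and the single World Knowledge template are small illustrative samples with a simplified German surface form; they do not reproduce the full generation code.

```python
from itertools import product

ANIMALS = {"cat": ("die Katze", "fem"), "dog": ("der Hund", "masc")}
FOODS = {"egg": ("das Ei", "neut"), "cookie": ("den Keks", "masc")}  # accusative forms
PRONOUN = {"masc": "Er", "fem": "Sie", "neut": "Es"}


def world_knowledge_examples():
    """'The X ate the Y. It was hungry.' -- it should refer to the animal."""
    examples = []
    for (a_en, (a_de, a_g)), (f_en, (f_de, f_g)) in product(ANIMALS.items(),
                                                            FOODS.items()):
        if a_g == f_g:          # only keep entity pairs with two different genders
            continue
        source = f"The {a_en} ate the {f_en}. It was hungry."
        context = a_de[0].upper() + a_de[1:] + f" hat {f_de} gegessen."
        # One German target per gender; the animal's gender marks the correct one.
        targets = {g: f"{context} {p} war hungrig." for g, p in PRONOUN.items()}
        examples.append((source, a_g, targets))
    return examples
```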
4.7.3 Markable Detection with a Humanness Filter

Before doing the actual cr, the model needs to identify all possible entities that it can refer to. We construct a template that contains a human and an animal, both of which are in principle plausible antecedents, if not for the condition that it does not refer to people. For instance, the model should always choose cat in "The actress and the cat are hungry. However it is hungrier."

4.7.4 Coreference Resolution

Having determined all possible antecedents, the model chooses the correct one, relying on semantics, syntax, and discourse. The pronoun it can in principle be used as an anaphoric (referring to entities), event-reference, or pleonastic pronoun (Loáiciga et al., 2017). For the anaphoric it, we identify two major ways of identifying the antecedent: lexical overlap and world knowledge. Our templates for these categories are meant to be simple and solvable.

Overlap: Broadly speaking, the subject, verb, or object can overlap from the previous sentence to the main sentence, as can combinations of them. This gives us five templates: subject-overlap, verb-overlap, object-overlap, subject-verb-overlap, and object-verb-overlap. We always use the same template for the context sentence, e.g., "The cat ate the apple and the owl drank the water." For the object-verb-overlap we would then create the main sentence "It ate the apple quickly." and expect the model to choose cat as the antecedent. To keep our overlap templates order-agnostic, we vary the order in the previous sentence by also creating "The owl drank the water and the cat ate the apple."

World Knowledge: cr has traditionally been seen as challenging because it requires world knowledge. Our templates test simple forms of world knowledge by using attributes that apply either to animal or to food entities, such as cooked for food or hungry for animals. We then evaluate whether the model chooses, e.g., cat in "The cat ate the cookie. It was hungry."

Pleonastic and Event Templates: For the other two ways of using it, event reference and pleonastic it, we again create a default previous sentence ("The cat ate the apple."). For the main sentence, we use four typical pleonastic and event-reference phrases such as "It is a shame" and "It came as a surprise". We expect the model to correctly choose the neuter es as the translation every time.

4.7.5 Translation to German

After cr, the decoder has to translate from English to German. In our contrastive scoring approach, the translation of the English antecedent into German is already given. However, the decoder is still required to know the gender of the German noun to select between er, sie, or es. We test this with a list of concrete nouns selected from Brysbaert et al. (2014), which we filter for nouns that occur more than 30 times in the training data. This selects 2,051 nouns that are substituted for N in: "I saw a N. It was {big, small}."

4.7.6 Results

The concat model becomes less accurate when actual cr is required. It frequently falls back to choosing the neuter es or preferring a position (e.g., the first of two entities) for determining the gender. For Markable Detection, the model always predicts the neuter es regardless of the actual genders of the entities. In the Overlap template, the model fails to recognize the overlap and has a general preference for one of the two clauses. In the case of verb-overlap, the model has an accuracy of 64.1% if the verb overlaps with the first clause ("The cat ate and the dog drank.
It ate a lot."), but a low accuracy of 39.0% when the verb overlaps with the second clause ("The cat ate and the dog drank. It drank a lot."). The overall accuracy for the overlap templates is 47.2%, with little variation across the types of overlap. Adding more overlap, e.g., by overlapping both the verb and the object ("It ate the apple happily"), yields no improvement. Overall, the model pays little attention to overlaps when resolving pronouns.

Figure 4.2: Results comparing the sentence-level baseline to concat on ContraCAT. Pronoun translation pertaining to World Knowledge and language-specific Gender Knowledge benefits the most from additional context.

The model occasionally predicts answers that require world knowledge, but most predictions are guided by a prior for choosing the neuter es or a prior for the subject. An accuracy of 55.7% is only slightly above the heuristic of randomly choosing an entity (50.0%). This same neuter-es bias gives the model a high accuracy of 96.2% on the event-reference and pleonastic templates, where es is always the correct answer. Based on the high accuracy on the Gender template (Section 4.7.5), we conclude that the model consistently memorized the gender of concrete nouns. Hence, cr mistakes stem from Step 1 or Step 2, suggesting that the model failed to learn proper cr.

4.8 Augmentation

We present an approach for augmenting ContraPro to improve cr. Augmentation systematically expands the data to improve a model's robustness (Kafle et al., 2017). While augmentation is challenging for nlp, we focus on a narrow problem which lends itself to easier data manipulation. Figure 4.2 shows that our model is capable of modeling the gender of nouns. However, there is a strong prior for translating it to es and hence little intelligent cr capability. Our goal with the augmentation is to alter the prior and test whether this can improve cr in the model.

We augment our training data and call the result antecedent-free augmentation (afa). We identify candidates for augmentation as sentences where a coreferential it refers to an antecedent not present in the current or previous sentence (e.g., I told you before. It is red. → Ich habe dir schonmal gesagt. Es ist rot.). We create augmentations by adding two new training examples where the gender of the German translation of it is modified (e.g., the two new targets are "Ich habe dir schonmal gesagt. Er ist rot." and "Ich habe dir schonmal gesagt. Sie ist rot."). The source side remains the same. Table 4.3 provides an additional example. Antecedents and coreferential pronouns are identified using a cr tool (Clark and Manning, 2016a,b).

Antecedent-free augmentation
Source: You let me worry about that. How much you take for it?
Reference: Lassen Sie das meine Sorge sein. Wie viel kostet er?
Augmentation 1: Lassen Sie das meine Sorge sein. Wie viel kostet sie?
Augmentation 2: Lassen Sie das meine Sorge sein. Wie viel kostet es?

Table 4.3: Examples of training data augmentations. The source side of the augmented examples remains the same.

We fine-tune our already trained concatenation model on a dataset consisting of the candidates and the augmented samples. As a baseline, we fine-tune on the candidates only, so that we can confidently say that any potential improvements come from the augmentations.
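The following is a minimal sketch of how afa targets can be produced for one candidate; the regex-based pronoun detection is a toy stand-in for the coreference tool used in practice.

```python
import re

SWAPS = {"Es": ["Er", "Sie"], "Er": ["Es", "Sie"], "Sie": ["Es", "Er"]}


def augment(source, target):
    """Return two (source, target) pairs with the German pronoun gender swapped."""
    match = re.search(r"\b(Er|Sie|Es)\b", target)
    if match is None:
        return []
    return [(source, target[:match.start()] + new + target[match.end():])
            for new in SWAPS[match.group(1)]]


pairs = augment("I told you before. It is red.",
                "Ich habe dir schonmal gesagt. Es ist rot.")
# -> targets "... Er ist rot." and "... Sie ist rot."; the source stays unchanged.
```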
4.8.1 Augmentation Improves Coreference Accuracy

Augmentation improves coreference accuracy on both ContraPro and ContraCAT. Details are provided in the following sections.

Figure 4.3: Results comparing unaugmented and augmented concat on ContraPro and the same three attacks as in Figure 4.1. Results with non-augmented concat are the same as in Figure 4.1.

4.8.2 ContraPro Results

afa provides large improvements, scoring 85.3% on ContraPro (Figure 4.3). Since the datasets themselves are slightly different due to the augmentation, we must recompute the baseline. The afa baseline (fine-tuning on the augmentation candidates only) is higher by 1.94%, presumably because many candidates consist of coreference chains of it and the model learns that they are important for coreferential pronouns. This improvement in the baseline is small compared to the afa improvements in the full models.

Prediction accuracy on er and sie is substantially increased, suggesting that the augmentation removes the strong bias towards es. Although the adversarial attacks lower afa scores, the model is more robust than concat and the accuracy degradation is substantially lower (except on the synonym attack). We experiment with different learning rates during fine-tuning and present results with the learning rate that obtains the best baseline ContraPro score. Furthermore, concat and afa obtain 31.5 and 32.2 bleu on ContraPro, showing that this fine-tuning procedure, which is tailored to pronoun translation, does not lead to any degradation in overall translation quality.

4.8.3 ContraCAT Results

Figure 4.4: ContraCAT results with unaugmented and augmented concat. We speculate that readjusting the prior over genders in augmented concat explains the improvements on Markables and Overlap.

The prior in ContraCAT over gender pronouns is less concentrated on es than in ContraPro. This provides for a more even distribution on the Position and Role Prior templates. The augmented model has higher accuracy on Markable Detection, improving by 27.6%. Results for the templates are in Figure 4.4.

No improvements are observed on the World Knowledge template. Pleonastic cases are still accurate, although not perfect as with concat. The Event template identifies a systematic issue with our augmentation. We presume this is due to the cr tool marking cases where it refers to events. We do not apply any filtering and augment these cases as well, thus creating wrong examples (an event-reference it cannot be translated to er or sie). As a result, the scores are lower compared to concat. This issue with our model is not visible on ContraPro and the adversarial attack results; in contrast, the Event template easily identifies the problem.

afa has a similar accuracy to the unaugmented baseline on the Gender template. However, despite increasing by 3.8%, results on Overlap are still underwhelming. Our analysis shows that augmentation helps in changing the prior. We believe this provides for improved cr heuristics, which in turn provide for an improvement in coreferential pronoun translation. Nevertheless, the Overlap template shows that augmented models still do not solve cr in a fundamental way.

4.9 Our Dataset in Context

Addressing discourse phenomena is important for high-quality mt.
Apart from document-level coherence and cohesion, anaphoric pronoun translation has proven to be an important testing ground for the ability of context-aware nmt to model discourse. Anaphoric pronoun translation is the focus of several works in context-aware nmt (Bawden et al., 2018; Voita et al., 2018; Stojanovski and Fraser, 2018; Miculicich et al., 2018; Voita et al., 2019; Maruf et al., 2019).

Bawden et al. (2018) manually create such a contrastive challenge set for English-French pronoun translation. ContraPro (Müller et al., 2018) follows this work, but creates the challenge set in an automatic way. We show that making small variations in ContraPro substantially changes the accuracy scores, precipitating our new dataset.

Jwalapuram et al. (2019) propose a model for pronoun translation evaluation trained on pairs of sentences consisting of the reference and a system output with differing pronouns. However, as Guillou and Hardmeier (2018) point out, this fails to take into account that there is often not a 1:1 correspondence between pronouns in different languages, and that a system translation may be correct despite not containing the exact pronoun in the reference, and incorrect even if containing the pronoun in the reference, because of differences in the translation of the referent. Moreover, introducing a separate model which needs to be trained before evaluation adds an extra layer of complexity to the evaluation setup and makes interpretability more difficult. In contrast, templates can easily be used to pinpoint specific issues of an nmt model. Our templates follow previous work where similar tests are proposed for diagnosing language models (Marelli et al., 2014; Ettinger et al., 2016; Ribeiro et al., 2018; McCoy et al., 2019; Ribeiro et al., 2020).

4.10 Implications for Machine Translation and Automation

In this work, we study how and to what extent cr is handled in context-aware nmt. This work shows that standard challenge sets can easily be manipulated with adversarial attacks that cause dramatic drops in performance, suggesting that nmt uses a set of heuristics to solve the complex task of cr. Attempting to diagnose the underlying reasons, we propose targeted templates which systematically test the different aspects necessary for cr. This analysis shows that while some types of cr, such as pleonastic and event cr, are handled well, nmt does not solve the task in an abstract sense. We also propose a data augmentation approach to see if simple data modifications can improve model accuracy. This methodology illustrates how dependent models are on their data and strengthens our claim that low-cost data generation techniques are creating datasets that approximate rather than solve nlp tasks.

Having identified limitations in existing models, we argue for concrete data extensions for coreference resolution. This methodology (creating an adversarial dataset that tests the understanding of a model) can be applied to most nlp tasks. This project introduces using an expert, in this case a native German speaker, in designing the dataset. However, we use templates rather than experts to automatically scale the size of the dataset. While we can create large datasets, they end up (literally) formulaic. Solving tasks like coreference, rather than just noting shortcomings of current datasets, will require building complex and nuanced datasets that allow a model to learn the edge cases of the task.
These datasets will ultimately have to be built by humans and not automation: can the crowd be a reliable source of language?

Chapter 5: Crowd-Sourced Generation1

Chapters 3 and 4 use automation to provide data to solve a task; however, some data cannot be automatically generated from templates and requires human assistance. Crowd-sourcing platforms (Section 2.2.3), specifically Mechanical Turk (Buhrmester et al., 2011), are a cost-efficient, scalable pool for human input. We summarize a data collection project, canard, that uses non-expert workers to advance question answering (Section 2.1.2) through rewriting trivia questions.

Conversational question answering (cqa) questions differ from machine reading comprehension (mrc) ones in format (Section 2.1.2); however, cqa questions can be rewritten as stand-alone mrc questions to generate additional training data. We reduce challenging, interconnected cqa examples to independent, stand-alone mrc questions to create canard (Context Abstraction: Necessary Additional Rewritten Discourse), a new dataset that rewrites quac (Choi et al., 2018) questions.2 Language models train on these stand-alone questions with greater flexibility than on cqa ones. Decoupling them allows for new training and test splits. Additionally, successfully rewriting questions to be independent precipitates rewriting questions to be novel.

1 Ahmed Elgohary, Denis Peskov, and Jordan Boyd-Graber. 2019. Can You Unpack That? Learning to Rewrite Questions-in-Context. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. Peskov is responsible for manual quality control in the data collection process, analysis of the data and model predictions, part of paper writing, and figure & table design.
2 http://canard.qanta.org

Q1: What happened in 1983? → What happened to Anna Vissi in 1983?
A1: In May 1983, she marries Nikos Karvelas, a composer
Q2: Did they have any children? → Did Anna Vissi and Nikos Karvelas have any children together?
A2: In November, she gave birth to her daughter Sofia
Q3: Did she have any other children? → Did Anna Vissi have any other children than her daughter Sofia?
A3: I don't know

Figure 5.1: Our question-in-context rewriting task. The input to each step is a question to rewrite given the dialog history, which consists of the dialog utterances (questions and answers) produced before the given question is asked. The output is an equivalent, context-independent paraphrase of the input question. Crowd-workers are needed to provide these missing details as the omissions are non-formulaic.

We crowd-source context-independent paraphrases of quac questions and use the paraphrases to train and evaluate question-in-context rewriting. In the process, we observe the behavior of crowd users and the quality of their output. Section 5.1 constructs canard, a new dataset of questions-in-context with corresponding context-independent paraphrases. Section 7.6 analyzes our rewrites (and the underlying methodology) to understand the linguistic phenomena that make cqa, and using crowd-sourcing for generation, difficult.

Characteristic                    Ratio
Answer Not Referenced             0.98
Question Meaning Unchanged        0.95
Correct Coreferences              1.0
Grammatical English               1.0
Understandable w/o Context        0.90

Table 5.1: Manual inspection of 50 rewritten context-independent questions from canard suggests that the new questions have enough context to be independently understandable.
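For concreteness, a single canard example can be represented as follows; the class and field names are illustrative, and the conversation is the one shown in Figure 5.1.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RewriteExample:
    history: List[str]  # questions and answers asked before the current question
    question: str       # the context-dependent question shown to the worker
    rewrite: str        # the worker's context-independent paraphrase


example = RewriteExample(
    history=["What happened to Anna Vissi in 1983?",
             "In May 1983, she marries Nikos Karvelas, a composer"],
    question="Did they have any children?",
    rewrite="Did Anna Vissi and Nikos Karvelas have any children together?",
)
```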
5.1 Dataset Construction

We elicit paraphrases from human crowdworkers to make previously context-dependent questions unambiguously answerable. Through this process, we resolve difficult coreference linkages and create a pair-wise mapping between ambiguous and context-enriched questions. We derive canard from quac (Choi et al., 2018), a sequential question answering dataset about specific Wikipedia sections. quac uses a pair of workers, a "student" and a "teacher", to ask and respond to questions. The "student" asks questions about a topic based only on the title of the Wikipedia article and the title of the target section. The "teacher" has access to the full Wikipedia section and provides answers by selecting text that answers the question. With this methodology, quac gathers 98k questions across 13,594 conversations.

We take their entire dev set and a sample of their train set and create a custom JavaScript task in Mechanical Turk that allows workers to rewrite these questions. JavaScript hints help train the users and provide automated, real-time feedback. We provide workers with a comprehensive set of instructions and task examples (Figure 5.2). We ask them to rewrite the questions in natural-sounding English while preserving the sentence structure of the original question. We discourage workers from introducing new words that are unmentioned in the previous utterances and ask them to copy phrases from the original question when appropriate. These instructions ensure that the rewrites only resolve conversation-dependent ambiguities. Thus, we encourage workers to create minimal edits.

Figure 5.2: The interface for our task guides workers in real-time.

We display the questions in the conversation one at a time, since the rewrites should include only the previous utterances. After a rewrite to the question is submitted, the answer to the question is displayed. The next question is then displayed. This repeats until the end of the conversation. Figure 5.2 displays the full set of instructions and the data collection interface.

We apply quality control throughout our collection process, given the known generation issues (Section 2.2.3). During the task, JavaScript checks automatically monitor and warn about common errors: submissions that are abnormally short (e.g., "why"), rewrites that still have pronouns (e.g., "he wrote this album"), or ambiguous words (e.g., "this article", "that"). Many quac questions ask about "what/who else" or ask for an "other" or "another" entity. For that class of questions, we ask workers to use a phrase such as "other than", "in addition to", "aside from", "besides", "together with", or "along with" with the appropriate context in their rewrite.

We gather and review our data in batches to screen potentially compromised data or low-quality workers. A post-processing script flags suspicious rewrites and workers who take an abnormally long or short time. We flag about 15% of our data. Every flagged question is manually reviewed by one of the authors, and an entire hit is discarded if one question is deemed inadequate. We reject 19.9% of submissions, and the rest comprise canard. Additionally, based on these rejections, we filter under-performing workers out of subsequent batches. To minimize risk, we limit the initial pool of workers to those that have completed 500 hits with over 90% accuracy and offer competitive payment of $0.50 per hit.
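The following is a minimal sketch of the kind of automatic checks described above; the thresholds and word lists are illustrative rather than the exact ones used in the JavaScript interface and post-processing script.

```python
AMBIGUOUS = {"he", "she", "it", "they", "his", "her", "their", "this", "that"}


def flag_rewrite(rewrite, seconds_taken,
                 min_words=4, min_seconds=5, max_seconds=600):
    """Return a list of reasons why a rewrite should be manually reviewed."""
    reasons = []
    tokens = rewrite.lower().split()
    if len(tokens) < min_words:
        reasons.append("abnormally short")
    if AMBIGUOUS & set(tokens):
        reasons.append("still contains a pronoun or ambiguous word")
    if not (min_seconds <= seconds_taken <= max_seconds):
        reasons.append("abnormally fast or slow submission")
    return reasons


flag_rewrite("why?", seconds_taken=2)  # -> two reasons, so it goes to manual review
```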
We verify the efficacy of our quality control through manual review. A random sample of fifty questions from the final dataset is reviewed for desirable characteristics by a native English speaker (Table 5.1). Each of the positive traits occurs in 90% or more of the questions. Based on our sample, our edits retain grammaticality, leave the question meaning unchanged, and use pronouns unambiguously. In one question, a part of the answer is introduced in the rewrite. In five questions, some of the context is under-specified. These infrequent mistakes should not affect our models. We provide examples of failures in Table 5.2.

We use the rewrites of quac's development set as our test set (5,571 question-in-context and corresponding rewrite pairs) and use a 10% sample of quac's training set rewrites as our development set (3,418); the rest are training data (31,538).

ORIGINAL: Was this an honest mistake by the media?
REWRITE: Was the claim of media regarding Leblanc's room come to true?

ORIGINAL: What was a single from their album?
REWRITE: What was a single from horslips' album?

ORIGINAL: Did they marry?
REWRITE: Did Hannah Arendt and Heidegger marry?

Table 5.2: Not all rewrites correctly encode the context required to answer a question. We take two failures to provide examples of the two common issues: Changed Meaning (top) and Needs Context (middle). We provide an example with no issues (bottom) for comparison.

5.2 Dataset Analysis

We analyze and discuss our datasets with automatic metrics. We compare our dataset to the original quac questions and to automatically generated questions. We generate the questions with a pronoun-substitution baseline that substitutes the Wikipedia title for the pronoun and with a simple seq2seq model (Section 2.3.4).3 Then, we manually inspect the sources of rewriting errors made by the model. Further improvements for the asr dataset and canard are possible.

3 We use a bidirectional lstm encoder-decoder model with word embeddings shared between the encoder and the decoder. We initialize the embeddings with GloVe (Pennington et al., 2014). We construct the input sequence by concatenating all utterances in the history, prepending them to the message, and adding a special separator token between utterances. Our collected data is split between a training, dev, and test set.

5.2.1 Anaphora Resolution and Coreference

Our rewrites are longer, contain more nouns and fewer pronouns, and have more word types than the original data. Machine output lies in between the two human-generated corpora, but its quality is difficult to assess. Figure 5.3 shows these statistics. We motivate our rewrites by exploring linguistic properties of our data.

Figure 5.3: Human rewrites are longer, have fewer pronouns, and have more proper nouns than the original quac questions. Rewrites are longer and contain more proper nouns than our Pronoun Sub baseline and trained Seq2Seq model.

Anaphora resolution and coreference are two core nlp tasks applicable to this dataset. Pronouns occur in 53.9% of quac questions. Questions with pronouns are more likely to be ambiguous than those without. Only 0.9% of these have pronouns that span more than one category (e.g., "she" and "his"). Hence, pronouns within a single sentence are likely unambiguous. However, when viewing the question as an aggregate of sentences, 75.0% of the full questions have pronouns and 27.8% have mixed-category pronouns.
Therefore, pronoun disambiguation potentially becomes a problem for a quarter of the original data. For example, "Did they argue?" is impossible to answer without context. However, filling this question in with the appropriate context ("Did Johnson and Bird argue?") allows basketball superfans to answer with a resounding "yes". A full example is provided in Table 5.3.

QUESTION: How long did he stay there?
REWRITE: How long did Cito Gaston stay at the Jays?
HISTORY: Cito Gaston
  Q: What did Gaston do after the world series? . . .
  Q: Where did he go in 2001?
  A: In 2002, he was hired by the Jays as special assistant to president and chief executive officer Paul Godfrey.

Table 5.3: An example that had over ten flagged proper nouns in the history. Rewriting requires resolving challenging coreferences.

Approximately one-third of the questions generated by our pronoun-replacement baseline are within 85% string similarity of our rewritten questions. Automatic methods (Chapters 3 and 4) can quickly but somewhat inaccurately replace pronouns with a default phrase. That leaves two-thirds of our data that cannot be solved with pronoun resolution alone. These cannot be produced without a human in the loop.

5.3 Conclusion

Rewriting questions is a challenging stand-alone task and has obvious benefits for question answering. Question rewriting has been formalized as Conversational Question Reformulation (Lin et al., 2020; Vakulenko et al., 2020). Qu et al. (2020) use canard in expanding quac for open-domain question answering.

More broadly, canard is representative of crowd-sourcing for generation. The clear limitation of generalist crowd-sourcing is the inability to automatically quality-control generated data. Our work requires manual analysis of each sentence submitted by the crowd; this is time-intensive and subject to error. Additionally, it requires real-time task monitoring and user exclusion, as malicious users can otherwise quickly contribute a large part of a crowd-sourced task. However, this method generates more diverse and lengthy sentences than comparable automation projects (Chapters 3 and 4). One way to handle the quality control issue is by using an expert for both generation and for quality assessment (Chapter 6).

Chapter 6: Expert Annotation and Evaluation1

We introduce a new computational task, adaptation, where the gold standard is subjective and all-important, thereby requiring authoritative experts rather than the anonymous crowd (Chapter 5). Vinay and Darbelnet (1995) define adaptation as translation in which the relationship between the receiver and the content, not the literal meaning, needs to be recreated. The literal translation of named entities such as Anthony Fauci or Dunkin' Donuts into German would keep them the same, even if the receiver has no familiarity with them. We can use this task to identify named entities (Kasai et al., 2019; Arora et al., 2019; Jain et al., 2019) and to understand other cultures through creating culturally-centered training data for qa (Katan and Taibi, 2004).

The five-fold task tuple (Section 2.1.1) for adaptation is:

1. real world problem: domain adaptation and communication
2. data: entities from a given culture
3. input/output: an entity from one culture and a corresponding entity belonging to another culture
4. evaluation: human assessment of relevance in their native culture
5. standard for progress: automated adaptations are understandable by relevant humans

1 Denis Peskov, Viktor Hangya, Jordan Boyd-Graber, Alexander Fraser. 2021. In Findings of Empirical Methods in Natural Language Processing, 2021. Peskov is responsible for selecting the entities, designing and running the human generation and the human evaluation, the WikiData work, and writing the paper.

We propose two computational methods to find such named entities across American and German culture. However, neither method can be evaluated without a gold standard, which is collected from human annotators. Annotation for the human method requires specialized knowledge: familiarity with German or American culture. We use experts for this task: domestically educated German and American citizens. Furthermore, evaluation requires knowledge of both cultures. We hire German translators to assess the computational and human-annotated candidates. This new task is a stepping-stone to automatically generating questions in languages other than English, the dominant language in the field, and to understanding the perspectives of other cultures (Section 6.6). As one application, health surveys have to be verified for accuracy in translation, as medical terminology or demographic questions may be incorrectly adapted for an audience (Ferrari et al., 2010; Lopes and Trelha, 2013).

This chapter explores the use of experts for an annotation task. Chapter 7 will use them for generation.

6.1 When Translation Misses the Mark

Imagine reading a translation from German: "I saw Merkel eating a Berliner from Dietsch on the ice." This sentence is opaque without cultural context. An extreme cultural adaptation for an American audience could render the sentence as "I saw Biden eating a Boston Cream from Dunkin' Donuts on the Acela", elucidating that Merkel is in a similar political post to Biden; that Dietsch (like Dunkin' Donuts) is a mid-range purveyor of baked goods; that both Berliners and Boston Creams are filled, sweet pastries named after a city; and that the ice and the Acela are slightly ritzier high-speed trains.2 Human translators make this adaptation when it is appropriate to the translation (Gengshen, 2003).

2 We color-code German and American entities throughout.

Bill Gates: Top Adaptations
WikiData:  F. Zeppelin, Günther Jauch, N. Harnoncourt
3CosAdd:   congstar, Alnatura, GMX
Human:     A. Bechtolsheim, Dietmar Hopp, Carl Benz

Table 6.1: WikiData and unsupervised embeddings (3CosAdd) generate adaptations of an entity, such as Bill Gates. Human adaptations are gathered for evaluation. American and German entities are color coded.

Because adaptation is understudied, we leave the full translation task, which requires generation, to future work (Section 6.6). Instead, we focus on the task of cultural adaptation, akin to annotation, of entities: given an entity in a source culture, what is the corresponding entity in English? Most Americans would not recognize Christian Drosten, but the most efficient explanation to an American would be to say that he is the "German Anthony Fauci" (Loh, 2020). We provide top adaptations suggested by algorithms and humans for another American involved with the pandemic response, Bill Gates, in Table 6.1. Can machines reliably find these analogs with minimal supervision? We generate these adaptations with structured knowledge bases (Section 6.3) and word embeddings (Section 6.4). We elicit human adaptations (Section 6.5.1) to evaluate whether our automatic adaptations are plausible.
Expert evaluation (Section 6.5.2) validates the merit of our verified annotators relative to computational methods (Section 6.5.3).

6.2 Wer ist Bill Gates?

You could formulate our task as a traditional analogy, Drosten::Germany as Fauci::United States (Turney, 2008; Gladkova et al., 2016), but despite this superficial resemblance (explored in Section 6.4), traditional approaches to analogy ignore the influence of culture and typically stay within a single language. Analogies are tightly bound with culture; humans struggle with analogies outside their own culture (Freedle, 2003).

Machine translation is another similar task that usually translates words literally; however, this does not necessarily apply in a cultural context, as certain named entities may be relevant in one culture but not another. Statistical machine translation (Koehn, 2009) retains an explicit connection between words in the target and source language. In contrast, neural machine translation (Kalchbrenner and Blunsom, 2013) learns a representation of the source in lieu of preserving the original words or phrases. French-English text from the Canadian parliament could be used to train more flexible models than previously possible (Berger et al., 1994). Literature and movie captions (Varga et al., 2007), librettos (Dürr, 2005), medical information (Deléger et al., 2009), and the Internet (Resnik and Smith, 2003; Smith et al., 2013) can all be sources of parallel data for machine translation.

Creating questions in languages other than English is a current research direction. mlqa and xquad automatically generate paired questions through machine translation (Lewis et al., 2019; Artetxe et al., 2019). As an alternative, TyDi (Clark et al., 2020b) gives crowd-sourced users prompts from Wikipedia articles to create questions in a wide range of languages. In all of this work, the goal is to preserve the literal meaning of the source as accurately as possible. We propose to adapt the meaning to identify new entities and ultimately create new questions.

6.2.1 . . . and why Bill Gates?

This task requires a list of named entities adaptable to other cultures. Our entities come from two sources: a subset of the top 500 most visited German/English Wikipedia pages and the non-official characterization list (Veale, 2016, noc), "a source of stereotypical knowledge regarding popular culture, famous people (real and fictional) and their trade-mark qualities, behaviours and settings". Wikipedia contains a plethora of singers and actors; we filter the top 500 pages to avoid a pop-culture skew.3 We additionally select all Germans and a subset of Americans from the Veale noc list, as it is human-curated, verified, and covers a broader historical period than popular Wikipedia pages. Like other semantic relationships (Boyd-Graber et al., 2006), adaptation is not symmetric. Thus, we adapt entities in both directions; while Berlin is the German Washington, DC, there is less consensus on what the American Berlin is, as Berlin is at once the capital, a tech hub, and a film hub. A full list of our entities is provided in Appendix A.2.

3 We discuss the applicability of using Wikipedia (i.e., what proportion of the English Wikipedia is visited from the United States) in Appendix A.1.

6.3 Adaptation from a Knowledge Base

We first adapt entities with a knowledge base. We use WikiData (Vrandečić
and Krötzsch, 2014), a structured, human-annotated representation of Wikipedia entities that is actively developed.4 This resource is well-suited to the task as features are standardized both within and across languages. Many knowledge bases explicitly encode the nationality of individuals, places, and creative works. Entities in the knowledge base are discrete sparse vectors, where most dimensions are unknown or not applicable (e.g., a building does not have a spouse). For example, Angela Merkel is a human (instance of), German (country of citizenship), politician (occupation), Rotarian (member of), Lutheran (religion), 1.65 meters tall (height), and has a PhD (academic degree). How would we find the "most similar" American adaptation to Angela Merkel? Intuitively, we should find someone whose nationality is American.

Some issues immediately present themselves; contemporary entities will have more non-zero entries than older entities. Some characteristics are more important than others: matching unique attributes like "worked as journalist" is more important than matching "is human".

Each entity in WikiData has "properties", which we can think of as the dimensions of a sparse vector, and "values" that those properties can take on. For example, Merkel has the properties "occupation" and "academic degree". Values for those properties are that her "occupation" is "politician" and her "academic degree" is a "doctorate". To match entities across cultures, we focus on matching properties rather than values; many of the values are only relevant inside a culture. We cannot find American politicians who belong to the Christian Democratic Union, but we can find politicians who have an academic degree and a dissertation title. As a toy example, if Beethoven, Merkel, and Bach each have only two properties (Beethoven has an "occupation" and a "genre", Merkel has an "Erdős number" and a "political party", and Bach has an "occupation" and a "genre"), then Beethoven and Bach have a distance of zero from one another and are the closest entities, while Merkel has a distance of two, since {"Erdős number", "political party"} is two away from {"occupation", "genre"}.

First, we bifurcate WikiData into two sets: an American set A for items which contain the value "United States of America" and a German set D for those with German values.5 This is a liberal approximation, but it successfully excludes roughly seven out of the eight million items in WikiData. Then we explore the properties from WikiData. We create entity vectors with dimensions corresponding to frequently-occurring properties. The properties are discrete and categorical; Merkel either has an "occupation" or she does not. Each entity then has a sparse vector. We calculate the similarity of the vectors with Faiss's L2 distance (Johnson et al., 2017) and, for each vector in A, find the closest vector in D and vice versa.

4 We focus on named entities as they are more culturally centered and often have clear location attributes such as place of birth. However, general entities such as "doughnut" and "Berliner" have pages that could be compared.
5 While the geopolitical definition of American is straightforward, the German nation state is more nuanced (Schulze, 1991). Following Green (2003), we adopt members of the Zollverein or the German Confederation as "German", as well as their predecessor and successor states. This approach is a more inclusive (Großdeutschland) definition of "German" culture.
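A minimal sketch of this property-matching step follows. The property sets are toy examples (the American candidates here are chosen only for illustration); the full pipeline builds sparse vectors over frequent properties for millions of entities and uses Faiss rather than the brute-force search shown below.

```python
import numpy as np

PROPERTIES = ["occupation", "genre", "political party",
              "academic degree", "Erdős number", "member of"]


def to_vector(entity_properties):
    return np.array([float(p in entity_properties) for p in PROPERTIES])


german_entity = {"political party", "academic degree", "member of"}       # e.g. Merkel
american = {"Woodrow Wilson": {"political party", "academic degree"},
            "Leonard Bernstein": {"occupation", "genre", "member of"}}


def closest(query_properties, candidates):
    query = to_vector(query_properties)
    return min(candidates,
               key=lambda name: np.linalg.norm(query - to_vector(candidates[name])))


print(closest(german_entity, american))  # -> "Woodrow Wilson"
```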
So who is the American Angela Merkel? One possible answer is Woodrow Wilson, a member of a "political party", who had a "doctoral advisor" and a "religion", and ended up with "awards". This answer may be unsatisfying, as it was Barack Obama who sat across from Merkel for nearly a decade. To capture these more nuanced similarities, we turn to large text corpora in Section 6.4.

6.4 An Alternate Embedding Approach

While the classic nlp vector example (Mikolov et al., 2013d) isn't as magical as initially claimed (Rogers et al., 2017), it provides useful intuition. We can use the intuition of the cliché

\vec{\text{King}} - \vec{\text{Man}} + \vec{\text{Woman}} = \vec{\text{Queen}} \quad (6.1)

to adapt between languages. This, however, requires relevant embeddings. First, we use the entire Wikipedia in English and German, preprocessed using Moses (Koehn et al., 2007). We follow Mikolov et al. (2013c) and use named entity recognition (Honnibal et al., 2020) to tokenize entities such as Barack_Obama. We use word2vec (Mikolov et al., 2013c), rather than FastText (Bojanowski et al., 2017), as we do not want orthography to influence the similarity of entities. Angela Merkel in English and in German have quite different neighbors, and we intend to keep it that way by preserving the distinction between languages.

However, the standard word2vec model assumes a single monolingual embedding space. We use unsupervised Vecmap (Artetxe et al., 2018), a leading tool for creating cross-lingual word embeddings, to build bilingual word embeddings. We propose two approaches for adaptation.

3CosAdd: We follow the word analogy approach of 3CosAdd (Levy and Goldberg, 2014; Köper et al., 2016).6 American-to-German adaptation takes the source entity's (v) embedding in the English vector space and looks for its adaptation (u*) based on embeddings in the German space. This is like the word analogy task, i.e., what entity has the role in German culture that v does in American culture. As an example, Merkel has a similar role in German culture as Biden has in American culture. Formally, the adaptation of the English entity v into German is

\vec{a} \leftarrow \mathrm{avg}\big(E^{en}_{\text{United\_States}},\, E^{de}_{\text{USA}}\big) \quad (6.2)
\vec{d} \leftarrow \mathrm{avg}\big(E^{en}_{\text{Germany}},\, E^{de}_{\text{Deutschland}}\big) \quad (6.3)
u^{*} = \underset{u \in V^{de}}{\mathrm{argmax}}\; \mathrm{sim}\big(E^{de}_{u},\, E^{en}_{v} - \vec{a} + \vec{d}\big), \quad (6.4)

where E^{l}_{w} is the embedding of word w in language l, V^{de} is the German vocabulary, and sim is the cosine similarity. The American anchor word \vec{a} and the German anchor \vec{d} represent the American and German cultures.7 We average the English and German embeddings of the individual word types for robust anchor vectors. In standard analogies, as in Equation 6.1, the \vec{a} and \vec{d} vectors are different for each test pair; here they are the same for each example, as we are always pivoting between the two cultures.

6 We experiment with 3CosMul as well but found 3CosAdd generally more robust.
7 Der Spiegel, the largest newspaper, and other prominent media sources call their United States sections usa.
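A minimal sketch of 3CosAdd adaptation over aligned embedding spaces follows. The dictionaries stand in for the Vecmap-aligned English and German spaces, and the anchors are simplified to single vectors rather than the averaged English/German forms of Equations 6.2 and 6.3.

```python
import numpy as np


def normalize(v):
    return v / np.linalg.norm(v)


def adapt_3cosadd(entity, english, german,
                  en_anchor="United_States", de_anchor="Deutschland"):
    """Rank German entities by cos(E_de[u], E_en[entity] - a + d)."""
    a = english[en_anchor]   # American culture anchor (cf. Eq. 6.2, simplified)
    d = german[de_anchor]    # German culture anchor (cf. Eq. 6.3, simplified)
    query = normalize(english[entity] - a + d)
    scores = {u: float(normalize(vec) @ query) for u, vec in german.items()}
    return sorted(scores, key=scores.get, reverse=True)

# `english` and `german` map entity strings to aligned vectors, e.g.
# adapt_3cosadd("Bill_Gates", english, german)[:5] would give candidates
# comparable to the 3CosAdd column of Table 6.1.
```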
Learned adaptation: To eliminate the need for manual anchor selection for both cultures, our second approach learns the adaptation as a linear transformation of source embeddings to the target culture, given a few adaptation examples. Specifically, we use the human adaptations sourced for the Wikipedia entities as training data for the Veale noc ones. We follow the work of Mikolov et al. (2013a) and learn a transformation matrix W_{en \to de} for American-to-German adaptation by minimizing the L2 distance between W_{en \to de} E^{en}_{v_i} and E^{de}_{u_i} over gold adaptation entity pairs \{(v_i, u_i)\}^{n}_{i=1}. The adaptation of a source entity v is u^{*} = W_{en \to de} E^{en}_{v}. Likewise, we learn the reverse mapping W_{de \to en} for German-to-American adaptation. This requires supervised training data, but not much (Conneau et al., 2017), since there are no existing gold labels for these adaptations to serve as an oracle. We collect this data from appropriately qualified experts in Section 6.5.

6.5 Comparing Automation to Human Judgment

The computational methods can generate entities at scale, but humans have to evaluate their relevance.

6.5.1 Adaptation by Locals

Since quality control is difficult for generation and complicated annotation (Karpinska et al., 2021), we need users who will complete the task accurately. We recruit five American citizens educated at American universities and five German citizens educated at German ones through personal educational networks. They have familiarity with the popular named entities in their own culture, which is the necessary expertise for adaptation. These human annotations serve as a gold standard against which we can compare our automatic approaches. Chapter 5 showed that human output was superior to the automatic approaches for the notably more straightforward task of question rewriting.

To improve the user experience, we create an interface (Figure 6.1) that provides a brief summary of each source entity from Wikipedia and asks the users to select a target adaptation in a text box that autocompletes Wikipedia page titles (all entities; targets are not limited to the lists in Section 8.4.2), a la answer selection in Wallace et al. (2019b). We provide a thought-through example of possible adaptations for Angela Merkel in our instructions and encourage a holistic approach to the task. The annotation task requires two hours for our users to complete. Each annotation is independent, allowing users to return to the task at their convenience over a span of two weeks. Participants completed the task on a volunteer basis. Obviously, German annotators are more familiar with German culture than the Americans, and vice-versa. Annotators translate into their native language. Since we are focusing on popular entities, they are often known despite the cultural divide, but the introductory paragraph from Wikipedia reminds them if not.

Figure 6.1: Our interface provides users with information about the entity and asks them to select an option from possible Wikipedia pages.

Figure 6.2: Our Qualtrics survey.

6.5.2 Are the Adaptations Plausible?

To validate and compare the precision of all our adaptation strategies, five German translators who understand American culture, appropriately qualified experts for the evaluation, assess the adaptations.8 The top five adaptations from WikiData, 3CosAdd, learned adaptation, and humans, as well as five randomly selected options from the human pool, are evaluated for plausibility on a five-level Likert scale.9 We provide instructions and examples for using the Likert scale and provide users with a free-response box to escalate concerns. Fleiss' Kappa (0.382) assesses interannotator agreement; this "fair" agreement suggests that vetting an adaptation is challenging and sometimes subjective, even for translators.

6.5.3 Why Adaptation is Difficult

Embedding adaptations are better than WikiData's, and human adaptations are better still (Figure 6.3). Thus, we use human adaptations as the gold standard for evaluating recall.
Only the learned embedding method uses training data, so we use human adaptations from Wikipedia to train the projection matrix and evaluate (for all methods) using human adaptations from the noc list. Given that the task is subjective, we take our results with a grain of salt given cultural variation (e.g., some people view Angela Merkel's conservatism as a defining characteristic, while others focus on her science pedigree). The adaptations come from verified citizens of the respective countries, which is the appropriate level of expertise for this task. Anonymous crowd annotation would create unexpected familiarity biases: all politicians could be reduced to Angela Merkel and all companies could be reduced to Mercedes-Benz, since there is no obvious mechanism to encourage a great rather than a merely good annotation.

8 Recruited through Upwork for $40 each.
9 Our custom Qualtrics survey is provided in Figure 6.2. The order of adaptations is randomized and assessed on a Likert scale with anchors from Jurgens et al. (2014).

We use the mean reciprocal rank (Voorhees et al., 1999, mrr) to measure how high the gold adaptations are ranked by our other adaptation strategies. Since mrr decreases geometrically and our gold standard is not exhaustive, the Recall@5 and Recall@100 metrics are more intuitive. We calculate Recall@n by measuring what fraction of the correct adaptations of a source entity is retrieved in the top n predictions.10 Table 6.2 validates that the human annotations are near the top of the automatic adaptations; the precision-oriented evaluation (Figure 6.3) validates whether the top of the list is reasonable. All human annotations and a sample of the automatic adaptations are provided in Appendix A.2.

10 This is often referred to as P@n in the bilingual lexicon induction literature (Conneau et al., 2017).

Figure 6.3: We validate adaptation strategies with expert translators on a five-point Likert scale (average rating). The human-generated adaptations are rated best, between "related" (3) and "similar" (4). These human adaptations become the reference for evaluation in Table 6.2.

Data (American to German)   Metric    WikiData   3CosAdd   Learned
Wikipedia                   Rec@5     7.5%       14.2%     -
                            Rec@100   34.4%      52.8%     -
                            mrr       0.05       0.10      -
Veale noc                   Rec@5     3.0%       22.9%     28.6%
                            Rec@100   42.4%      51.4%     45.7%
                            mrr       0.03       0.17      0.24

Data (German to American)   Metric    WikiData   3CosAdd   Learned
Wikipedia                   Rec@5     3.1%       17.2%     -
                            Rec@100   15.4%      40.5%     -
                            mrr       0.01       0.12      -
Veale noc                   Rec@5     0.0%       25.0%     25.0%
                            Rec@100   25.0%      70.0%     55.0%
                            mrr       0.02       0.12      0.15

Table 6.2: If we consider human adaptations as correct, where do they land in the ranking of automatic adaptation candidates? In this recall-oriented approach, learned mappings (which use a small number of training pairs) rate highest.

6.5.4 Qualitative Analysis

There is no single answer to what makes a good adaptation. Let us return to the question of who Bill Gates is, which underlines how there is often no one right answer but several context-specific possibilities. The human adaptations show the range of plausible adaptations, each appropriate for a particular facet of the position Bill Gates has in us society. As previously mentioned, Carl Benz represents a larger-than-life founder who created an entire industry with his company. However, Carl Benz made cars, not computers. Even within technology, different adaptations highlight different aspects of Bill Gates.
Like the implementer of the basic programming language, Konrad Zuse contributed to computers that were more than single-purpose machines. Just as Bill Gates's Microsoft is seen as a stodgy tech giant, Dietmar Hopp founded sap, a giant German tech company that is more often discussed in board rooms than in living rooms. And because the epicenter of modern tech is America's West Coast, Andreas von Bechtolsheim represents a German founder of Sun Microsystems and early Google investor who made his way to Silicon Valley.
Other times, there is more consensus: a majority of raters declare Angela Merkel is the German Hillary Clinton, and Joseph Smith is the American Martin Luther. There are even some unanimous adaptations: Bavaria is the German California. Adaptations of fictional characters seem particularly difficult, although this may represent the supremacy of American popular culture; Superman and Homer Simpson are so well known in Germany that there are no clear adaptations; Till Eulenspiegel, Maverick, and Bibi Blocksberg are not superheroes from a dying world, and Heidi is not a dumb, bald everyman.
We evaluate the translator evaluations as well. The assessments are made in good faith. Karl Denke, a serial killer and a random control for Abraham Lincoln, is rated as "unrelated" (1) by all annotators. The translators generally agree in the direction of the rating even if the exact rating varies: Bismarck for Abraham Lincoln is correctly rated as either a four or a five by all annotators; both are historically prominent 19th-century politicians responsible for a military unification. However, there are certain differences, such as Abraham Lincoln being heavily associated with his assassination. The overall average is brought down by adaptations such as Napoleon for Abraham Lincoln being evaluated as "unrelated" (1) due to not being a German adaptation, even if the adaptation is otherwise reasonable, which makes a particularly large difference for the human annotations.

6.6 Generating New Questions
These results ultimately bring us back to the motivation: can our methodology be used to generate questions in a new language? We discuss a hypothetical pipeline for doing this and provide an example. This will require a combination of machine translation and adaptation. First, relevant named entities in a sentence must be identified with a named entity recognition tool, such as spaCy (Honnibal and Johnson, 2015). Second, these named entities must be translated into the target culture. This poses a research challenge, since multiple named entities must be translated jointly. Last, the entire sentence must be translated fluently into the target language. This pipeline is illustrated with an example in Table 6.3.

Table 6.3: A hypothetical qa pipeline that adapts a question.
  Source:      What is the longest river in the United States? Mississippi
  Detection:   What is the longest river in the United States? Mississippi
  Adaptation:  What is the longest river in the Germany? Rhine
  Target:      Welches ist der längste Fluss in Deutschland? Rhein

We do not expect that most of our generated questions will make sense; turning "When did Abraham Lincoln make his Emancipation Proclamation" into "When did Friedrich Ebert make the Edict of Potsdam" is nonsensical.[11] One solution is to decouple the questions and answers. This practice has been recently implemented by Clark et al.
(2020b), in which they ask one batch of participants what they are curious about, without any grounding motivation. Then they collect the answers, if they are available, from Wikipedia. This type of approach would allow either half of question answering to be independently adapted. A more complicated approach would require a joint model that is confident in having identified all entities in a sentence, and in coherently adapting them together.
[11] Ebert lived in the twentieth century and could not have authored a seventeenth-century edict on religious liberty.

6.6.1 Adaptation is not Trivial
When creating new questions, correct adaptation must navigate complicated political and ideological barriers. Comparing someone to Napoleon may have completely different connotations in France and in Italy and could cause a political snafu. An incorrect religious comparison could have even higher stakes. Additionally, adaptation may introduce new ideas; in Alice in Wonderland, a character develops the characteristic of being sleepy when the name is adapted into Portuguese (Carroll and Amorim, 2003). Certain adaptations have been made nefariously in the interest of censorship (Tymoczko, 2006). This makes adaptation a useful tool for exploring the abstract idea of culture.

6.7 A New Computational Task
We formally introduce entity adaptation as a new computational task and show why experts are needed for any subjective task. Word2vec embeddings and WikiData can be used to figuratively, not just literally, translate entities into a different culture. Humans are better at generating candidates for this task than our computational methods (Figure 6.3). These methods are well motivated but have room for improvement. Knowledge bases improve over time, and increased coverage of entities, as well as improved information about each entity, would improve the method. Alternate word embedding approaches, perhaps those that discard orthography, may provide better candidates. Even humans occasionally disagree with other humans on this task, so evaluation for this task is nontrivial. Since entities have multiple valid adaptations, one cannot exclude adaptations as invalid merely because they differ from those proposed by other annotators. Hence, excluding an improperly motivated or improperly qualified annotator is more important than excluding annotations after the fact.
People need questions and answers that reflect their language and culture, but datasets are lacking: adaptation can help. There has been an explosion of English-language qa datasets, but other languages continue to lag behind. Several approaches try to transfer English's bounty to other languages (Lewis et al., 2019; Artetxe et al., 2019), but most of the entities asked about in major qa datasets are American (Gor et al., 2021b). Adapting entire questions will require not just adapting entities and non-entities in tandem but will also require integration with machine translation (Kim et al., 2019; Hangya and Fraser, 2019). High-quality adaptation is paramount to make the questions interesting if they occur in a trivia context and pertinent if they occur in an educational context.
Human input, either as generators or evaluators of the adaptations, is required at this stage for adaptations to be reliable. Our automatic methods did not create precise adaptations, but the alternative "incorrect" adaptations may be useful for low-precision tasks, such as generating numerous simple open-ended questions or gauging the popularity of an entity.
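As a rough illustration of how the Section 6.6 pipeline could be assembled for such low-precision question generation, the sketch below chains entity detection, an adaptation lookup, and machine translation. The adaptation table and the translation function are placeholders, not components released with this work, and the naive string replacement is only for illustration.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def adapt_question(question, adaptation_lookup, translate):
    """Detect entities, swap in target-culture adaptations, then translate.

    adaptation_lookup: dict from source entity text to a target-culture entity,
    e.g. {"the United States": "Germany", "Mississippi": "Rhine"} (hypothetical).
    translate: any machine-translation callable; a placeholder here.
    """
    doc = nlp(question)
    adapted = question
    for ent in doc.ents:                       # step 1: named entity detection
        if ent.text in adaptation_lookup:      # step 2: cultural adaptation
            adapted = adapted.replace(ent.text, adaptation_lookup[ent.text])
    return translate(adapted)                  # step 3: fluent translation
```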
Additionally, our new dataset of human adaptations and human evaluation of these adaptations can serve as an evaluation metric for future automatic methods. Given the existence of robust datasets in high-resource languages, can we adapt, rather than literally translate, them to other cultures and languages? This task is not possible without expert annotation. However, we do not generate full translations in this task. We do not observe malicious or careless answers from our annotators or evaluators. Hence, we extend the use of experts to a task in which quality assurance is nearly impossible: dialog generation in Chapter 7.

Chapter 7: Expert Generation[1]

Experts can generate datasets of a quality unachievable by the crowd by providing reliable and specialized expertise. First, working with experts usually involves verifying their identity and creating an ongoing relationship, often in the form of a contract. This relationship enables tasks requiring a long-term commitment from a user; a pseudo-anonymous crowd user being paid to complete an independent task does not have strong motivation to consistently repeat the one-time task. Second, specialized knowledge may be needed for certain tasks; a larger number of incorrect x-ray annotations would not be preferable to a smaller number of correct ones from a radiologist. Additionally, the accuracy, rather than the size, of the data allows the dataset to withstand the test of time.[2] This justifies the large investment of time, relationship-building, and money necessary to use experts.
[1] Denis Peskov, Benny Chang, Ahmed Elgohary, Joe Barrow, Cristian Danescu-Niculescu-Mizil, and Jordan Boyd-Graber. 2020. It Takes Two to Lie: One to Lie and One to Listen. In Proceedings of The Association for Computational Linguistics. Peskov is responsible for designing the task, gathering the participants, running the games, building half the models, part of the data analysis, the visualizations, and the paper writing.
[2] The Penn Treebank (Marcus et al., 1993), which used graduate students in linguistics and spanned three years in the early 1990s, remained influential for years and is referenced in Computational Linguistics courses today.
We create a deception dataset using experts, as a contrast to the earlier crowd-sourced canard dataset (Chapter 5). Participants, who are engaged in the task and are appropriately compensated, both generate and annotate data in the span of a game that usually lasts over a month. The annotation is more complicated than in our adaptation dataset (Chapter 6) due to being real-time and user-specific. The resulting product is a gold standard of conversational nlp data in quality of language, diversity, and naturalness.
The conversations and annotations thereof would not be possible without experts familiar with the game. Deception is an art, rather than a science (Bavelas et al., 1990; Bell and DePaulo, 1996), and, like adaptation (Chapter 6), a subjective task. We recruit top players from the competitive Diplomacy community (Hill, 2014; Chiodini, 2020) and compensate them appropriately for their effort.

7.1 Where Does One Find Long-Term Deception?
A functioning society is impossible without trust. In online text interactions, users are typically trusting (Shneiderman, 2000), but this trust can be betrayed through false identities on dating sites (Toma and Hancock, 2012), spearphishing attacks (Dhamija et al., 2006), sockpuppetry (Kumar et al., 2017) and, more broadly, disinformation campaigns (Kumar and Shah, 2018).
Beyond such one-off antisocial acts directed at strangers, deception can also occur in sustained relationships, where it can be strategically combined with truthfulness to advance a long-term objective (Cornwell and Lundgren, 2001; Kaplar and Gordon, 2004). We introduce a dataset to study the strategic use of deception in long-lasting relationships. We define the task (Section 2.1.1) of deception detection as:
1. real world problem: deception (an extension of dialog, Section 2.1.3)
2. data: raw text that contains elements of deception
3. input/output: input of free-form text and output of a binary decision on its deceptiveness
4. evaluation: accuracy of the above decision, relative to humans
5. standard for progress: reaching parity with human detection of deception

Table 7.1: An annotated conversation between Italy (white) and Germany (gray) at a moment when their relationship breaks down. Each message is annotated by the sender (and receiver) with its intended or perceived truthfulness; Italy is lying about ... lying.
  "If I were lying to you, I'd smile and say 'that sounds great.' I'm honest with you because I sincerely thought of us as partners. You agreed to warn me of unexpected moves, then didn't ..." -- sender: Lie; receiver: Truth
  "You've revealed things to England without my permission, and then made up a story about it after the fact! ..." -- sender: Truth; receiver: Truth
  "I have a reputation in this hobby for being sincere. Not being duplicitous. It has always served me well. ... If you don't want to work with me, then I can understand that ..." -- sender: Lie; receiver: Truth
  (Germany attacks Italy) "Well this game just got less fun" -- sender: Truth; receiver: Truth
  "For you, maybe" -- sender: Truth; receiver: Truth

To collect reliable ground truth in this complex scenario, we design an interface for players to naturally generate and annotate conversational data while playing a negotiation-based game called Diplomacy. These annotations are done in real time as the players send and receive messages. While this game setup might not directly translate to real-world situations, it enables computational frameworks for studying deception in a complex social context while avoiding privacy issues.
After providing background on the game of Diplomacy and our intended deception annotations (Section 7.2), we discuss our study (Section 7.4). To probe the value of the resulting dataset, we develop lie prediction models (Section 7.5) and analyze their results (Section 7.6). The role of the expert is paramount (Section 2.2.4).

7.2 Diplomacy
The Diplomacy board game places a player in the role of one of seven European powers on the eve of World War I. The goal is to conquer a simplified map of Europe by ordering armies in the field against rivals. Victory points determine the success of a player and allow them to build additional armies; the player who can gain and maintain the highest number of points wins.[3] The mechanics of the game are simple and deterministic: armies, represented as figures on a given territory, can only move to adjacent spots, and the side with the most armies always wins in a disputed move. The game movements become publicly available to all players after the end of a turn.
Because the game is deterministic and everyone begins with an equal number of armies, a player cannot win the game without forming alliances with other players, hence the name of the game: Diplomacy. Conquering neighboring territories depends on support from another player's armies.
After an alliance has outlived its usefulness, a player often dramatically breaks it to take advantage of their erstwhile ally's vulnerability. Table 7.1 shows the end of one such relationship. As in real life, to succeed a betrayal must be a surprise to the victim. Thus, players pride themselves on being able to lie and detect lies. Our study uses their skill and passion to build a dataset of deception created by battle-hardened diplomats. Senders annotate whether each message they write is an actual lie and recipients annotate whether each message received is a suspected lie. Further details on the annotation process are in Section 7.4.1.
[3] In the parlance of Diplomacy games, points are "supply centers" in specific territories (e.g., London). Having more supply centers allows a player to build more armies and win the game by capturing more than half of the 34 supply centers on the board.

7.2.1 A game walk-through
Figure 7.1 shows the raw counts of one game in our dataset. But numbers do not tell the whole story. We analyze this case study using rhetorical tactics (Cialdini and Goldstein, 2004), which Oliveira et al. (2017) use to dissect spear phishing e-mails and Anand et al. (2011) apply to persuasive blogs. Mentions of tactics are in italic (e.g., authority). For the rest of the paper, we will refer to players via the name of their assigned country.
[Figure 7.1: Counts from one game featuring an Italy adept at lying but who does not fall for others' lies; panels show Deceived their Victim, Victim Caught a Lie, Victim Fell for Lie, and Victory Points per turn (years '01-'10), with one line per country. The player's successful lies allow them to gain an advantage in points over the duration of the game. In 1906, Italy lies to England before breaking their relationship. In 1907, Italy lies to everybody else about wanting to agree to a draw, leading to the large spike in successful lies.]
Through two lie-intense strategies, convincing England to betray Germany and convincing all remaining countries to agree to a draw, Italy gains control of the board. Italy's first deception is a plan with Austria to dismantle Turkey. Turkey believes Italy's initial assurance of non-aggression in 1901. Italy begins by excusing his initial silence due to a rough day at work, evoking empathy and likability. While they do not fall for subsequent lies, Turkey's initial gullibility cements Italy's first-strike advantage. Meanwhile, Italy proposes a long-term alliance with England against France, packaging several small truths with a big lie. The strategy succeeds, eliminating Italy's greatest threat.
Local threats eliminated, Italy turns to rivals on the other end of the map. Italy persuades England to double-cross its long-time ally Germany in a moment of scarcity: if you do not act now, there will be nowhere to expand. England accepts help from ascendant Italy, expecting reciprocity. However, Italy aggressively and successfully moves against England. The last year features a meta-game deception. After Italy becomes too powerful to contain, the remaining four players team up. Ingeniously, Italy feigns acquiescence to a five-way draw, individually lying to each player and establishing authority while brokering the deal. Despite Italy's record of deception, the other players believe the proposal (annotating received messages from Italy as truthful) and expect a 1907 endgame, the year with the most lies.
Italy goes on the offensive and knocks out Austria.
Each game has relationships that are forged and then riven. In another game, an honest attempt by a strong Austria to woo an ascendant Germany backfires, knocking Austria from the game. Germany builds trust with Austria through a believed fictional experience as a Boy Scout in Maine (likability). In a third game, two consecutive unfulfilled promises by an ambitious Russia lead to a quick demise, as their subsequent excuses and apologies are perceived as lies (failed consistency). In another game, England, France, and Russia simultaneously attack Germany after offering duplicitous assurances. Game outcomes vary despite the identical, balanced starting board, as different players use unique strategies to persuade, and occasionally deceive, their opponents.

7.2.2 Defining a lie
Statements can be incorrect for a host of reasons: ignorance, misunderstanding, omission, exaggeration. Gokhman et al. (2012) highlight the difficulty of finding willful, honest, and skilled deception outside of short-term, artificial contexts (DePaulo et al., 2003). Crowdsourced and automatic datasets rely on simple negations (Pérez-Rosas et al., 2017) or completely implausible claims (e.g., "Tipper Gore was created in 1048" from Thorne et al. (2018b)). While lawyers in depositions and users of dating sites will not willingly admit to their lies, the players of online games are more willing to revel in their deception.
We must first define what we mean by deception. Lying is a mischaracterization; it is thus no surprise that a definition may be divisive or the subject of academic debate (Gettier, 1963). We provide this definition to our users: "Typically, when [someone] lies [they] say what [they] know to be false in an attempt to deceive the listener" (Siegler, 1966). An orthodox definition requires the speaker to utter an explicit falsehood (Mahon, 2016); skilled liars can deceive with a patina of veracity. A similar definition is required for the prosecution of perjury, leading to a paucity of convictions (Bogner et al., 1974). Indeed, when we ask participants what a lie looks like, they mention evasiveness, shorter messages, over-qualification, and creating false hypothetical scenarios (DePaulo et al., 2003).

7.2.3 Annotating truthfulness
Previous work on the language of Diplomacy (Niculae et al., 2015) lacks access to players' internal state and was limited to post-hoc analysis. We improve on this by designing our own interface that gathers players' intentions and perceptions in real time (Section 7.4.1). As with other highly subjective phenomena like sarcasm (González-Ibáñez et al., 2011; Bamman and Smith, 2015), sentiment (Pang et al., 2008), and framing (Greene and Resnik, 2009), the intention to deceive is reflective of someone's internal state. Having individuals provide their own labels for their internal state is essential, as third-party annotators could not accurately access it (Chang et al., 2020).
[Figure 7.2: Every time they send a message, players say whether the message is truthful or intended to deceive. The receiver then labels whether incoming messages are a lie or not. Here Italy indicates they believe a message from England is truthful but that their reply is not.]
Most importantly, our gracious players have allowed this language data to be released in accordance with irb-authorized anonymization, encouraging further work on the strategic use of deception in long-lasting relations.[4]
[4] Data available at http://go.umd.edu/diplomacy_data and as part of ConvoKit http://convokit.cornell.edu.

7.3 Broader Applicability
This differs from previous work that does not follow the expert-generated paradigm. The most prominent past work on Diplomacy in the nlp community, Niculae et al. (2015), found (Chapter 3) their data and thus could not release it to the public. This hampers follow-up applications of the research; a believable Diplomacy-playing (and speaking) bot cannot be trained if the raw language data is redacted and shuffled. We believe this work can set a paradigm for work outside of Diplomacy, and even nlp; the interface created for this project, as well as the pre- and post-game user surveys, can be modified for any conversational task (Chapter 8). Most importantly, building a relationship with data generators elevates the standard of the data and guarantees its liberal distribution. This mirrors the relationship with adaptation annotators (Chapter 6). Further work is necessary in codifying data standards: Show Your Data, not only your Work (Dodge et al., 2019).

7.4 Engaging a Community of Liars
This dataset requires both a social and technical setup: finding an online Diplomacy community and creating a framework for annotating messages between players.

7.4.1 Seamless Diplomacy Data Generation
We need two technical components for our study: a game engine and a chat system. We choose Backstabbr as an accessible game engine on desktop and mobile platforms: players input their moves and the site adjudicates game mechanics (Chiodini, 2020).[5] Our communication framework is atypical. Thus, we create a server on Discord, the group messaging platform most used for online gaming and by the online Diplomacy community (Coberly, 2019).[6] The app is reliable on both desktop and mobile devices, free, and does not limit access to messages. Instead of direct communication, players communicate with a bot; the bot does not forward messages to the recipient until the player annotates the messages (Figure 7.2). In addition, the bot scrapes the game state from Backstabbr to sync game and language data.
[5] https://www.backstabbr.com
[6] https://www.discord.com
Annotation of lies is a forced binary choice in our experiment. We follow previous work that views linguistic deception as binary (Buller et al., 1996; Braun and Van Swol, 2016). However, explicitly calling a statement a lie is difficult, and people would prefer degrees of deception (Bavelas et al., 1990; Bell and DePaulo, 1996). Some studies make a more fine-grained distinction; for example, Swol et al. (2012) separate strategic omissions from blatant lies (we consider both deception). But, because we are asking the speakers themselves (and not trained annotators) to make the decision, we follow the advice from crowdsourcing to simplify the task as much as possible (Snow et al., 2008; Sabou et al., 2014). Long messages can contain both truths and lies, and we ask players to categorize these as lies, since the truth can be a shroud for their aims.
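The annotate-then-forward gating at the heart of this setup is simple to express, even though the real system also involves a Discord bot and Backstabbr scraping. Below is a minimal, framework-agnostic sketch; the class and field names are hypothetical, not the released implementation.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class GatedChannel:
    """Holds each outgoing message until the sender marks it truth or lie,
    then asks the recipient for a perceived label on delivery."""
    pending: deque = field(default_factory=deque)
    log: list = field(default_factory=list)

    def send(self, sender, recipient, text):
        # Message is queued, not yet delivered.
        self.pending.append({"sender": sender, "recipient": recipient, "text": text})

    def annotate_and_forward(self, actual_lie: bool):
        # Sender labels intent; only then is the message released to the recipient.
        msg = self.pending.popleft()
        msg["sender_annotation"] = "lie" if actual_lie else "truth"
        return msg

    def receive(self, msg, suspected_lie: bool):
        # Recipient labels perception on arrival; the record is then logged.
        msg["receiver_annotation"] = "lie" if suspected_lie else "truth"
        self.log.append(msg)
```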
7.4.2 Building a player base
The Diplomacy players maintain an active, vibrant community through real-life meetups and online play (Hill, 2014; Chiodini, 2020). We recruit top players alongside inexperienced but committed players in the interest of having a diverse pool.[7] Our experiments include top-ranked players and community leaders from online platforms, seasoned in-person tournament players with over 100 past games, and board game aficionados. These players serve as our foundation: during the initial design they helped us create a minimally annoying interface and a definition of a lie that would be consistent with Diplomacy play. Good players, as determined by active participation, annotation, and game outcome, are asked to play in future games.
[7] We recruit players from Diplomacy community forums, in-person tournaments, and board game clubs. We ask players if they are familiar with the rules of Diplomacy but do not have exclusionary qualification requirements; however, players that are not appropriately engaged are not invited to play further games, which happened in only a handful of cases.
In traditional crowdsourcing tasks, compensation is tied to piecework that takes seconds to complete (Buhrmester et al., 2011). Diplomacy games are different in that they can last a month... and people already play the game for free. Thus, we do not want compensation to interfere with what these players already do well: lying. Even the obituary of the game's inventor explains that Diplomacy "rewards all manner of mendacity: spying, lying, bribery, rumor mongering, psychological manipulation, outright intimidation, betrayal, vengeance and backstabbing (the use of actual cutlery is discouraged)" (Fox, 2013).
Thus, our goal is to have compensation mechanisms that get people to play this game as they normally would, finish their games, and put up with our (slightly) cumbersome interface. Part of the compensation is non-monetary: a game experience with players that are more engaged than the average online player. To encourage complete games, most of the payment is conditioned on finishing a game, with rewards for doing well in the game. Players get at least $40 upon finishing a game.[8] Additionally, we provide bonuses for specific outcomes: $24 for winning the game (an evenly divisible amount that can be split among remaining players) and $10 for having the most successful lies, i.e., statements they marked as a lie that others believed.[9] Diplomacy usually ends with a handful of players dividing the board among themselves and agreeing to a tie. In the game described in Section 7.2.1, the remaining four players shared the winner's pool with Italy after 10 in-game years, and Italy won the prize for most successful lies.
[8] They receive $10 simply for having begun the study. No players dropped out of our games.
[9] The lie incentive is relatively small (compared to the $40 incentive for participation and up to $24 for winning) to discourage an opportunistic player from marking everything as a lie. Games were monitored in real time and no player was found abusing the system (marking more than ~20% lies).

7.4.3 Data overview
Table 7.2 quantitatively summarizes our data. Messages vary in length and can be paragraphs long (Figure 7.3).

  Category              Value
  Message Count         13,132
  actual lie Count      591
  suspected lie Count   566
  Average # of Words    20.79
Table 7.2: Summary statistics for our train data (nine of twelve games). Messages are long and only five percent are lies, creating a class imbalance.

[Figure 7.3: Histogram of word count per message (0-300 words). Individual messages can be quite long, wrapping deception in pleasantries and obfuscation.]
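Assuming message records shaped like those in the Section 7.4.1 sketch, the Table 7.2 statistics amount to a few counts; the field names here are again hypothetical.

```python
def summarize(messages):
    """Tally Table 7.2-style summary statistics from an annotated message log."""
    n = len(messages)
    return {
        "Message Count": n,
        "actual lie Count": sum(m["sender_annotation"] == "lie" for m in messages),
        "suspected lie Count": sum(m["receiver_annotation"] == "lie" for m in messages),
        "Average # of Words": sum(len(m["text"].split()) for m in messages) / n,
    }
```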
Close to five percent of all messages in the dataset are marked as lies and almost the same percentage (but not necessarily the same messages) are perceived as lies, consistent with the "veracity effect" (Levine et al., 1999). In the game discussed above, eight percent of messages are marked as lies by the sender and three percent of messages are perceived as lies by the recipient; however, the messages perceived as lies are rarely lies (Figure 7.4).

7.4.4 Demographics and self-assessment
We collect anonymous demographic information from our study participants: the average player identifies as male, is between 20 and 35 years old, speaks English as their primary language, and has played over fifty Diplomacy games.[10] Players self-assess their lying ability before the study. The average player views themselves as better than average at lying and average or better than average at perceiving lies. In a post-game survey, players provide information on whom they betrayed and who betrayed them in a given game. This is a finer-grained determination than the post hoc analysis used in past work on Diplomacy (Niculae et al., 2015). We ask players to optionally provide linguistic cues to their lying and to summarize the game from their perspective.
[10] Our data skews 80% male and 95% of the players speak English as a primary language. Ages range from eighteen to sixty-four. Game experience is distributed across beginner, intermediate, and expert levels.

Table 7.3: Examples of messages that were intended to be truthful or deceptive by the sender or receiver. Most messages occur in the top left quadrant (Straightforward). Figure 7.4 shows the full distribution. Both the intended and perceived properties of lies are of interest in our study.
  Straightforward (sender: Truth, receiver: Truth): "Salut! Just checking in, letting you know the embassy is open, and if you decide to move in a direction I might be able to get involved in, we can probably come to a reasonable arrangement on cooperation. Bonne journee!"
  Cassandra (sender: Truth, receiver: Lie): "I don't care if we target the T first or A first. I'll let you decide. But I want to work as your partner. ... I literally will not message anyone else until you and I have a plan. I want it to be clear to you that you're the ally I want."
  Deceived (sender: Lie, receiver: Truth): "You, sir, are a terrific ally. This was more than you needed to do, but makes me feel like this is really a long term thing! Thank you."
  Caught (sender: Lie, receiver: Lie): "So, is it worth us having a discussion this turn? I sincerely wanted to work something out with you last turn, but I took silence to be an ominous sign."

[Figure 7.4: Counts of messages in each quadrant of sender's intention by receiver's perception (Straightforward, Cassandra, Deceived, Caught). Most messages are truthful messages identified as the truth. Lies are often not caught. Table 7.3 provides an example from each quadrant.]

7.4.5 An ontology of deception
Four possible combinations of deception and perception can arise from our data. The sender can be lying or telling the truth. Additionally, the receiver can perceive the message as deceptive or truthful. We name the possible outcomes for lies as Deceived or Caught, and the outcomes for truthful messages as Straightforward or Cassandra,[11] based on the receiver's annotation (examples in Table 7.3, distribution in Figure 7.4).
[11] In myth, Cassandra was cursed to utter true prophecies but never be believed. For a discussion of Cassandra's curse vis-à-vis personal and political oaths, see Torrance (2015).
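The mapping from the two binary annotations to these four outcomes is deterministic; a small sketch:

```python
def deception_quadrant(actual_lie: bool, suspected_lie: bool) -> str:
    """Map sender intention and receiver perception to the four outcomes."""
    if actual_lie:
        return "Caught" if suspected_lie else "Deceived"
    return "Cassandra" if suspected_lie else "Straightforward"
```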
7.5 Detecting Lies
We build computational models both to detect lies and to better understand our dataset. The data from the user study provide a training corpus that maps language to annotations of truthfulness and deception. Our models progressively integrate information (conversational context and in-game power dynamics) to approach human parity in deception detection.

7.5.1 Metric and data splits
We investigate two phenomena: detecting what is intended as a lie and what is perceived as a lie. However, this is complicated because most statements are not lies: less than five percent of the messages are labeled as lies in both the actual lie and the suspected lie tasks (Table 7.2). Our results use a weighted F1 score across truth and lie prediction, as accuracy is an inflated metric given the class imbalance (Japkowicz and Stephen, 2002). We thus adopt an in-training approach (Zhou and Liu, 2005) where incorrect predictions of lies are penalized more than truthful statements. The relative penalty between the two classes is a hyper-parameter tuned on F1.
Before we move to computational models for lie detection, we first establish the human baseline. We know when senders were lying and when receivers spotted a lie. Humans spot 88.3% of lies. However, given the class imbalance, this sounds better than it is. Following the suggestion of Levine et al. (1999), we focus on the detection of lies, where humans have a 22.5 Lie F1.

Figure 7.5: Test set results for both our actual lie and suspected lie tasks. We provide baseline (Random, Majority Class), logistic (language features, bag of words), and neural (combinations of an lstm with bert) models. The neural model that integrates past messages and power dynamics approaches human F1 for actual lie (top). For actual lie, the human baseline is how often the receiver correctly detects senders' lies. The suspected lie task lacks such a baseline.
  Actual lie:
    Model                Macro F1   Lie F1
    Random               39.8       14.9
    Majority Class       47.8       -
    Harbingers           52.8       24.6
    Bag of Words         52.9       23.7
    Bag of Words+Power   54.3       19.1
    LSTM                 54.9       20.2
    Context LSTM         53.8       13.7
    Context LSTM+Power   55.8       19.2
    Human                52.7       13.5
  Suspected lie:
    Random               38.3       11.8
    Majority Class       48.3       -
    Harbingers           45.9       14.7
    Bag of Words         45.1       15.5
    Bag of Words+Power   51.5       13.7
    LSTM                 51.6       13.9
    Context LSTM         53.8       13.6
    Context LSTM+Power   54.3       15.0
    Human                53.3       15.1

To prevent overfitting to specific games, nine games are used as training data, one is used for validation for tuning parameters, and two games are test data. Some players repeat between games.

7.5.2 Logistic regression
Logistic regression models (Background Section 2.3.1) have interpretable coefficients which show linguistic phenomena that correlate with lies. A word that occurs infrequently overall but often in lies, such as "honest" or "candidly", helps identify which messages are lies.
Niculae et al. (2015) propose linguistic Harbingers that can predict deception. These are word lists that cover topics often used in interpersonal communication: claims, subjectivity, premises, contingency, comparisons, expansion, temporal language associated with the future, and all other temporal language. The Harbingers word lists do not provide full coverage, as they focus on specific rhetorical areas. A logistic regression model with all word types as features further improves F1.
Power dynamics influence the language and flow of conversation (Danescu-Niculescu-Mizil et al., 2012, 2013; Prabhakaran et al., 2013).
These dynamics may influence the likelihood of lying; a stronger player may feel empowered to lie to their neighbor. Recall that victory points (Section 7.2) encode how well a player is doing (more is better). We represent the power differential as the difference between the two players. Peers will have a zero differential, while more powerful players will have a positive differential with their interlocutor. The differential changes throughout the game, so this feature encodes the difference in the season the message was sent. For example, a message sent by an Italy with seven points to a Germany with two points in a given season would have a value of five.

7.5.3 Neural
While less interpretable, neural models are often more accurate than logistic regression ones (Ribeiro et al., 2016; Belinkov and Glass, 2019). We build a standard long short-term memory network (Hochreiter and Schmidhuber, 1997, lstm in Section 2.3.4) to investigate if word sequences, ignored by logistic regression, can reveal lies.
Integrating message context and power dynamics improves on the neural baseline. A hierarchical lstm can help focus attention on specific phrases in long conversational contexts. In the same way it would be difficult for a human to determine prima facie if a statement is a lie without previous context, we posit that methods that operate at the level of a single message are limited in the types of cues they can extract. The hierarchical lstm is given the context of previous messages when determining if a given message is a lie, which is akin to the labeling task humans do when annotating the data. The model does this by encoding a single message from the tokens, and then running a forward lstm over all the messages. For each message, it looks at both the content and previous context to decide if the current message is a lie. Fine-tuning bert (Devlin et al., 2019a) embeddings, introduced in Background Section 2.3.4, for this model did not lead to notable improvement in F1, likely due to the relatively small size of our training data. Last, we incorporate information about power imbalance into this model. This model approaches human performance in terms of F1 score by combining content with conversational context and power imbalance.
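A rough sketch of the simpler end of this modeling spectrum, a bag-of-words logistic regression with the power-differential feature and a heavier penalty on missed lies, follows the descriptions in Sections 7.5.1 and 7.5.2. The specific class weight stands in for the tuned hyper-parameter; this is not the exact released implementation.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def power_differential(sender_points, receiver_points):
    # e.g., Italy with 7 points messaging Germany with 2 points -> +5
    return sender_points - receiver_points

def train_lie_detector(texts, powers, labels, lie_weight=10.0):
    """Bag of words + power feature, with lies penalized more than truths."""
    vec = CountVectorizer()
    X = hstack([vec.fit_transform(texts),
                csr_matrix(np.array(powers, dtype=float).reshape(-1, 1))])
    clf = LogisticRegression(max_iter=1000,
                             class_weight={0: 1.0, 1: lie_weight})
    clf.fit(X, labels)  # labels: 1 = actual lie, 0 = truthful
    return vec, clf

def lie_f1(vec, clf, texts, powers, labels):
    """Lie F1: F1 of the positive (lie) class only."""
    X = hstack([vec.transform(texts),
                csr_matrix(np.array(powers, dtype=float).reshape(-1, 1))])
    return f1_score(labels, clf.predict(X), pos_label=1)
```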
7.6 Qualitative Analysis
This section examines specific messages where both players and machines correctly identify lies and where they make mistakes on our test set. Most messages are correctly predicted by both the model and players (2,055 of 2,475 messages), but this is because of the veracity effect. The picture is less rosy if we only look at messages the sender marks as actual lie: both players and models are generally wrong (Table 7.5).

Table 7.4: An example of an actual lie detected (or not) by both players and our best computational model (Context lstm + Power) from each quadrant. Both the model and the human recipient are mostly correct overall (Both Correct), but they are both mostly wrong when it comes to specifically predicting lies (Both Wrong).
  Both Correct (player correct, model correct): "Not sure what your plan is, but I might be able to support you to Munich."
  Player Correct (model wrong): "Don't believe Turkey, I said nothing of the sort. I imagine he's just trying to cause an upset between us."
  Model Correct (player wrong): "Long time no see. Sorry for the stab earlier. I think we should try to work together to stop france from winning; if we work together we can stop france from getting 3 more centers, and then we will all win in a 3, 4, or 5 way draw when the game is hard-capped at 1910."
  Both Wrong: "I'm considering playing fairly aggressive against England and cutting them off at the pass in 1901, your support for that would be very helpful."

Table 7.5: Conditioning on only lies, most messages are now identified incorrectly by both our best model (Context lstm + Power) and players.
                   Model Correct   Model Wrong
  Player Correct   10              32
  Player Wrong     28              137

Both models and players can detect lies when liars get into specifics. In Diplomacy, users must agree to help one another through orders that stipulate "I will help another player move from X to Y". The in-game term for this is "support"; half the messages where players and computers correctly identify lies contain this word, but it rarely occurs in the other quadrants.
Models seem to be better at not falling for vague excuses or fantastical promises about the future. Players miss lies that promise long-term alliances, involve extensive apologies, or attribute motivation as coming from other countries' disinformation (Model Correct). Unlike our models, players have access to conversations with other players, and accordingly players can detect lies that can easily be verified through conversations with other players (Player Correct).
However, ultimately most lies are believable and fool both models and players (Both Wrong). For example, all messages that contain the word "true" are predicted as truthful by both models and players. Many of these messages are relatively tame,[12] confirming the Pinocchio effect found by Swol et al. (2012). If liars can be detected when they wax prolix, perhaps the best way to avoid detection is to be terse and to the point.
[12] Examples include "It's true - [Budapest] back to [Rumania] and [Serbia] on to [Albania] could position for more forward convoys without needing the rear fleet..." and "idk if it's true just letting u know since were allies".
Sometimes additional contextual information helps models improve over player predictions. For example, when France tells Austria "I am worried about a steamroller Russia Turkey alliance", the message is incorrectly perceived as truthful by both the player and the single-message model. However, once the model has context (a preceding question asking if Austria and Turkey were cooperating) it can detect the lie.
Finally, we investigate categories from the Harbingers (Niculae et al., 2015) word lists. Lies are more likely to contain subjectivity and premises while true messages include expansion phrases ("later", "additionally"). We also use specific words in the bag-of-words logistic regression model. The coefficient weights of words that express sincerity (e.g., "sincerely", "frankly") and apology (e.g., "accusation", "fallout", "alternatives") skew toward actual lie prediction in the logistic regression model. More laid-back appellations (e.g., "dude", "man") skew towards truthfulness, as do words associated with reconnaissance (e.g., "fyi", "useful", "information") and time (e.g., "weekend", "morning"). Contested areas on the Diplomacy map, such as Budapest and Sevastopol, are more likely to be associated with lies, while more secure ones, like Berlin, are more likely to be associated with truthful messages. These findings were not released to players during data collection to avoid influencing players' language.
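Coefficient inspection of this kind requires no extra machinery; the sketch below reuses the hypothetical vec and clf objects from the earlier logistic-regression sketch to surface lie-leaning and truth-leaning words.

```python
import numpy as np

def top_lie_words(vec, clf, k=10):
    """Return the k most lie-leaning and k most truth-leaning vocabulary items."""
    vocab = np.array(vec.get_feature_names_out())
    weights = clf.coef_[0][: len(vocab)]  # drop the appended power feature
    order = np.argsort(weights)           # ascending: truth-leaning first
    return {"lie_leaning": vocab[order[-k:]][::-1].tolist(),
            "truth_leaning": vocab[order[:k]].tolist()}
```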
7.7 Related Work
Early computational deception work focuses on single utterances (Newman et al., 2003), especially for product reviews (Ott et al., 2012). But deception is intrinsically a discursive phenomenon, and thus the context in which it appears is essential. Our platform provides an opportunity to observe deception in the context in which it arises: goal-oriented conversations around in-game objectives. Gathering data through an interactive game has a cheaper per-lie cost than hiring workers to write deceptive statements (Jurgens and Navigli, 2014).
Other conversational datasets are mostly based on games that involve deception, including Werewolf (Girlea et al., 2016), Box of Lies (Soldner et al., 2019), and tailor-made games (Ho et al., 2017). However, these games assign individuals roles that they maintain throughout the game (i.e., in a role that is supposed to deceive or in a role that is deceived). Thus, deception labels are coarse: an individual always lies or always tells the truth. In contrast, our platform better captures a more multi-faceted reality about human nature: everyone can lie or be truthful with everyone else, and they use both strategically. Hence, players must think about every player lying at any moment: "given the evidence, do I think this person is lying to me now?"
Deception data with conversational labels is also available through interviews (Pérez-Rosas et al., 2016), some of which allow for finer-grained deception spans (Levitan et al., 2018). Compared with game-sourced data, however, interviews provide shorter conversational context (often only a single exchange with a few follow-ups) and lack a strategic incentive: individuals lie because they are instructed to do so, not to strategically accomplish a larger goal. In Diplomacy, users have an intrinsic motivation to lie; they have entertainment-based and financial motivations to win the game. This leads to higher-quality, creative lies.
Real-world examples of lying include perjury (Louwerse et al., 2010), calumny (Fornaciari and Poesio, 2013), emails from malicious hackers (Dhamija et al., 2006), and surreptitious user recordings. But real-world data comes with real-world complications and privacy concerns. The artifice of Diplomacy allows us to gather pertinent language data with minimal risk and to access both sides of deception: intention and perception. Other avenues for less secure research include analyzing dating profiles for accuracy in self-presentation (Toma and Hancock, 2012) and classifying deceptive online spam (Ott et al., 2011).

Chapter 8: Quantity and (Mostly) Quality Through Hybridization[1]

As a dovetail between crowd-driven and expert-driven data sources, we propose a hybrid solution that pairs a crowd worker with an expert. Our Diplomacy dataset (Chapter 7) shows that experts generate creative dialog. Our canard dataset (Chapter 5) shows that the crowd can perform simple tasks quickly, if there is a quality control process. We pair the two types of users together to create dialog (Section 2.1.3) in an area more broadly applicable than Diplomacy: customer service. This creates the verisimilitude of a customer, simulated by a worker from the crowd, interacting with a customer service agent, played by an actual professional customer service agent. The resulting dataset illustrates the stark contrast in the language generated by anonymous crowd workers and experts.
Furthermore, it demonstrates how nlp generation and annotation can be scaled through the crowd, while being quality controlled by an expert.
[1] Denis Peskov, Nancy Clarke, Jason Krone, Brigi Fodor, Yi Zhang, Adel Youssef, and Mona Diab. Multi-domain goal-oriented dialogues (MultiDoGO): Strategies toward curating and annotating large scale dialogue data. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4518-4528, 2019. Peskov planned and implemented some of the crowd-sourcing tasks, supervised the data collection thereof, wrote some of the task instructions, performed data analysis, and wrote most of the paper.

Table 8.1: A segment of a dialog from the airline domain annotated at the turn level. This data is annotated with agent dialog acts (DA), customer intent classes (IC), and slot labels (SL). Roles C and A stand for "Customer" and "Agent".
  A: "Hey there! Good morning. You're connected to LMT Airways. How may I help you?" -- DA = { elicitgoal }
  C: "Hi, I wonder if you can confirm my seat assignment on my flight tomorrow?" -- IC = { SeatAssignment }
  A: "Sure! I'd be glad to help you with that. May I know your last name please?" -- DA = { elicitslot }
  C: "My last name is Turker." -- IC = { contentonly }, SL = { Name : Turker }
  A: "Alright Turker! Could you please share the booking confirmation number?" -- DA = { elicitslot }
  C: "I believe it's AMZ685." -- IC = { contentonly }, SL = { Confirmation Number : AMZ685 }
  ...

8.1 The Goal of Creating Goal-Oriented Dialog
Modern Natural Language Understanding (nlu) frameworks for dialog, which integrate syntax, semantics, and inference (Winograd, 1972), are data hungry. Processing goal-oriented dialog, which understands a user request and completes a related task with a clear goal within a limited number of dialog turns (Bordes et al., 2016), is an emblematic task of nlu: it requires extracting key information from free-form language. Large amounts of training data representative of the context are needed, as human responses in goal-oriented dialogs are less predictable than those of automated systems (Bordes et al., 2016). For example, a broader context, like the questions in canard (Chapter 5), is required to correctly interpret a command to "Please do this". This task can only be completed by seeing previous utterances, such as requests to book a flight on a specific day to a specific destination. A further complication arises as multiple phrases can express a single intent depending on context: "book my flight", "finalize my reservation", and "Yes, the 6 pm one" may all refer to a flight-booking intent. Hence, we must generate entire conversations, rather than independent utterances.
Training goal-oriented dialog systems, and nlu in general, would benefit from large, varied, and ideally human-generated datasets. Joint training and transfer learning (Dong et al., 2015; Devlin et al., 2019b) benefit natural language processing tasks; however, these approaches have yet to become widely used in dialog tasks due to a lack of large-scale datasets. Furthermore, end-to-end neural approaches benefit from such training data more than past work on goal-oriented dialog structured around slot filling (Lemon et al., 2006; Wang and Lemon, 2013). Conveniently, the training data for goal-oriented dialogs occurs organically: people frequently converse with automated systems in customer service.
Customers reach out to agents, automated bots or real people, to fulfill a domain-specific goal. The prevalence of human-machine interaction in customer service, driven by personal virtual assistants and automated service portals, has caused the number of possible goals to multiply: ordering a meal, booking a plane ticket, and resolving an information technology problem are all contexts in which goal-oriented dialog occurs. This creates an unbalanced conversation: agents operate within a set procedure and convey a patient and professional tone. In contrast, customers do not have this incentive; rather, they want to complete their task as quickly as possible. However, to date, the largest available multi-domain goal-oriented dialog dataset assigns similar dialog act annotations to both agents and customers (Budzianowski et al., 2018).
We curate, annotate, and evaluate a large-scale multi-domain set of goal-oriented dialogs, MultiDoGO, to address the prior limitations. One way to simulate data for a domain, without risking the release of personally identifying information, is to use a Wizard-of-Oz data gathering technique, which requires that participants in a conversation fulfill a role (Kelley, 1984). Popular goal-oriented datasets, dstc (Williams et al., 2016) and MultiWOZ (Budzianowski et al., 2018), use this approach. Hence, our dataset is gathered from workers in the crowd paired with professional annotators using Wizard-of-Oz. The dataset generated comprises over 86K raw conversations, of which 54,818 conversations are annotated at the turn level; this is a geometric increase over the number of utterances generated in Chapter 7. We investigate multiple levels of annotation granularity: annotating a subset of the data at both turn and sentence levels. Generating and annotating such data given its contextual setting is nontrivial. We furthermore illustrate the efficacy of our devised approaches and annotation decisions against intrinsic metrics and via extrinsic evaluation by applying neural baselines for Dialog Acts, Intent Classification, and Slot Labeling.

8.2 Existing Dialog Datasets
Chit-chat dialog without goals has been popular since eliza (Weizenbaum, 1966) and has been investigated with neural techniques (Li et al., 2016, 2017). However, these datasets cannot model goal-oriented tasks. Related dialog dataset collections used for sequential question answering (Chapter 5) rely on dialog to answer questions, but the task differs from our use case of modeling goal-oriented conversations, hence leading to different evaluation considerations than downstream question answering (Choi et al., 2018; Reddy et al., 2019).
There are multiple existing goal-oriented dialog collections generated by humans through Wizard-of-Oz techniques (Kelley, 1984). The Dialog State Tracking Challenge, aka Dialog Systems Technology Challenge (dstc), spans eight iterations and entails the domains of bus timetables, restaurant reservations, hotel bookings, travel, alarms, and movies (Williams et al., 2016). Frames (Asri et al., 2017) has 1,369 dialogs about vacation packages. MultiWOZ contains 10,438 dialogs about Cambridge hotels and restaurants (Budzianowski et al., 2018). Some dialog datasets specialize in a single domain. In addition to the datasets mentioned in Background Section 2.1.3, ATIS (Hemphill et al., 1990) comprises speech data about airlines structured around formal airline flight tables.
Similarly, the Google Airlines dataset purportedly contains 400,000 templated dialogues about airline reservations (Wei et al., 2018).[2]
[2] The Google Airlines dataset has not been released to date despite the existence of a paper describing it.

8.3 MultiDoGO Dataset Generation
Generating and annotating a dataset of this scale requires task design, data collection, and post-task quality control.
[Figure 8.1: Crowd-sourced annotators select an intent and choose a slot in our custom-built Mechanical Turk interface. Entire conversations are provided for reference. Detailed instructions are provided to users, but are not included in this figure. Options are unique per domain.]

8.3.1 Defining Dialog
We define the dialog terminology that is discussed in our design process. A turn is a sequence of speech/text sentences by a participant in a conversation. A sentence is a period-delimited sequence of words in a turn. A turn may comprise multiple sentences. We use the term utterance to refer to either unit (a turn or a sentence, spoken or written by a participant).[3] In our devised annotation strategy, we distinguish between dialog speech acts for agents vs. customers. In MultiDoGO, the agents' speech acts [da] are annotated with generic class labels common across all domains, while customer speech acts are labeled with intent classes [ic]. Moreover, we annotate customer utterances with a further level of detail.
[3] We acknowledge that the term utterance is controversial in the literature (Pareti and Lando, 2018).

8.3.2 Data Collection Procedure
We employ professional annotators, whom we train, and crowd-sourced workers from Mechanical Turk (MTurkers) to generate conversational data using a Wizard-of-Oz approach.[4] In each conversation, a data associate assumes the role of an agent while the MTurkers act as customers. In an effort to source competent MTurkers, we require that each MTurker have a Human Intelligence Task (HIT) accuracy minimum of 90%, a location in the United States, and a history of completed HITs.[5] We give each agent a prompt listing the supported request types (dialog acts) and pieces of information (slots) needed to complete each request to structure goal-oriented conversations between the customer and agent. We also specify criteria such as minimal conversation length, number of goals, and number of complex requests to increase conversation diversity (Figure 8.2). We explicitly request that neither agents nor customers use any personally identifiable information.[6] At an implementation level, we create a custom web interface for the MTurkers and data associates that displays our instructions next to the current dialog. This allows each participant to quickly refer to our instructions without stopping the conversation.
[4] The professional annotators are salaried employees of the company engaging in this research. They were staffed on this project full-time for three months. Training sessions were conducted in person for a full day, explaining the annotation guidelines and answering any questions that arose.
[5] Qualified MTurkers were allowed to complete the generation and annotation tasks multiple times.
[6] They are, however, encouraged to fabricate information for slots (e.g., John Smith as a name).
MultiDoGO follows a familiar Wizard-of-Oz elicitation procedure and curates data for multiple domains akin to previous data collection efforts such as MultiWOZ. However, MultiDoGO comprises more varied domains, is an order of magnitude larger, and is
curated with prompts to ensure diverse conversations. This is a novel collection strategy, as we explicitly guide and prod the participants in a dialog to engage in conversations with specific biases such as intent change, slot change, multi-intent, multiple slot values, slot overfilling, and slot deletion. For example, in the Fast Food domain, participants pretend that they are ordering fast food from a drive-thru. After making their initial order, they are instructed to change their mind about what they are ordering: "I'd like a burger. No wait, can you make that a chicken sandwich?". In the Financial domain, we asked participants to request multiple intents, such as "I'd like to find my routing number and check my balance."[7] To that end, our collection procedure deliberately attempts to guide the dialog flow to ensure diversity in dialog policies.
[7] For a full list of conversational biases with examples, please see the Appendix.

8.4 Data Annotation
Annotation classifies the thousands of conversations in our dataset. Of particular interest, a direct comparison of using experts versus the crowd is made in Section 8.4.2. Our annotators use a web interface (Figure 8.1) to select the appropriate intent class for an utterance out of a list of provided options. To annotate slot labels, they use their cursors to highlight slot value character spans within an utterance and then select the corresponding slot label from a list of options. The output of this slot labeling process is a list of (slot-label, slot-value, span) triplets for each utterance.

8.4.1 Annotated Dialog Tasks
Our dataset has three types of annotation: agent dialog acts [da], customer intent classes [ic], and slot labels [sl]. We intentionally decouple agent and customer speech act tags into the categories da and ic to produce more fine-grained speech act tags than past iterations of dialog datasets. Intuitively, agent das are consistent across domains and more general in nature, since agents have a standard form of response. On the other hand, customer ics are domain-specific and can entail reserving a hotel room or ordering a burger, depending on the domain. A conversation example with annotations is provided in Table 8.1.
Agent Dialog Acts (da): Agent dialog acts are the most straightforward of our annotation tasks. The possible das are the same in all domains: ElicitGoal, ElicitSlot, ConfirmGoal, ConfirmSlot, EndGoal, Pleasantries, and Other. Elicit Goal/Slot indicates that the agent is gathering information. Confirm Goal/Slot indicates that the agent is confirming previously provided information. The EndGoal and Pleasantries tags identify non-task-related actions.[8] Other indicates that the selected utterance was not one of the other possible tags. Agent dialog acts are consistent across domains and are often abstract (e.g., ElicitIntent, ConfirmSlot).
[8] EndGoal is a frequently occurring case of Pleasantry in which the agent informs the customer that the goal has been completed and asks if anything else is required.
Customer Intent Classes (ic): Unlike agent da, customer ic vary for each domain and are more concrete. For example, the Airline domain has a "BookFlight" ic, Fast Food has an "OrderMeal" ic, and Insurance has an "OrderPolicy" ic in our annotation schema. Customer intents can overlap across domains (e.g., OpeningGreeting, ClosingGreeting) and at other times be domain-specific (e.g., RequestCreditLimitIncrease, OrderBurger, BookFlight).
Slot Labels (sl): Slot labeling is a task contingent on customer intent classes. Certain intents require that additional information, namely slot values, be captured. For instance, to open a bank account, one must solicit the customer's social security number. Slots can overlap across intents (e.g., Name, ssn) or be unique to a domain-specific intent (e.g., CarPolicy).

8.4.2 Annotation Design Decisions

Decoupled Agent and Customer Label Sets: Agents and customers have notably different goals and styles of communication. However, past dialog datasets do not make this distinction at the speech-act schema level. Specificity is important for generating unique customer requests, but a relatively formulaic approach is required of agents across different industries. Our distinction between the customer and agent roles creates training data for a bot that explicitly simulates agents.

Annotation Unit Granularity: Sentence vs. Turn Level: An important decision, which is often under-discussed, is the proper semantic unit of text to annotate in a dialog. Commonly, datasets provide annotations at the turn level (Budzianowski et al., 2018; Asri et al., 2017; Mihail et al., 2017). However, turn-level annotations can introduce confusion for ic datasets, given that multiple intents may be present in different sentences of a single turn. For instance, consider the turn, "I would like to book a flight to San Francisco. Also, I want to cancel a flight to Austin." Here, the first sentence has the BookFlight intent and the second sentence has the CancelFlight intent. A turn-level annotation of this utterance would yield the multi-class intent (BookFlight, CancelFlight). In contrast, a sentence-level annotation identifies that the first sentence corresponds to BookFlight while the second corresponds to CancelFlight. We annotate a subset of our data (2,500 conversations per domain, 15,000 conversations in total) at both the sentence and the turn level to assess the effect of this design choice on downstream accuracy (Table 8.8). The remainder of our dataset is annotated only at the turn level.

Professional vs. Crowd-Sourced Workers for Annotation: For annotation, we compare and contrast professional annotators with crowd-sourced annotators on a subset of data. Professional annotators assign da, ic, and sl tags to the 15,000 conversations annotated at both the turn and sentence level; statistics for these conversations are given in Table 8.6. In an effort to decrease annotation cost, we employ crowd-sourced annotators via Mechanical Turk to label an additional 54,818 conversations rated as Good or Excellent quality during data collection.[9] We provide statistics for this set of crowd-annotated data in Table 8.3. To compare the quality of crowd-sourced annotations against professional annotations, we use both strategies to annotate a shared subset of 8,450 conversations. We devise an Inter Source Annotation Agreement (isaa) metric to measure the agreement of these crowd-sourced and professionally sourced annotations. isaa is a relaxation of Cohen's κ, intended to count partial agreement of multi-tag labels. isaa defines two sets of tags, A and B, to be in agreement if there is at least one shared tag in both A and B. A and B reflect the majority labels agreed upon per source (professionals or crowd workers). We report isaa for the da, ic, and sl tasks in Table 8.2. Crowd-sourced and professional annotations overlap substantially. Therefore, the crowd can be used for nlp annotation, provided the annotations are verified to be comparable to those of experts.

Dialog Act   Intent Classes   Slot Labels
0.701        0.728            0.695

Table 8.2: Inter Source Annotation Agreement (isaa) scores quantifying the agreement of crowd-sourced and professional annotations.

Footnote 9: Users are still paid for conversations that are not rated as such. Since this rating happens after conversation generation, we do not need to exclude users responsible for bad conversations from future ones.
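The relaxed agreement criterion behind isaa (two multi-tag label sets agree if they share at least one tag) is simple to compute. The sketch below reports only the observed agreement rate under that criterion; the full isaa score additionally applies a chance correction in the style of Cohen's κ, which is not reproduced here, and the example labels are hypothetical.

from typing import Dict, Set


def relaxed_agreement(crowd: Dict[str, Set[str]],
                      professional: Dict[str, Set[str]]) -> float:
    """Fraction of utterances whose majority tag sets share at least one tag.

    `crowd` and `professional` map an utterance id to the majority label
    set produced by each annotation source.
    """
    shared_ids = crowd.keys() & professional.keys()
    agreements = sum(1 for uid in shared_ids
                     if crowd[uid] & professional[uid])
    return agreements / len(shared_ids)


crowd_labels = {"utt-1": {"BookFlight"}, "utt-2": {"BookFlight", "CancelFlight"}}
pro_labels = {"utt-1": {"BookFlight"}, "utt-2": {"ChangeSeat"}}
print(relaxed_agreement(crowd_labels, pro_labels))  # 0.5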
8.4.3 Quality Control

Three processes enforce data quality. First, during data collection, our experts report on the quality of each conversation. Specifically, the expert grades the conversation on a scale of "Unusable", "Poor", "Good", or "Excellent", following instructions concerning coherence, whether the dialog achieved the purported goal, and so on, to decide on the rating. We keep conversations with "Good" or "Excellent" ratings in subsequent annotation to maximize the quality of our dataset.

Second, each conversation is annotated at least twice. We resolve inconsistent annotations by selecting the annotation given by the majority of annotators for an item.[10] We calculate inter-annotator agreement with Fleiss' κ and find "substantial agreement".[11] Our annotators must pass a qualification test as well as maintain an ongoing level of accuracy on randomly distributed test questions throughout their annotation.

Third, we pre-process our data to remove issues such as duplicate conversations and improperly entered slot-value spans. Further pre-processing details are in Section 8.5.

Footnote 10: We drop annotations in which there is no agreement.

Footnote 11: We use Fleiss' κ, unlike in the earlier professional/crowd-worker comparison, because we have more than two annotators for this task.
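For reference, Fleiss' κ, which we use for the inter-annotator agreement numbers reported below, can be computed from an items-by-categories matrix of rating counts. This is a generic textbook implementation shown for illustration; it is not the project's actual evaluation code, and the example ratings are invented.

import numpy as np


def fleiss_kappa(counts: np.ndarray) -> float:
    """Fleiss' kappa for an (items x categories) matrix of rating counts.

    counts[i, j] is the number of annotators who assigned item i to
    category j; every row must sum to the same number of annotators.
    """
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-item observed agreement.
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()
    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = (p_j ** 2).sum()
    return (p_bar - p_e) / (1 - p_e)


# Three annotators labeling four utterances with one of three dialog acts.
ratings = np.array([[3, 0, 0],
                    [2, 1, 0],
                    [0, 3, 0],
                    [1, 1, 1]])
print(round(fleiss_kappa(ratings), 3))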
8.4.4 Dataset Characterization and Statistics

The MultiDoGO dataset is the most diverse dialog dataset because it covers more domains and is generated, rather than scraped from existing and dubiously reliable data sources (e.g., Ubuntu forums). Table 8.3 shows the statistics for MultiDoGO raw conversations generated, rated as Excellent or Good, and annotated for da, ic, and sl. Table 8.4 shows the number of conversations per domain reflecting the specific biases used.

Domain      Elicited   Good/Excellent   IC/SL   DA/IC/SL
Airline        15100            14205    7598       6287
Fast Food       9639             8674    7712       4507
Finance         8814             8160    8002       6704
Insurance      14262            13400    7799       7434
Media          33321            32231   19877      12891
Software        5562             4924    3830       2753
Total          86698            81594   54818      40576

Table 8.3: Total number of conversations per domain. Elicited is the number of raw conversations elicited; Good/Excellent is the total number of conversations rated as such by the agent annotators; IC/SL is the number of conversations annotated for intent classes and slot labels only; DA/IC/SL is the total number of conversations annotated for dialog acts, intent classes, and slot labels.

Bias           Airlines   Fast Food   Finance   Insurance   Media   Software
IntentChange          -        1443         -           -       -          -
MultiIntent        2200        1913      1799        1061     607       2295
MultiValue            -         354         -           -       -          -
Overfill              -           -      1486        2763       -          -
SlotChange         4207        2011      2506        3321     570       2085
SlotDeletion          -         333         -           -       -          -
Total              6407        6054      5791        7145    1177       4380

Table 8.4: Number of conversations per domain collected with specific biases. Fast Food had the maximum number of biases. MultiIntent and SlotChange are the most used biases.

MultiDoGO is several times larger than comparable datasets in nearly every dimension: the number of conversations, the length of the conversations, the number of domains, and the diversity of the utterances used. Table 8.5 provides comparative statistics.

Metric                    dstc2   woz2.0      M2M   MultiWOZ   MultiDoGO
Number of Dialogs         1,612      600    1,500      8,438      40,576
Total Number of Turns    23,354    4,472   14,796    115,424     813,834
Total Number of Tokens  199,431   50,264  121,977  1,520,970   9,901,235
Avg. Turns per Dialog     14.49     7.45     9.86      15.91       20.06
Avg. Tokens per Turn       8.54    11.24     8.24      13.18       12.16
Total Unique Tokens         986    2,142    1,008     24,071      70,003
Number of Unique Slots        8        4       14         25          73
Number of Slot Values       212       99      138      4,510      55,816
Number of Domains             1        1        1          7           6
Number of Tasks               1        1        2          2           3

Table 8.5: MultiDoGO is several times larger in nearly every dimension than the pertinent datasets as selected by Budzianowski et al. (2018). We provide counts for the training data, except for Frames, which does not have splits. Our larger number of unique tokens and slots can be attributed to not relying on carrier phrases.

We provide summary statistics for the subset of our data annotated at both turn and sentence granularity in Tables 8.6 and 8.7. These describe the total size of the data per domain in number of conversations, turns, and the unique number of intents and slots, as well as inter-annotator agreement (iaa) for both turn- and sentence-level annotations. da annotations have much higher iaa at the sentence level than at the turn level, most notably in the Fast Food domain. ic and sl annotations show slightly higher iaa at the turn level than at the sentence level.

Domain     #Conv   #Turn   #Turn/Conv   #Sentence   #Intent   #Slot
Airline    2,500  39,616    15.8 (15)      66,368        11      15
Fast Food  2,500  46,246    18.5 (18)      73,305        14      10
Finance    2,500  46,001    18.4 (18)      70,828        18      15
Insurance  2,500  41,220    16.5 (16)      67,657        10       9
Media      2,500  35,291    14.1 (14)      65,029        16      16
Software   2,500  40,093    16.0 (15)      70,268        16      15

Table 8.6: Data statistics by domain. Conversation length is given as the average (median) number of turns per conversation.

Domain     Turn-level iaa      Sentence-level iaa
Airline    0.514/0.808/0.802   0.670/0.788/0.771
Fast Food  0.314/0.700/0.624   0.598/0.725/0.607
Finance    0.521/0.827/0.772   0.700/0.735/0.714
Insurance  0.521/0.862/0.848   0.703/0.821/0.826
Media      0.499/0.812/0.725   0.678/0.802/0.758
Software   0.508/0.748/0.745   0.709/0.764/0.698

Table 8.7: Inter-annotator agreement (iaa), measured with Fleiss' κ and reported as da/ic/sl, for the three annotation tasks: agent dialog acts (da), customer intent classes (ic), and slot labeling (sl).
Agent Instructions: Imagine you work at a bank. Customers may contact you about the following set of issues: checking account balances (checking or savings), transferring money between accounts, and closing accounts. GOAL: Answer the customer's question(s) and complete their request(s). For any request, you will need to collect at least the following information to be able to identify the customer: name, account PIN *or* last 4 digits of SSN. For giving information on balances, or for closing accounts, you will also need the last 4 digits of the account number. For transferring money, you will also need: the last 4 digits of the account to move from, the last 4 digits of the account to move to, and the sum of money to be transferred. Your customer may ask you to do only one thing; that's okay, but make sure you confirm you achieved everything the customer wanted before completing the conversation. Don't forget to signal the end of the conversation (see General guidelines).

Figure 8.2: Agents are provided with explicit fulfillment instructions. These are quick-reference instructions for the Finance domain. Agents serve as one level of quality control by rating each conversation on a scale from Excellent to Unusable.

8.5 Dialog Classification Baselines

We pre-process, create dataset splits, and evaluate the performance of three baseline models for each domain on MultiDoGO.

Pre-processing: We pre-process the corpus of dialogs for each domain to remove duplicate conversations and utterances with inconsistent annotations. The most common source of inconsistent annotations in our dataset is imprecise selection of slot-label spans by annotators, which results in sub-token slot labels. While much of this inconsistent data could likely be recovered by mapping each character span to the nearest token span, we drop these utterances to ensure these errors have no effect on our experimental results. Our post-processed data is pruned to approximately 90% of the original size. We form splits for each domain at the conversation level by randomly assigning 70% of conversations to train, 10% to development, and 20% to test. Conversation-level splits enable the application of contextual models to our dataset, as each conversation is assigned to a single split. However, our conversation-level splits result in imbalanced intent and slot label distributions.
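A conversation-level split like the one described above can be produced by shuffling conversation identifiers before assignment, so that no conversation is divided across splits. This is only a sketch under assumed conventions: the conversation identifiers are hypothetical, and the 70/10/20 ratios follow the description in the text.

import random
from typing import Dict, List


def split_conversations(conv_ids: List[str], seed: int = 0) -> Dict[str, List[str]]:
    """Assign whole conversations to train/dev/test with a 70/10/20 ratio."""
    rng = random.Random(seed)
    ids = sorted(conv_ids)
    rng.shuffle(ids)
    n = len(ids)
    n_train = int(0.7 * n)
    n_dev = int(0.1 * n)
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],
    }


splits = split_conversations([f"conv-{i}" for i in range(10)])
print({name: len(members) for name, members in splits.items()})  # {'train': 7, 'dev': 1, 'test': 2}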
Models: We evaluate the performance of two neural models on each domain. The first is a bi-directional lstm (Hochreiter and Schmidhuber, 1997) with GloVe word embeddings, a hidden state of size 512, and two fully connected output layers for slot labels and intent classes. The second model, elmo, resembles the lstm architecture but additionally uses pre-trained elmo embeddings (Peters et al., 2018), which are kept frozen during training (Section 2.3.4); we concatenate the elmo and GloVe embeddings. As a sanity check, we also include a most frequent class (mfc) baseline. The mfc baseline assigns the most frequent class label in the training split to every utterance u′ in the test split for both the da and ic tasks. To adapt the mfc baseline to sl, we compute the most frequent slot label mfc(w) for each word type w in the training set. Then, given a test utterance u′, we assign the pre-computed most frequent slot mfc(w′) to each word w′ ∈ u′ if w′ is present in the training set. If a given word w′ ∈ u′ is not present in the training set, we assign the "other" slot label, which denotes the absence of a slot, to w′. We use AllenNLP (Gardner et al., 2017) for models and metrics. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 to train the lstm and elmo models for 50 epochs, using batch sizes of 256 and 128. In addition, we use early stopping on the validation loss with a tolerance of 10 epochs to prevent over-fitting.

Evaluation Metrics: We report micro F1 score to evaluate da and ic. We use a span-based F1 score, implemented in the seqeval library, to evaluate sl.[12]

Footnote 12: https://github.com/chakki-works/seqeval
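The mfc slot-labeling baseline described above reduces to a per-word-type lookup table. A minimal sketch follows; the use of "O" as the "other" (no-slot) tag and the lowercasing of word types are assumptions made for illustration.

from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train_mfc_slots(tagged_sentences: List[List[Tuple[str, str]]]) -> Dict[str, str]:
    """Map each training word type to its most frequent slot label."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, slot in sentence:
            counts[word.lower()][slot] += 1
    return {word: slot_counts.most_common(1)[0][0]
            for word, slot_counts in counts.items()}


def predict_mfc_slots(words: List[str], mfc: Dict[str, str],
                      other_label: str = "O") -> List[str]:
    """Assign the memorized label to seen words and the no-slot label otherwise."""
    return [mfc.get(word.lower(), other_label) for word in words]


train = [[("book", "O"), ("a", "O"), ("flight", "O"), ("to", "O"), ("austin", "city")]]
mfc = train_mfc_slots(train)
print(predict_mfc_slots(["Fly", "to", "Austin"], mfc))  # ['O', 'O', 'city']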
8.5.1 Results

da/ic/sl Results: Table 8.8 presents the mfc, lstm, and elmo results for each domain on the subset of 15,000 conversations annotated at both the turn and sentence levels. lstm and elmo outperform mfc across all domains at both the turn and the sentence level. elmo obtains a modest increase in ic accuracy of 0.41 to 2.20 F1 points and a significant increase in sl F1 score on all domains over the lstm baseline. Concretely, elmo boosts sl F1 performance by 3.16 to 13.17 F1 points. We see the biggest sl gains on the Insurance domain, where sentence-level elmo has a 13.17-point F1 gain and turn-level elmo has a 12.67-point F1 gain. elmo increases sentence- and turn-level sl F1 scores by 12.38 and 9.86 F1 points for the Airline domain. lstm and elmo yield similar F1 scores on da classification; the difference in performance between these models is within one F1 point across all domains. The Fast Food domain yields the overall lowest absolute F1 scores. Recall that Fast Food had the most diverse dialogs (biases), as per Table 8.4, and the lowest iaa, as per Table 8.7.

                   Airline              Fast Food            Finance
Model  Annot    DA     IC     SL     DA     IC     SL     DA     IC     SL
mfc    S      60.57  33.69  38.71  57.14  25.42  61.92  51.73  37.37  34.07
lstm   S      97.20  90.84  74.16  90.40  86.09  72.93  93.90  90.06  69.09
elmo   S      97.32  91.88  86.55  91.03  87.95  77.51  94.07  91.15  77.36
mfc    T      33.04  32.79  37.73  33.07  25.33  61.84  36.52  38.16  34.31
lstm   T      84.25  89.15  75.78  66.41  87.35  73.57  76.19  92.30  70.92
elmo   T      84.04  89.99  85.64  65.69  88.96  79.63  76.29  94.50  79.47

                   Insurance            Media                Software
Model  Annot    DA     IC     SL     DA     IC     SL     DA     IC     SL
mfc    S      56.87  38.37  53.75  57.02  30.42  82.06  58.14  33.32  53.96
lstm   S      94.73  93.30  75.27  94.27  92.35  90.84  93.22  90.95  69.48
elmo   S      94.63  94.27  88.45  94.27  93.32  93.99  93.66  92.25  76.04
mfc    T      36.39  39.42  54.66  29.90  31.82  78.83  36.79  33.78  54.84
lstm   T      75.37  94.75  76.84  77.94  94.35  87.33  83.32  89.78  72.34
elmo   T      75.34  95.39  89.51  77.81  94.76  91.48  82.97  90.85  76.48

Table 8.8: Dialog act (da), intent class (ic), and slot labeling (sl) F1 scores by domain for the majority class (mfc), lstm, and elmo baselines on data annotated at the sentence (S) and turn (T) level. In the original, bold text denotes the model architecture with the best performance for a given annotation granularity (sentence or turn level), and red highlight denotes the model with the best performance on a given task across annotation granularities.

Sentence vs. Turn Level Annotation Units: Turn-level annotations increase the difficulty of the da classification task in our lstm and elmo results. This finding is evidenced by the da accuracy of our models on the Fast Food domain, for which the F1 score is up to 25 F1 points lower for turn-level annotations than for sentence-level annotations. We believe the increased difficulty of turn-level da is driven by a corresponding increase in the ambiguity of turn-level dialog acts. This assertion of greater turn-level da ambiguity is supported by the lower inter-annotator agreement (iaa) scores on turn-level da, which range from 0.314 to 0.521, compared to the iaa scores for sentence-level da, which range from 0.598 to 0.709. This experimental result highlights the importance of collecting sentence-level annotations for conversational da datasets. Somewhat surprisingly, our models have similar ic F1 and sl F1 scores on turn- and sentence-level annotations. We posit that the choice of annotation unit has a lesser impact on the ic and sl tasks because customer utterances are more likely to focus on a single speech act, whereas agent utterances may be more complex in comparison and include a greater number of speech acts.

Joint Training on Agent da: Agent da classification naturally lends itself to joint training, given that agent das are shared among all domains. To explore the benefits of multi-domain training, we jointly train an agent da classification model on all domains and report test results for each domain separately. These results are provided in Table 8.9. This straightforward technique leads to a consistent but less-than-one-point improvement in F1 scores. We expect that more sophisticated transfer learning methods (Liu et al., 2017; Howard and Ruder, 2018) could generate larger improvements for these domains. Overall, there is room for improvement, especially for the sl task, across all domains. Consequently, MultiDoGO should be a relevant benchmark for developing new state-of-the-art nlu models for the foreseeable future.

        Airline         Fast Food       Finance         Insurance       Media           Software
Annot   Single  Joint   Single  Joint   Single  Joint   Single  Joint   Single  Joint   Single  Joint
S        97.32  97.44    91.03  91.26    94.07  94.27    94.63  94.99    94.27  94.47    93.66  94.00
T        84.04  84.64    65.69  65.35    76.29  75.68    75.34  75.89    77.81  78.56    82.97  83.76

Table 8.9: Joint training of elmo on all agent da data leads to a slight increase in test performance. However, we expect stronger joint models that use transfer learning would see a larger improvement. In the original, bold text denotes the training strategy (single-domain or multi-domain joint) with the best performance for a given annotation granularity, and red highlight denotes the strategy with the highest da F1 score across annotation granularities.

8.6 Conclusion

We present MultiDoGO, a new Wizard-of-Oz dialog dataset that is the largest human-generated, multi-domain corpus of conversations to date. The scale and range of this data provide a test-bed for future work in joint training and transfer learning. Moreover, our comparison of sentence- and turn-level annotations provides insight into the effect of annotation granularity on downstream model performance.

The data collection and annotation methodology used to gather MultiDoGO can efficiently scale across languages. Several pilot experiments aimed at collecting Spanish dialogs in the same domains have shown preliminary success in quality assessment. The production of an nlu dataset with parallel data in multiple languages would be a boon to the cross-lingual research community. To date, cross-lingual nlu research (Upadhyay et al., 2018; Schuster et al., 2018) has relied on much smaller parallel corpora.

By pairing crowd-sourced labor (Chapter 5) with experts (Chapters 6 and 7), we ensure quality and diversity in the generated conversations while scaling to multiple domains and tasks. We show that, by adopting a modular annotation strategy, the crowd can reliably annotate dialogs at a level commensurate with trained professional annotators. Without the expert, our data would be just as large, but it could not be trusted. There is a stark difference in the quality of the generated language between the crowd-sourced workers and the experts, in this case in-house customer service agents. The crowd-sourced workers have a financial incentive to complete the task as quickly as possible and contribute sentences that are occasionally prosaic, ungrammatical, or repeated.[13] In our case, these incentives mimic those of a typical customer and do not undermine the realism of the conversation. But should datasets be large, or should they be accurate, in future work where these incentives are not desirable?

Footnote 13: We pay workers for completing a full valid conversation; conversely, our agents converse with the workers as long as necessary and are not paid per conversation.
We conclude with areas for future work that must balance quantity and accuracy to be successful (Chapter 9).

Chapter 9: Conclusions on Natural Language Processing Data

In this thesis, we create natural language processing datasets using three types of users: crowd workers, experts, and a hybrid combination of the two. We argue that improving data quality with reliable data generators and annotators is paramount for establishing new nlp tasks. As examples, we propose a new task, cultural adaptation, that uses verified cultural experts for the creation of gold labels (Chapter 6). Additionally, we introduce a novel self-annotated deception dataset by working with top players from the Diplomacy community (Chapter 7). Last, we create the largest goal-oriented dialogue dataset by pairing Amazon customer support associates with crowd workers (Chapter 8). These datasets could not be found or crowd-sourced.[1]

Footnote 1: Or, at least, they would not be comparable in quality if they were.

Several projects show the limitations of creating large datasets in those ways. Using text-to-speech to automatically generate questions scales at the expense of diversity and realism in the data (Chapter 3). Using an expert to design, but not generate, a formulaic dataset for assessing coreference resolution creates unlikely phrases (Chapter 4). Using the crowd to generate question rewrites can increase the amount of training data for question answering, but requires extensive quality control (Chapter 5).

Two independent directions for future work both use experts to create new datasets. First, Diplomacy2.0 extends our work on Diplomacy (Section 9.1). Second, the World Trade Organization and the Federal Reserve are two large organizations whose data can be annotated by experts to create legal corpora (Section 9.2). We conclude with a rationalization for upfront investment in data (Section 9.3).

9.1 Hybridization of Diplomacy: Diplomacy2.0

We generate and annotate Diplomacy data to study deception (Chapter 7). We can improve this existing dataset through further annotation of dialog acts (Section 9.1.1). This level of annotation would allow us to build a bot that can communicate in Diplomacy. We want to merge communication with actions and need game experts to map these dialog acts to game moves (Section 9.1.2). Further studies with the Diplomacy community will create a new dataset, Diplomacy2.0, to study strategic interactions between computers and humans.

9.1.1 Data for Communication

In our initial Diplomacy work, players use deception for multiple purposes: lying to convey plans and forge alliances ("Let's team up to take out Germany"); lying about past actions ("My computer was acting up... I didn't make the move I wanted"); and lying to build relationships ("you live in Maine? I went to a boy scout camp there!"). We want to formally identify and annotate these categories because we want our bot to use rhetorical strategy: empathy (Sedoc et al., 2020), emotional intensity (Mohammad, 2018), and hedging (Islam et al., 2020). Creating an ontology of dialog acts is typical for new domains, such as telephone conversations (Stolcke et al., 2000), scientific articles (Teufel and Moens, 2002), or political speeches (Thomas et al., 2006). We will need Diplomacy experts to identify this new ontology.
Since we are interested in general dialog acts, rather than capturing the nuance of Diplomacy, we can use the crowd for the actual annotation.[2] This will thus be a hybrid approach of using experts for design and the crowd for scale, akin to Chapter 8.

Footnote 2: We may have to source the crowd from Diplomacy players if generalist annotation proves inaccurate.

We can build a bot that devises a game strategy and communicates its intentions to other players through these dialog acts. Communications assume a player p creating a message directed at a recipient q. These messages can concern a third player r and can have one of three modes: declarative (I am asserting something is true), interrogative (I am querying your beliefs), and propositional (I am asking you to do something). Each message is parameterized by the mode m, actions a, and time t. In addition to machine-readable fields, each message allows for arbitrary free text; this can contain elaboration, hedging, or motivation. A bot will need training examples of modes to generate an effective message.

9.1.2 Data for Action

Most actions correspond to the orders that players can submit in a game of Diplomacy: moving a unit, supporting another unit, building a new unit. In addition to these actions, which are explicit in Diplomacy, we consider implicit actions in Diplomacy negotiations: an alliance between two players (pursue the same goals, support whenever possible), a non-aggression pact (no explicit cooperation but no attacks), and betrayal (break either an alliance or a non-aggression pact to hurt a player's position). Thus, an agent can pledge to do a future action p(a, t + 1, m = d) (here and below we use the first letter of the modes), ask the recipient to do something q(a, t + 1, m = p), communicate about a third player's intentions r(a, t, m = i), propose an alliance p(ally, q, m = p), ask whether a third player is allied with the recipient r(ally, q, m = i), propose a betrayal of player r at time t, and so on. After receiving a message, a recipient can confirm that the message is consistent with their knowledge, reject the message as inconsistent with their knowledge, ignore the message, or reply with a counter-offer.

Purely strategic data exists in other, larger datasets (Paquette et al., 2019; Bakhtin et al., 2021). While we can train from self-play, doing so would ignore previous games of Diplomacy (Niculae et al., 2015), tutorials on how to play Diplomacy (e.g., opening strategies for each of the countries),[3] and commentary on Diplomacy games.[4] We will use these corpora to bootstrap the strategy engine, as Bakhtin et al. (2021) show that Diplomacy systems trained from scratch do not converge to the same behavior as human players. Vetting the computer's moves and combining them with our past press will require Diplomacy experts, as automatic mapping would be noisy; for example, Turkey moving into the Black Sea may map to "my computer was acting up" rather than to a strategic message.

Footnote 3: https://diplomacyopenings.wordpress.com/

Footnote 4: https://youtu.be/b4GHbg5--Ag?t=138

With this training, our Diplomacy2.0 agent will construct a per-game knowledge base. Upon receiving a message, the agent will add it to the board state as an entry in the knowledge base, along with an annotation of whether the agent believes the message was true. For games with humans, we will also ask the sender to provide a ground-truth annotation of whether the message was true, for post hoc refinement of our bots.
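One hypothetical way to encode the message parameterization sketched above (sender p, recipient q, optional third player r, mode m, action a, time t, plus free text) is a small record type. The field names, enum values, and order notation in the example are illustrative, not a committed schema for Diplomacy2.0.

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Mode(Enum):
    DECLARATIVE = "d"      # I am asserting something is true
    INTERROGATIVE = "i"    # I am querying your beliefs
    PROPOSITIONAL = "p"    # I am asking you to do something


@dataclass
class Message:
    sender: str                  # player p
    recipient: str               # player q
    mode: Mode                   # m
    action: str                  # a, e.g., a unit order or "ally"
    time: int                    # t, the game turn the action refers to
    about: Optional[str] = None  # third player r, if the message concerns one
    free_text: str = ""          # elaboration, hedging, or motivation


# p pledges a future action to q: p(a, t + 1, m = d).
pledge = Message(sender="England", recipient="France", mode=Mode.DECLARATIVE,
                 action="A LON - BEL", time=2,
                 free_text="I will cover Belgium for you.")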
9.1.3 Evaluation Through Human Studies

The key challenge is to train this engine to work with other players, something not yet attempted in AI for Diplomacy. Given a game state, Diplomacy2.0 must produce coherent, useful messages to communicate with other players and must evaluate whether the messages the agent receives are truthful. As in past work (Chapters 6 and 7), there is no existing gold label to evaluate this communication. Hence, we will define the evaluation for this task.

While our overall goal is for Diplomacy2.0 to win games against other agents, we also want evaluations specific to its subcomponents, like the hypothetical coreference pipeline in Contracat (Chapter 4). These will include generating messages, correctly inferring opponents' stances, and cooperative play.

Generating Messages: To evaluate whether Diplomacy2.0 can generate messages, we will measure the precision and recall of generated messages, given a fixed board state, against the human-generated messages in our annotated corpus of dialog acts. While we do not assume that humans are optimal, this is a useful sanity check of our communications: if they are consistent with human gameplay, it suggests Diplomacy2.0 is generating messages that are relevant and consistent with the game state. We will also compare against the precision and recall for deception detection obtained with our previous techniques.

Correctly Inferring Opponent Stance: A key theoretical component of negotiation is successfully predicting what opponents want and will do. To evaluate whether Diplomacy2.0 can do this, we will predict (given a board state and messages annotated for dialog acts) what each of the players will do next. Given a game history, we can compute both standard precision and recall metrics of predicted actions compared to the true history and the mean reciprocal rank of the historical actions within a ranked list of predictions. This will show the bot's understanding of game actions.

Cooperative Play: To bridge between no-press Diplomacy and our communication-enabled setting, we will evaluate how well the bot can do without communication through self-play against copies of itself and against other bots incapable of speech. If the communicating bots are better able to coordinate and win against their foes than a mute bot, then our bot is able to communicate. Next, we will evaluate the complete Diplomacy2.0 system. Finally, we will evaluate AI agents in mixed environments with humans and AI agents. Here, the evaluation is more nuanced: our goal is not just whether agents can win games. AI agents must be able to cooperate with both humans and computers, correctly predict betrayals, and ultimately win the game.
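The mean reciprocal rank used in the stance-inference evaluation above is straightforward to compute from ranked action predictions. A minimal sketch, with hypothetical action names, follows; true actions missing from a ranked list contribute zero, a common convention that the final evaluation need not share.

from typing import List, Sequence


def mean_reciprocal_rank(ranked_predictions: Sequence[List[str]],
                         true_actions: Sequence[str]) -> float:
    """Average 1/rank of the true action within each ranked prediction list."""
    total = 0.0
    for ranking, truth in zip(ranked_predictions, true_actions):
        if truth in ranking:
            total += 1.0 / (ranking.index(truth) + 1)
    return total / len(true_actions)


rankings = [["hold", "move Burgundy", "support Munich"],
            ["move Burgundy", "hold", "support Munich"]]
truth = ["move Burgundy", "support Munich"]
print(mean_reciprocal_rank(rankings, truth))  # (1/2 + 1/3) / 2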
9.2 Understanding Organizations with Economic and Legal Experts

In addition to extending our past work with experts, we can use experts for new purposes. As one possibility, expert annotation can provide insights into organizational decision-making at scale by facilitating the analysis of deliberative processes. Major organizations often engage in formal dialogues prior to making decisions, and many of these discussions are recorded for posterity. Reading through dozens of years of proceedings and legal documents is onerous for a person; however, the task is trivial for a computer. We propose to apply nlp techniques to solving real-life problems in the public domain. Specifically, we will annotate and then quantitatively analyze publicly available historical text data from major organizations to identify patterns of decision-making. Two major organizations, the World Trade Organization and the Federal Reserve Board, can be analyzed with the same abstract methodology.

Data is integral to machine learning; models are only useful if the underlying training data is relevant to the desired task and correctly labeled. Regarding annotation in the social sciences, Text as Data (Grimmer and Stewart, 2013) reviews the strengths and limitations of computation for political science, noting that "ambiguities in language, limited attention of coders, and nuanced concepts make the reliable classification of documents difficult." This future work will require annotation to be consistent across projects (the terms and phrases used in our two organizations will differ, but we propose creating a shared abstract ontology that can be used for any organizational dialogue) and even across disciplines, ideally defining a gold standard for guideline creation and quality control in the process.

The required task, identifying ideas in dense legal documents, will be particularly challenging due to its nuance. Even defining what constitutes an idea is a challenge: an idea can be the subject of a conversation (Grimmer, 2010), a meme (Leskovec et al., 2009), a narrative (Oates, 2014), and so on. Therefore, defining a universal annotation schema, writing unambiguous guidelines, and finding skilled users to perform this annotation will be no small feat. Fortunately, the adaptation (Chapter 6) and Diplomacy (Chapter 7) datasets required experts and bespoke instructions, and MultiDoGO (Chapter 8) required a universal annotation schema. We will use this newly created representation of the data to answer domain-specific questions. The influence of ideas can be identified with nlp techniques (Zhang et al., 2016). We propose to formalize this task by, once again (Section 9.1.1), creating an ontology, this time for organizational decision-making. Then we will be able to analyze and understand our data as a sequence of ideas rather than mere text.

9.2.1 The World Trade Organization

World Trade Organization (wto) cases, meticulously documented in hundreds of pages, are a strong fit for this research question: they occur over long time periods, include arguments made by third-party countries, and are publicly available (Busch and Pelc, 2019). We will answer a concrete social science problem: can final case outcomes be predicted from the involved parties and their submitted arguments? This prediction has international economic implications, as trade disputes set precedents outside of the goods in question: a ruling on tires may affect trade in bananas through legal precedent.

Working with domain experts is necessary to solve this research question because of the complex legal language of the wto. For background, disputants and third parties submit opinions on legal precedents that should be considered in shaping the final judgment. The required expertise entails the legal education needed to identify key wto cases and relevant precedents in documents that can be over a hundred pages long. A successful annotation of these submitted opinions (both their literal references and their abstract ideas) in chronological legal documents would identify how trade decisions are made at the highest level of international governance and would create one of the largest datasets of legal data to date. This, in turn, could be used to set a gold-standard training dataset for legal nlp that can be used for predicting future case outcomes.
9.2.2 The Federal Reserve Board

The Federal Reserve Board decides monetary policy in the United States and provides public documentation of its proceedings for the past 40 years. Panel participants state their opinions at the beginning and end of each session and cast a vote. The social science question we aim to answer is whether the opinions of panel participants change during a session, in line with the common knowledge effect (Gigone and Hastie, 1993). Psychology research identifies priors as the principal driver of decision-making in a group, but it has never been able to verify the effect at scale. In addition to answering this question, the presence of a binary vote will allow us to study the language of dissent and conformity. We will see if majority voters are more likely to change their minds during the panel discussion than minority voters. The dense economic language of these panels will require expert annotators with a background in economics and finance.

9.3 Creating Timeless Natural Language Processing Datasets

The above two projects will require large-scale collaboration with the appropriate subject matter experts for extended periods of time. However, the datasets that have withstood the test of time in natural language processing were also painstakingly created and quality controlled. The Penn Treebank (Marcus et al., 1993) was collected and refined for years, using graduate students in linguistics as annotators. The annotation process had extensive experimental design, annotators underwent extensive training, and the data was evaluated for disagreements. That effort is why graduate students still learn about it today.

The granularity and quantity of nlp datasets continue to increase as machine learning expands to new languages and tasks. Quality control is usually an afterthought in a conference paper paradigm that rewards quantity. However, this mindset introduces room for error, potentially with real-life repercussions (Wallace et al., 2021). The importance of nlp to modern-day life in communication, information gathering, and commerce means that decisions made in an academic context can have wide-ranging implications. Authoritative, realistic, and diverse datasets are less likely to contain errors or artifacts, and more likely to be used in years to come, than larger datasets derived from Wikipedia or crowd-sourced knowledge.

Recent work questions conventional wisdom about data in nlp. Rodriguez et al. (2021) question the paradigm of using quantitative leaderboards in question answering, given the disparity of question difficulties. van der Goot (2021) questions the paradigm of using a development set for model tuning. Kummerfeld (2021) questions the qualification requirements for Mechanical Turk workers. Karpinska et al. (2021) question the output of Mechanical Turk workers for evaluation. Last, Pillutla et al. (2021) create a divergence metric to compare artificial and human language data. The common thread among these open questions is that they address data and not models. We show that working with domain experts and focusing on data quality can address complicated natural language processing challenges.

Appendix A: Adaptation

Our adaptation (Chapter 6) appendix contains our entire human-collected dataset and a sample of our computational approaches for adaptation. Table A.1 shows German-to-American Veale NOC items. Table A.2 shows American-to-German Veale NOC items. Table A.3 shows German-to-American top Wikipedia items.
Table A.4 shows American-to-German top Wikipedia items. Table A.5 shows our WikiData predictions, Table A.6 shows our 3CosAdd predictions, and Table A.7 shows our Learned Adaptations predictions. We also pose several background questions about Wikipedia and WikiData:

A.1 Wikipedia Q&A

Are the Wikipedia pages in German and English visited from the associated country? Yes; the Wikipedias for the respective languages are most used by visitors located in those countries: 63% of German Wikipedia traffic came from Germany and 32% of English Wikipedia traffic came from the United States in the past year.[1]

Footnote 1: https://stats.wikimedia.org/

Are the top Wikipedia topics notably different across languages? Yes; less than a quarter of the top 500 searches for 2019 are identical across English and German.

Does WikiData cover areas outside of the United States? Wikipedia coverage does not mean that WikiData annotations are conducted equally across German and American entities. Analyzing WikiData[2] reveals a discrepancy in coverage of Germans and Americans. Out of 8,126,559 titles, 1,030,762 include a reference to the United States in any capacity. However, only 184,692 contain a reference to (broader) Germany. This imbalance is significant, but there are enough German items for our methodology. As WikiData is a maintained resource, there is room for future additional coverage and standardization of fields. Countries use different names throughout history: while the United States of America is straightforward, Germany includes several variations, such as the German Empire, the Kingdom of Bavaria, and the Kingdom of Prussia. The WikiData feature-based approach can be used for other countries as well (or for anything else that is consistently coded). For example, there are 65,957 Russian, 152,701 French, and 48,026 Chinese items in WikiData.[3]

Footnote 2: We use a full 1.2-terabyte dump as of 10.26.20.

Footnote 3: Modern-day country names only.

Are the top Wikipedia topics necessarily tied to the culture? No; the top 10 most visited German Wikipedia pages include a cultural potpourri: Germany, Greta Thunberg, Asperger syndrome, Game of Thrones, and Freddie Mercury. While there are uniquely German entities in the longer list (ZDF, Capital Bra, The Cratez, Niki Lauda), we cannot conclude that all top entities in a language belong culturally to a given country. Therefore, we need a stricter methodology.

Where does one find entities? We rely on a human-sourced dataset: Veale's Non-Official Characterization list (Veale, 2016). This list contains 1,031 people, real and fictional, such as Daniel Day-Lewis, Anton Chekhov, and Bridget Jones. These people are annotated with properties, one of which is conveniently their address. There are 25 people with a German location and 575 with an American one. Removing fictional characters written by non-nationals leaves the German list with 20 entities. An American author filters the list of Americans down to 35 iconic ones with achievements that span politics, music, activism, athletics, and pop culture.

Wikipedia provides another avenue for gauging popular topics in a language. We manually filter the top 500 German/English Wikipedia topics to remove non-German/non-American entities; Game of Thrones and Unix-Shell are popular in the German Wikipedia, but they are not culturally idiosyncratic. For the 2019 German Wikipedia, we are left with roughly 200 items, which we further reduce down to 120 after putting a cap on pop culture entities.
For the American counterpart, over 300 items are culturally American. We add a three-year filter to remove pop items to make it comparable to the German one. 162 A.2 Data Entity Human Adaptation: NOC German?American Adolf Eichmann Andrew Jackson, Andrew Jackson, Franklin D. Roosevelt, Nathan Bedford Forrest, Steve Bannon Angela Merkel Barack Obama, Donald Trump, Hillary Clinton, Hillary Clinton, Hillary Clinton, Joe Biden Baron Munchausen Captain America, Daniel Bolger, Joseph Smith, Paul Bunyan, Robert Jordan , Yankee Doodle Carl von Clausewitz Alfred Thayer Mahan, Dwight D. Eisenhower, Henry Knox, Robert E. Lee, Ulysses S. Grant Friedrich Nietzsche Ayn Rand, Henry David Thoreau, Henry Thoreau, Jordan Peterson, William James Henry Kissinger Henry Kissinger, Henry Kissinger, John Kerry, Madeleine Albright, Richard Nixon Immanuel Kant Benjamin Franklin, John Dewey, John Locke, John Rawls, Robert Nozick Johann Sebastian Bach Aaron Copland, Elvis Presley, Elvis Presley, Irving Berlin, Johnny Cash, Scott Joplin Johann Wolfgang von Edgar Allan Poe, Ernest Hemingway, Walt Whit- Goethe man Johannes Gutenberg Benjamin Franklin, Bill Gates, Eli Whitney, Thomas Edison Joseph Goebbels David Duke, Franklin D. Roosevelt, George Rock- well, Rupert Murdoch, david duke Karl Lagerfeld Anna Wintour, Anna Wintour, Marc Jacobs, Ralph Lauren, Ralph Lauren, Ralph Lauren, Ralph Lauren Karl Marx Angela Davis, Beck, Bernie Sanders, John Jay, John Rawls, John Rawls Leni Riefenstahl DW Griffeth, David Wark Griffith, Frank Capra, Judy Garland Ludwig van Beethoven Aaron Copland, Aaron Copland, Aaron Copland, Elvis Presley, Frank Sinatra, George Gershwin, George Gershwin, Scott Joplin Marlene Dietrich Bette Davis, Clara Bow, Elizabeth Taylor, Marilyn Monroe, William Tecumseh Sherman Martin Luther Barry Goldwater, Brigham Young, Joseph Smith, Joseph Smith, Joseph Smith Otto von Bismarck Abraham Lincoln, George Washington, George Washington, Ulysses S. Grant Pope Benedict XVI Billy Graham, Billy Graham, Brigham Young, John Carroll , Se?n Patrick O?Malley Richard Wagner Charles Ives, Frank Sinatra, Leonard Bernstein, Philip Glass Table A.1: Veale NOC German?American adaptations. 163 Entity Human Adaptation: NOC American?German adaptations Abraham Lincoln Helmut Kohl, Konrad Adenauer, Wilhelm Friedrich Ludwig von Preu?en, Willy Brandt, Willy Brandt Al Capone Adolf Leib, Carlos Lehder-Rivas, Jan Marsalek, Nasser Abou-Chaker, Nasser About-Chaker Alfred Hitchcock Bernd Eichinger, Bernd Eichinger, Michael Bully Herbig, Roland Emmerich, Wim Wenders Benedict Arnold Hansjoachim Tiedge, Otto von Bismarck, Otto von Bismarck, Robert Blum Bill Gates Andreas von Bechtolsheim, Carl Benz, Dietmar Hopp, Konrad Zuse Britney Spears Helene Fischer, Herbert Gr?nemeyer, Jeanette Biedermann, Nena, Til Schweiger Charles Lindbergh Ferdinand von Richthofen, Heinrich Horstman, Karl Wilhelm Otto Lilienthal, Ludwig Hofmann, Wernher von Braun Donald Trump Adolf Hitler, Adolf Hitler, Carsten Maschmeyer, Christian Lindner Elvis Presley Peter Kraus, Rammstein, The Scorpions, Udo Lin- denberg, Udo Lindenberg Ernest Hemingway G?nter Grass, Hermann Hesse, Johann Wolfgang von Goethe, Karl May, Martin Walser Frank Lloyd Wright Gerhard Richter, Hugo H?ring, Karl Lagerfeld, Max Dudler, Walter Gropius George Washington Friedrich II, Heinrich I, Konrad Adenauer, Otto I. 
der Gro?e, Otto von Bismarck Henry Ford Carl Benz, Carl Benz, Carl Benz, Ferdinand Porsche, Gottlieb Wilhelm Daimler Hillary Clinton Angela Merkel, Angela Merkel, Angela Merkel, Kramp-Karrenbauer, Sahra Wagenknecht Homer Simpson Alf, Heidi, Pumuckl, Werner, Werner - Beinhart! Jack The Ripper Armin Meiwes, Der Bulle von T?lz, Joachim Kroll, Karl Denke, Rudolf Pleil Jay Z Capital Bra, Marteria, Sido, Sido, Sido Jimi Hendrix Bela B., Gisbert zu Knyphausen, Herbert Gr?ne- meyer, Rudolf Schenker, Spider Murphy Gang John F. Kennedy Hanns Martin Schleyer, Willy Brandt, Willy Brandt, Wolfgang Sch?uble Kim Kardashian Carmen Geiss, Gina-Lisa Lohfink, Heidi Klum, Heidi Klum, Sarah Connor Louis Armstrong G?nter Sommer, Helmut Brandt, Jan Delay, Michael Abene, Mozart Marilyn Monroe Heidi Klum, Ingrid Steeger, Marlene Dietrich, Mi- caela Sch?fer, Uschi Glas Michael Jordan Dirk Nowitzki, Dirk Nowitzki, Dirk Nowitzki, Franz Beckenbauer, Michael Schuhmacher 164 Neil Armstrong Alexander Gerst, Sigmund J?hn, Sigmund J?hn, Ulf Merbold, Wernher von Braun Noam Chomsky Helmut Gl?ck, Juergen Habermas, J?rgen Haber- mas, Ludwig Wittgenstein, Wilhelm R?ntgen Oprah Winfrey Anne Will, Arabella Kiesbauer, Maybrit Illner, Thomas Gottschalk, Thomas Gottschalk Orville Wright Carl Benz, Gustav Otto, Gustav Wei?kopf, Otto Lilienthal, Wernher von Braun Richard Nixon Franz Josef Strauss, Helmut Kohl, Ludwig Erhard, Ludwig Erhard, Richard von Weizs?cker Rosa Parks Anne Wizorek, Marie Juchacz, Sophie Scholl, So- phie Scholl, Vera Lengsfeld Serena Williams Andrea Petkovic, Boris Becker, Sabine Lisicki, Steffi Graf, boris becker Steve Jobs Carl Benz, Dietmar Hopp, Dietmar Hopp, Karl Lagerfeld Steven Spielberg Michael Bully Herbig, Roland Emmerich, Roland Emmerich, Roland Emmerich, Wim Wenders Superman Bibi Blocksberg, Fix and Foxi, Maverick, Super- man, Till Eulenspiegel Tiger Woods Boris Becker, Martin Kaymer, Martin Kaymer, Michael Schumacher, Serge Gnabry Walt Disney Axel Springer, Christian Becker, Franz Mack, Ger- hard Hahn, R?tger Feldmann Table A.2: Veale NOC American?German adaptations. 165 Entity Human Adaptation: Wikipedia German?American ARD NPR, PBS, PBS Adolf Hitler Donald Trump, Donald Trump, Franklin D. Roo- sevelt, Franklin D. Roosevelt, Franklin D. Roo- sevelt Airbus Boeing, Boeing, Boeing, Boeing, Lockheed Martin Albert Einstein Carl Sagan, J. Robert Oppenheimer, J. Robert Oppenheimer, John Forbes Nash Jr., Thomas Edi- son Alice Merton Ariana Grande, Elle King, K.T. Tunstall, P!NK, Vanessa Carlton Alternative f?r Deutschland Libertarian Party , Republican Party, Tea Party movement Andrea Nahles Elizabeth Warren, Hillary Clinton, Nancy Pelosi, Tammy Duckworth Andrej Mangold Kawhi Leonard, Kevin Durant, Kris Humphries, Yao Ming Annalena Baerbock Al Gore, Al Gore, Alexandria Ocasio-Cortez, Bernie Sanders, Jill Stein Anne Frank Anna Green Winslow, Clara Barton, Emmett Till, Kunta Kinte Annegret Kramp- Condoleezza Rice, Hillary Clinton Karrenbauer AnnenMayKantereit Guns N? 
Roses, Milky Chance, Polar Bear Club, Red Hot Chili Peppers Apache 207 Fetty Wap, Tekashi 69, XXXTentacion, Zayn Ma- lik Arnold Schwarzenegger Chuck Norris, Dwayne Johnson, Ronnie Coleman, Sylvester Stallone, Sylvester Stallone BMW Cadillac, Cadillac, Chevrolet, Chrysler Babylon Berlin Game of Thrones, Man From U.N.C.L.E., Peaky Blinders , The Americans, Turn Baden-W?rttemberg California, Chicago metropolitan area, San Diego, Southern United States, Texas Bastian Yotta Chad Johnson, Colton Underwood, Dan Bilzerian Bauhaus Frank Lloyd Wright Bayerischer Rundfunk NPR, National Public Radio, National Public Ra- dio, national public ra Bayern Florida, New York, The Confederacy Benjamin Piwko Bruce Lee, Colton Underwood, Derek Hough Berlin New York City, Portland Oregon, Washington D.C., Washington D.C., Washington D.C. Berliner Mauer Border Patrol Police, Mason?Dixon line, Ma- son?Dixon line, US-Mexican border Bertolt Brecht Tennessee Williams, Tennessee Williams Bj?rn H?cke Lindsey Graham, Mike Pence Borussia Dortmund Golden State Warriors, New England Patriots, New England Patriots 166 Brandenburg Maryland, New York, Northeastern United States, Richmond Virginia, Virginia Bruno Ganz Clint Eastwood, Ethan Hawke, Marlon Brando, Robert De Niro, Robert De Niro Bundespr?sident First Lady, President of the United States, Speaker of the House Bundeswehr Department of Defense , US military, United States Armed Forces, United States Army Capital Bra Drake, Eminem, Eminem, Kanye West, Kendrick Lamar Carola Rackete American Civil Liberties Union, Dawn Wooten, Rosa Parks, Whale Wars Carolin Kebekus Amy Schumer, Sarah Silverman, Tina Fey, Tina Fey Charit? Call the Midwife, Grey?s Anatomy, Grey?s Anatomy, The Queen?s Gambit Chris T?pperwien Gordon Ramsey , Guy Fieri, Jeff Probst Christoph Waltz Anthony Hopkins, Christoph Waltz, Denzel Wash- ington Dark Stranger Things, Stranger Things Deutsche Bahn Amtrack, Norfolk Southern Railway, Union Pacific Corporation Deutsche Demokratische Confederate States of America, Confederate States Republik of America, Texas, The Confederacy, The Confed- erate States of America Deutsche Nationalhymne Born in the U.S.A., Lazy Eye , Star Spangled Ban- ner, The Star Spangled Banner Deutschland America, America, Continental United States, USA, United States, United States Dieter Bohlen Billy Joel, Blake Shelton, Daryl Hall, Paula Abdul, Ryan Seacrest Dirk Nowitzki LeBron James, Michael Jordan, Shaquille O?Neal Doreen Dietel Jessica Alba, Lisa Kudrow, Warrick Brown Drei?igj?hriger Krieg American Civil War, American Civil War, Ameri- can Indian Wars, Civil war Elisabeth von ?sterreich- Edith Roosevelt, Hillary Clinton, Jackie Kennedy Ungarn Elyas M?Barek Adam Sandler, Adam Sandler, Chris Pine Europawahl in Deutschland 2018 United States elections, American presiden- 2019 tial election 2020, Us election 2018 Europ?isches Parlament North Atlantic Council, Representative of the United States of America to the European Union, United Nations, United States Congress Evelyn Burdecki Hannah Brown, Kaitlyn Bristowe, Kim Kar- dashian, Kim Kardashian FC Bayern M?nchen Dallas Cowboys, Dc United, New York Yankees, New York Yankees, New York Yankees Falco David Bowie, Frederick William Schneider III, MC Hammer, Michael Jackson 167 Ferdinand Sauerbruch Ben Carson, Ben Carson, Cornelius P. Rhoads, Jonas Salk, Virginia Apgar Flughafen Berlin Branden- Cincinnati Subway, DCA , John F. 
Kennedy In- burg ternational Airport, LaGuardia Airport Frankfurt am Main Chicago, Los Angeles, Los Angeles, New York City, Washington D.C. Fritz Honka Ted Bundy, Ted Bundy, Ted Bundy, Zodiac Hamburg Chicago, Chicago, Los Angeles, New York, Philadelphia Hannelore Elsner Elizabeth Taylor, Jane Lynch, Julia Roberts Heidi Klum Chrissy Teigen, Cindy Crawford, Gigi Hadid, Kar- lie Kloss, Tyra Banks Heinz-Christian Strache Anthony Weiner, Ben Carson, Donald J. Trump, Rob Ford, Roger Stone Helene Fischer Beyonc?, Kelly Clarkson, Taylor Swift, Taylor Swift Hessen Arizona, Illinois, Mid-Atlantic , Napa County Cal- ifornia Holocaust Chattel Slavery, Japanese interned in American camps, Slavery in the United States Ich bin ein Star ? Holt mich Survivor, Survivor hier raus! J?rgen Klopp Bill Belichick, Bill Belichick, John Wooden Kevin K?hnert Bernie Sanders, Bernie Sanders, Bernie Sanders, Pete Buttigieg Klaus Kinski Christopher Lee, Clark Gable, John Wayne, Robert Pattinson, Robert Pattinson Kontra K 50 Cent, Eminem, Eminem, Jesus Is King, Travis Scott K?ln Boston, Chicago, Chicago, Houston Leila Lowfire Paris Hilton, Sasha Grey, Zendaya Leipzig Denver, Detroit, Miami, San Diego Lena Meyer-Landrut Ariana Grande, Kelly Clarkson, Kelly Clarkson, Meghan Trainor, Selena Gomez Liechtenstein Connecticut, Mexico, Philippines, Victoria British Columbia Lisa Martinek Julie Benz, Katherine Heigl, Mandy Moore, Meryl Streep Ludwig van Beethoven Aaron Copland, Aaron Copland, Aaron Copland, Aaron Copland, Elvis Presley, Frank Sinatra, George Gershwin, George Gershwin, Scott Joplin Lufthansa Delta, United, United Airlines, United Airlines Luxemburg Canada, Connecticut, Mexico, Victoria British Columbia Mark Forster Bruno Mars, Post Malone Mero DaBaby, Fetty Wap, Lil Nas X, Lil Nas X, Post Malone Michael Schumacher Dale Earnhardt, Dale Earnhardt, James Gordon, Jeff Gordon, Tiger Woods 168 M?nchen Chicago, Los Angeles, New York City, New York City, Washington D.C. 
Nico Santos Harry Styles, Justin Bieber, Shawn Mendes Niki Lauda Dale Earnhardt, Dale Earnhardt Jr., Jeff Gordon, Jeff Gordon, Tiger Woods Norddeutscher Rundfunk NPR, NPR, National Public Radio, PBS, Sirius XM Nordrhein-Westfalen California, California Philipp Amthor Alexandria Ocasio-Cortez, Ben Shapiro RAF Camora Bad Bunny, Drake, Drake , Eminem, Future Rammstein Green Day, Metallica, Metallica, Metallica, Sum 41 Rhein Mississippi, Mississippi River, Mississippi River Robert Habeck Al Gore, Bernie Sanders, Jill Stein, Ralph Nader Rudi Assauer Dave Roberts, Gregg Berhalter, Tom Flores, Vince Lombardi, Vince Lombardi Sahra Wagenknecht Alexandria Ocasio-Cortez, Elizabeth Warren, Eliz- abeth Warren, Elizabeth Warren, Nancy Pelosi Sarah Connor Beyonc?, Britney Spears, Mariah Carey Schweiz Canada, Canada, Iowa, Mexico, United States Sebastian Kurz Alexandria Ocasio-Cortez, Greg Abbott, Justin Trudeau, Justin Trudeau, Mitch McConnell Serge Gnabry Clint Dempsey, JuJu Smith-Schuster, Phillip Rivers, Stephen Curry, Zion Williamson Sido Eminem, Eminem, Macklemore The Cratez DJ Khaled, Drake , Twenty One Pilots Th?ringen Iowa, Midwestern United States, Tennessee, Ten- nessee Till Lindemann James Hetfield, James Hetfield, James Hetfield, Ozzy Osbourne Tom Kaulitz Adam Levine, Blink-182, Chris Martin, Green Day, Maroon 5 UEFA Champions League Major League Soccer, NFC, NFL, National Foot- ball League, Ncaa Udo J?rgens Aretha Franklin, Billy Joel, Elton John, Michael Jackson, Rolling Stone, Tom Lehrer Udo Lindenberg Johnny Cash, Mick Jagger, Roger Taylor , Travis Barker Ursula von der Leyen Condoleezza Rice, Hillary Clinton, Mike Pence, Sarah Palin, Susan Rice Volkswagen AG Ford Motor Company, Ford Motor Company, Ford Motor Company, Ford Motor Company, Ford Mo- tor Company Walter L?bcke Harvey Milk, John F. Kennedy, John Roll, Steve Scalise Weimarer Republik America, Confederation Period, Congress of the Confederation, Counterculture of the 1960s, The Confederate States of America Westdeutscher Rundfunk ABC News, NBC, NPR K?ln 169 Wien Austin Texas, Richmond Virginia, Toronto, Wash- ington D.C. Wilhelm II. William Howard Taft, Woodrow Wilson, Woodrow Wilson Wolfgang Amadeus Mozart Alan Menken, Elvis Presley, Leonard Bernstein ZDF NPR, NPR, National Public Radio, PBS, PBS ?sterreich Canada, Mexico, Texas, Texas, United States ?tzi Spirit Cave mummy, Spirit Cave mummy, Spirit Cave mummy, Sue Table A.3: Top Wikipedia German?American adapta- tions. 170 Entity Human Adaptation: Wikipedia American?German 13 Reasons Why Club der roten B?nder, Gute Zeiten schlechte Zeiten, Lammbock, T?rkisch f?r Anf?nger Albert Einstein Albert Einstein, Albert Einstein, Albert Einstein, Max Planck, Max Planck Alexander Hamilton Konrad Adenauer, Max Weber, Otto von Bis- marck, Otto von Bismarck American Civil War Deutscher Krieg, Drei?igj?hriger Krieg, German Revolution of 1918?1919, German revolutions of 1848?1849 American Horror Story Dark, Der goldene Handschuh, Good Bye Lenin!, Tintenherz Angelina Jolie Barbara Sch?neberger, Franka Potente, Marlene Dietrich, Romy Schneider, Veronica Maria C?cilia Ferres Apple Inc. 
BMW, Fujitsu, SAP, Siemens Ariana Grande Lena Meyer-Landrut, Lena Meyer-Landrut, Lena Meyer-Landrut, Sarah Connor, Sarah Connor Arnold Schwarzenegger Arnold Schwarzenegger, Karl Lauterbach, Matthias Steiner, Peter Maffay, Ralf Rudolf M?ller Ashton Kutcher Florian David Fitz, Matthias Schweigh?fer, Til Schweiger, Til Schweiger Australia Australia, Russia, Schweiz, South Africa, ?sterre- ich Avengers Infinity War Das Arche Noah Prinzip, Fack ju G?hte, Fantastic Four, Who Am I Barack Obama Angela Merkel, Angela Merkel, Angela Merkel, Helmut Schmidt, Helmut Schmidt Beyonc? Helene Fischer, Sarah Connor, Veronica Ferres, Xavier Naidoo, Yvonne Catterfeld Black Mirror Dark, Dark, Die kommenden Tage, Krabat Blake Lively Josefine Preu?, Maria Furtw?ngler, Maria Furtw?ngler, Til Schweiger Brad Pitt Florian David Fitz, Frederick Lau, Til Schweiger, Til Schweiger, Til Schweiger Bruce Lee G?tz Georg, Henry Maske, Julian Jacobi, Max Schmeling, no one is like Bruce Lee Caitlyn Jenner Kristin Otto, Magdalena Neuner, Magdalena Ne- uner, Niklas Kaul, Ulrike Meyfarth California Bavaria, Bavaria, Bayern, Bayern Camila Cabello Helene Fischer, Lena Meyer-Landrut, Lena Meyer- Landrut, Nadja Benaissa Canada Austria, Italy, Schweiz, Sweden, ?sterreich Cardi B Ace Tee, Pamela Reif, Sabrina Setlur, Sarah Con- nor, Schwester Ewa Charles Manson Andreas Baader, Issa Rammo, Papst benedikt xvi, Paul Sch?fer 171 Charlize Theron Baran bo Odar, Josefine Preu?, Josefine Preu?, Veronica Ferres, Veronica Maria C?cilia Ferres Cher Marlene Dietrich, Nena, Nena, Nena Chris Pratt Elyas M?Barek, Jan Josef Liefers, Matthias Schweigh?fer, Ralf Moeller, Til Schweiger Clint Eastwood Heinz Erhardt, Klaus Kinski, Mario Adorf, Til Schweiger, Wim Wenders Darth Vader Adolf Hitler, Belzebub, Hagen von Tronje, Jens Maul Donald Glover Elyas M?Barek, Helge Schneider, Money Boy, Ste- fan Raab Drake Bushido, Cro, Falco, Fler Dwayne Johnson Alexander Wolfe, Arnold Schwarzenegger, Peter Alexander, Tim Wiese, Tim Wiese Elon Musk Alexander Samwer, August Horch, Carl Benz, Her- bert Diess, Werner von Siemens Eminem Bushido, Kollegah, Sido, Sido, Sido Facebook Das Erste, Lokalisten, Lokalisten, Sch?ler VZ, Stu- diVZ, StudiVZ Friends Gute Zeiten schlechte Zeiten, Gzsz, Lindenstra?e, Stromberg Game of Thrones Babylon Berlin, Babylon Berlin, Babylon Berlin, Die unendliche Geschichte, Krabat Google Ecosia, Fastbot, SAP, SAP, i.d.k. Harry Potter Die Unendliche Geschichte, Die unendliche Geschichte, Harry Potter und ein Stein, Meggie Folchart Heath Ledger Christoph Waltz, Florian David Fitz, Henry Blanke, Matthias Schweigh?fer, Tilman Valentin Schweiger It Dark, Der goldene Handschuh, Die Wolke, Pando- rum Jason Momoa Arnold Schwarzenegger, Benno F?rmann, Christoph Waltz, Elyas M?Barek, Elyas M?Barek, Elyas M?Barek Jeff Bezos Alexander Samwer, Beate Heister, Martin Win- terkorn, Oliver Samwer Jeffrey Dahmer Armin Meiwes, Fritz Haarmann, Joachim Kroll, Karl Denke, Karl Denke Jennifer Aniston Barbara Sch?neberger, Diane Kruger, Diane Kruger, Franka Potente, Iris Berben Jennifer Lawrence Iris Berben, Josefine Preu?, Karoline Herfurth, Ruby O. 
Jennifer Lopez | Heidi Klum, Helene Fischer, Jeanette Biedermann, Mandy Capristo, Sarah Connor
John Cena | Arnold Schwarzenegger, Max Schmeling, Max Schmeling, Ralf Möller
Johnny Cash | Fantastischen vier, Helge Schneider, Peter Maffay, Peter Maffay
Johnny Depp | Christoph Maria Herbst, Christoph Waltz, Cro, Til Schweiger, Xavier Naidoo
Julia Roberts | Karoline Herfurth, Maria Furtwängler, Marlene Dietrich, Marlene Dietrich
Justin Bieber | Cro, Felix Jaehn, Lukas Rieger, McFittie, Mike Singer
Keanu Reeves | Daniel Brühl, Mario Adorf, Til Schweiger, til schweiger
Kylie Jenner | Barbara Schöneberger, Heidi Klum, Karoline Einhoff, Sarah Connor, Stefanie Giesinger
Lady Gaga | Helene Fischer, Nena, Nena, Nina Hagen, Sarah Lombardi
LeBron James | Dirk Nowitzki, Dirk Nowitzki, Dirk Nowitzki, Dirk Nowitzki, Toni Kroos
Leonardo DiCaprio | Matthias Schweighöfer, Moritz Bleibtreu, Til Schweiger, Til Schweiger, Til Schweiger
Lisa Bonet | Franka Potente, Iris Berben, Karoline Herfurth, Maria Furtwängler
Madonna | Blümchen, Helene Fischer, Helene Fischer, Helene Fischer, Sarah Connor
Mark Wahlberg | Florian David Fitz, Til Schweiger, Tilman Valentin Schweiger, Alexei Alexejewitsch
Martin Luther King Jr. | Hans Scholl, Hans Scholl, Helmut Palmer, Robert Blum, Sophie Scholl
Marvel Cinematic Universe | Bavaria Film, Havelstudios, Phantásien, Rat Pack Filmproduktion, Tatort
Michael Jackson | Herbert Grönemeyer, Nena, Udo Jürgens, Xavier Naidoo, Xavier Naidoo
Mila Kunis | Josefine Preuß, Matthias Schweighöfer, Vanessa Mai
Miley Cyrus | Lena Meyer-Landrut, Lukas Rieger, Nena, Sarah Connor, Yvonne Catterfeld
Muhammad Ali | Alexander Abraham, Boris Becker, Max Schmeling, Max Schmeling, Sven Ottke
Natalie Portman | Barbara Schöneberger, Diane Kruger, Franka Potente, Iris Berben
New York City | Berlin, Berlin, Berlin, Berlin, Frankfurt
Nicole Kidman | Evelyn Hamann, Franka Potente, Senta Berger, iris berben
Peaky Blinders | Dark, Dieter Schwarz, Im Westen Nichts Neues, Tatort, Tatort
Philippines | Greece, Griechenland, Mallorca, Mallorca
Post Malone | Bushido, Bushido, Cro, Cro, Kollegah
Rihanna | Helene Fischer, Lena Meyer-Landrut, Lena Meyer-Landrut, Nena
Riverdale | Babylon Berlin, Berlin Tag und Nacht, Neues vom Süderhof, Türkisch für Anfänger
Robert Downey Jr. | Christoph Waltz, Günter Strack, Martin Semmelrogge, Moritz Bleibtreu, Til Schweiger
Robin Williams | Hape Kerkeling, Heinz Erhardt, Peter Maffay, Silvia Seidel, Tim Bendzko
Ronald Reagan | Helmut Schmidt, Konrad Adenauer, Konrad Adenauer, Konrad Adenauer
Ryan Reynolds | Daniel Brühl, Florian David Fitz, Matthias Schweighöfer, Til Schweiger, Til Schweiger
Scarlett Johansson | Lena Gercke, Romy Schneider, Sarah Connor, Sarah Connor, Veronica Ferres
Selena Gomez | Lena Meyer-Landrut, Lena Meyer-Landrut, Nena, Nora Tschirner
September 11 attacks | Anschlag im OEZ, Dresden Bombing, Mauerfall, RAF-Attentate, Terroranschlag Olympia 1972
Shaquille O'Neal | Dirk Nowitzki, Dirk Nowitzki, Mehmet Scholl, Niklas Süle
Star Wars | Dark, Metropolis, Traumschiff Surprise – Periode 1, Who Am I?, i.d.k
Stephen Curry | Dirk Nowitzki, Dirk Nowitzki, Dirk Nowitzki, Dirk Nowitzki, Manuel Neuer
Stranger Things | 8 Tage, Babylon Berlin, Dark, Tatort, Tatort
Sylvester Stallone | Henry Blanke, Jan Josef Liefers, Michael Bully Herbig, Michael Fassbender, Til Schweiger
Taylor Swift | Lena Meyer-Landrut, Lena Meyer-Landrut, Sarah Connor, Sarah Connor, Yvonne Catterfeld
Ted Bundy | Joachim Kroll, Josef Fritzl, Niels Högel, Rudolf Pleil, Rudolf Pleil
The Big Bang Theory | Doctor's Diary, Stromberg, Stromberg, der Tatortreiniger
The Crown | Babylon Berlin, Deutschland 83, Die Deutschen, Karl der Große
The Handmaid's Tale | Dark, Dark, Der Pass, Die Wanderhure, Er ist wieder da
The Walking Dead | Dark, Dark, Der goldene Handschuh, Zombies From Outer Space
Tom Brady | Franz Beckenbauer, Michael Ballack, Oliver Kahn, Thomas Müller, Uli Stein
Tom Cruise | Benno Fürmann, Benno Fürmann, Christoph Waltz, Elyas M'Barek, Matthias Schweighöfer
Tom Hanks | Christoph Waltz, Christoph Waltz, Daniel Brühl, Til Schweiger
Tom Hardy | Bruno Ganz, Michael Herbig, Til Schweiger, Wotan Wilke Möhring
Tom Holland | Daniel Brühl, Frederick Lau, Matthias Schweighöfer, Matthias Schweighöfer, Til Schweiger
Tupac Shakur | Farid Bang, Haftbefehl, Kollegah, Kristoffer Klauß, Peter Fox
United States | BRD, Bundesrepublik Deutschland, Deutschland, Germany, Germany
Vietnam War | Berlin Wall, First world war, Kosovokrieg, World War II
Wikipedia | Brockhaus, Brockhaus Enzyklopädie, Brockhaus Enzyklopädie, Duden, dict.cc
Will Smith | Daniel Brühl, Elyas M'Barek, Sascha Reimann, Sido, Til Schweiger
X-Men | Abwärts, Fantastic Four, Freaks, Krabat, Who Am I
YouTube | Lokalisten, MyVideo, MyVideo, ProSieben, lokalisten
Zac Efron | Frederick Lau, Lukas Rieger, Peter Kraus, Walter Sedlmayr
Zendaya | Franka Potente, Iris Berben, Lena Meyer-Landrut, Lena Meyer-Landrut, Yvonne Catterfeld

Table A.4: Top Wikipedia American→German adaptations.

Entity | Top Five WikiData Adaptations
Abraham Lincoln | Victor Adler, Johann Joachim Christoph Bode, Willem Barentsz, Hermann Wagener, Robert von Mohl
Al Capone | Hans H. Zerlett, Fritz Thyssen, Adam Rainer, Franz Winkelmeier, Christian Louis, Duke of Brunswick-Lüneburg
Alfred Hitchcock | Edgar Reitz, Jan Josef Liefers, Mario Adorf, Max Frisch, Armin Mueller-Stahl
Benedict Arnold | Hans-Georg Hess, Isabelle Eberhardt, Günther Heydemann, Max Schreck, Louis Blenker
Bill Gates | Ferdinand von Zeppelin, Günther Jauch, Nikolaus Harnoncourt, Sepp Blatter, Alfred Grosser
Britney Spears | Herta Müller, Günter Grass, Joachim Gauck, Hans-Dietrich Genscher, Koča Popović
Donald Trump | Max Frisch, Thomas Gottschalk, Jan Josef Liefers, Rainer Werner Fassbinder, Christa Wolf
Elvis Presley | Reinhard Lakomy, James Last, Herbert Achternbusch, Fritz Hauser, Hans-Peter Pfammatter
Ernest Hemingway | Karlheinz Böhm, Ricarda Huch, Michael Ballhaus, Arnold Zweig, Michael Fassbender
Frank Lloyd Wright | Ferdinand Hodler, Johan Zoffany, Hans Thoma, Arne Jacobsen, Lucas Cranach the Younger
George Washington | Friedrich Wilhelm von Seydlitz, Dagobert Sigmund von Wurmser, Heinz Guderian, Ernst Gideon von Laudon, George Olivier, count of Wallis
Henry Ford | Heinz Sielmann, Wieland Schmied, Manfred Krug, Paul Maar, Armin Mueller-Stahl
Hillary Clinton | Pope Benedict XVI, Willy Brandt, Angela Merkel, Helmut Schmidt, Kurt Biedenkopf
Homer Simpson | Elizabeth Lavenza, Hans Fugger, Baron Strucker, Herbert of Wetterau, Prince Johannes of Liechtenstein
Jimi Hendrix | Marius Müller-Westernhagen, Karl Richter, Reinhard Lakomy, Michael Cretu, Paul van Dyk
Kim Kardashian | Erika Mann, Frank Wedekind, Til Schweiger, Fritz von Opel, Carmen Electra
Marilyn Monroe | Gerhart M. Riegner, Viktor de Kowa, Otto Sander, Hans Hass, Dorothee Sölle
Michael Jordan | Jean-Claude Juncker, Richard von Weizsäcker, Herta Müller, Konrad Adenauer, Helmut Kohl
Louis Armstrong | Herbert Prikopa, Till Lindemann, Nico, Klaus Voormann, Jakob Adlung
Neil Armstrong | Stefan Hell, Franz-Ulrich Hartl, Reinhard Genzel, Charles Weissmann, Harald zur Hausen
Noam Chomsky | Günter Grass, Herta Müller, Heinrich Böll, Peter Handke, Juli Zeh
Oprah Winfrey | Günter Grass, Peter Scholl-Latour, Elfriede Jelinek, Juli Zeh, Christa Wolf
Orville Wright | Frank Thiess, Jessica Hausner, Elmar Wepper, Wolf Jobst Siedler, Marc Rothemund
Richard Nixon | Heinrich von Brentano, Ernst Benda, Gustav Heinemann, Heiner Geißler, Heinrich Albertz
Superman | Magneto, Nightcrawler, Sinterklaas, Silent Night, Victor Frankenstein
Steve Jobs | Victor Klemperer, Joschka Fischer, Jürgen Kuczynski, Joachim Fest, Dieter Hallervorden
Steven Spielberg | Herta Müller, Jean-Claude Juncker, Hans-Dietrich Genscher, Joachim Gauck, Koča Popović
Tiger Woods | Charles Dutoit, Shania Twain, Lise Meitner, Michael Haneke, Otto Hahn
Walt Disney | Shania Twain, Charles Dutoit, Lise Meitner, Otto Hahn, Michael Haneke
John F. Kennedy | Bernhard von Bülow, Otto von Habsburg, Hans-Jochen Vogel, Prince Henry of Prussia, Frederick Augustus III of Saxony
Charles Lindbergh | Pina Bausch, Ferdinand von Zeppelin, Nikolaus Harnoncourt, Jan Josef Liefers, Wolf Biermann
Rosa Parks | Hermann Lenz, Wilhelm Feldberg, Horst Tappert, Peter Stein, Gert Jonke
Serena Williams | Charles Dutoit, Lise Meitner, Michael Haneke, Richard von Coudenhove-Kalergi, Klaus Clusius

Table A.5: We show top-5 predictions out of the top-100 for American→German adaptations on the Veale NOC subset using WikiData. These are compared to our human annotations in our results.

Entity | Top Five 3CosAdd Adaptations: American→German adaptations on the Veale NOC
Abraham Lincoln | Napoleon, Napoléon Bonaparte, Erzherzog Johann, Otto von Bismarck, Kaiser Wilhelm II.
Al Capone | Nazis, SA-Mann, Verhaftungswellen, Judenverfolgung, Fluchthilfe
Alfred Hitchcock | Fritz Lang, Helmut Käutner, Willi Forst, Emil Jannings, Heinz Rühmann
Benedict Arnold | Russlandfeldzug 1812, Schlacht bei Roßbach, Jean-Victor Moreau, schwedischen Armee, Alexander Wassiljewitsch Suworow
Bill Gates | congstar, Alnatura, GMX, ChessBase, Gardeur
Britney Spears | Glasperlenspiel, Unheilig, Helene Fischer, Christina Aguilera, Herbert Grönemeyer
Charles Lindbergh | Segelflieger, Flugpioniere, Zeppelins, Adolf Hitler, Caproni
Donald Trump | Deutschland, Österreich, Trump, Strache, Bundestagswahlkampf
Elvis Presley | Udo Jürgens, Elvis Presley, Hits, den Beatles, der Beatles
Ernest Hemingway | Stefan Zweig, Franz Werfel, Joachim Ringelnatz, Hermann Hesse, Gottfried Benn
Frank Lloyd Wright | Adolf Loos, Le Corbusier, Bruno Schmitz, Entwürfen, Fritz Höger
George Washington | Napoléon Bonaparte, Friedrich dem Großen, Napoleon, Friedrich der Große, Napoleon Bonaparte
Henry Ford | Ferdinand Porsche, Büssing, Krupp, Ettore Bugatti, Steyr-Daimler-Puch
Hillary Clinton | Deutschland, Bundestagswahlkampf, Österreich, Sarkozy, Strache
Homer Simpson | Eingangsszene, verulkt, Schlusssequenz, Off-Stimme, Muminfamilie
Jack The Ripper:Ripper | Tat, Werwolf, Täter, Dritten Reich, Mörder
Jay Z | Xavier Naidoo, D-Bo, Sido, Rosenstolz, David Guetta
Jimi Hendrix | Udo Jürgens, Tangerine Dream, Jimi Hendrix, Pink Floyd, Depeche Mode
John F. Kennedy | Adolf Hitler, Bundeskanzlers, Adolf Hitlers, Adolf Hitler, Hitler
Kim Kardashian | Kaas, gotv, Frank Zander, Herbert Grönemeyer, Roland Kaiser
Louis Armstrong | Richard Tauber, Django Reinhardt, Udo Jürgens, Sidney Bechet, Jazzorchester
Marilyn Monroe | Marlene Dietrich, Lil Dagover, Elisabeth Bergner, Brigitte Bardot, Romy Schneider
Michael Jordan | Powerplay, Xavi, Predrag Mijatović, NHL-Historie, Franck Ribéry
Neil Armstrong | Juri Gagarin, Vorbeiflug, Weltraum, Raumstation Mir, Raumfahrer
Noam Chomsky | Jürgen Habermas, Hans-Ulrich Wehler, Carl Schmitt, Theodor W. Adorno, Norbert Elias
Oprah Winfrey | Harald Schmidt, Thomas Gottschalk, Satiresendung, ORF-Sendung, Hape Kerkeling
Orville Wright | Parseval, Luft Hansa, Hugo Junkers, Ernst Heinkel, Claude Dornier
Richard Nixon | Österreich, Deutschland, Bundeskanzler, Bundeskanzlers, Bundespräsidenten
Rosa Parks | NS-Militärjustiz, Franz Jägerstätter, NS-Opfer, Bücherverbrennung, Baum-Gruppe
Serena Williams | Dick Jaspers, Philipp Kohlschreiber, Semifinale, Achtelfinale, Dominic Thiem
Steve Jobs | Steve Jobs, Sony, Electronic Arts, Netscape, Atari
Steven Spielberg | Hörspielproduktion, Helmut Käutner, Fellini, Oliver Hirschbiegel, Kinofilm
Superman | Superman, Batman, Superhelden, Monster, Spider-Man
Tiger Woods | Rekordeuropameister, Österreich, spanische Team, ÖFB-Cupsieger, Deutschland
Walt Disney | Fritz Lang, Sascha-Film, Fellini, UFA, "Das Cabinet des Dr. Caligari"

Table A.6: We show top-5 predictions out of the top-100 for American→German adaptations on the Veale NOC subset using 3CosAdd. These are compared to our human annotations in our results.
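3CosAdd retrieves the candidate b that maximizes cos(b, entity) - cos(b, source culture) + cos(b, target culture) over the embedding vocabulary. The query below is only a minimal illustrative sketch using gensim's analogy interface; the vector file and entity tokens are hypothetical placeholders, not the exact embeddings or preprocessing used for Table A.6.

```python
# Illustrative 3CosAdd query with gensim; file name and tokens are placeholders.
from gensim.models import KeyedVectors

# Hypothetical aligned cross-lingual word vectors saved in gensim format.
vectors = KeyedVectors.load("aligned_de_en_vectors.kv")

# 3CosAdd: argmax_b cos(b, entity) - cos(b, source culture) + cos(b, target culture)
print(vectors.most_similar(positive=["Abraham_Lincoln", "Deutschland"],
                           negative=["United_States"], topn=5))
```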
Entity | Top Five Learned Adaptations: American→German adaptations on the Veale NOC
Abraham Lincoln | Konrad Adenauer, Helmut Schmidt, Willy Brandt, Helmut Kohl, Adenauer
Al Capone | Andreas Baader, Leo Katzenberger, Paul Schäfer, Strippel, Hermann Langbein
Alfred Hitchcock | Helmut Käutner, Til Schweiger, Mario Adorf, Paul Verhoeven, Dennis Hopper
Benedict Arnold | Otto von Bismarck, Bismarcks, Bismarck, Preußens, Kaiserreiches
Bill Gates | Martin Winterkorn, Volkswagen AG, DaimlerChrysler, Robert Bosch GmbH, Volkswagen AG
Britney Spears | Sarah Connor, Nena, Helene Fischer, Lena Meyer-Landrut, Moses Pelham
Charles Lindbergh | Chaim Weizmann, Tomáš Garrigue Masaryk, Ferdinand Sauerbruch, Fritz Haber, Chaim Arlosoroff
Donald Trump | Helmut Schmidt, Angela Merkel, Gerhard Schröder, Helmut Kohl, Bundesaußenminister
Elvis Presley | Udo Jürgens, Peter Maffay, Cliff Richard, Achim Reichel, Lou Reed
Ernest Hemingway | Paul Schlenther, Marcel Reich-Ranicki., Timothy Leary, Erwin Leiser, Alice Walker
Frank Lloyd Wright | Albert Einstein, Max Planck, Max Born, Hermann von Helmholtz, Arnold Sommerfeld
George Washington | Otto von Bismarck, Otto von Bismarck, Konrad Adenauer, Engelbert Dollfuß, Joseph Wirth
Henry Ford | Ernst Abbe, Carl Duisberg, Bubbe, Aby Warburg, Sybel
Hillary Clinton | Angela Merkel, Angela Merkel, Helmut Schmidt, Gerhard Schröder, Bundesinnenminister
Homer Simpson | Rolf Hochhuth, Carl Bernstein, Uwe Tellkamp, Wolfgang Völz, Richard Gere
Jack The Ripper:Ripper | Sarah Connor, Spike Jonze, Timberlake, "Das Urteil", "Nichts als die Wahrheit"
Jay Z | will.i.am, Moses Pelham, Silbermond, Xavier Naidoo, Kanye West
Jimi Hendrix | Peter Maffay, Udo Lindenberg, Depeche Mode, Xavier Naidoo, Die Toten Hosen
John F. Kennedy | Konrad Adenauer, Helmut Schmidt, Willy Brandt, Helmut Kohl, Bundeskanzler
Kim Kardashian | Heidi Klum, Ruth Moschner, Ellen DeGeneres, Circus HalliGalli, Oliver Pocher
Louis Armstrong | Peter Maffay, Radioaufnahmen, Udo Lindenberg, Achim Reichel, Helge Schneider
Marilyn Monroe | Walter Giller, Jessica Tandy, Liv Ullmann, Edgar Selge, Betty White
Michael Jordan | Dirk Nowitzki, Toni Kroos, Zlatan Ibrahimović, Xavi, Zinédine Zidane
Neil Armstrong | Max von Laue, Albert Einstein, Chaim Weizmann, Johannes R. Becher, Ernst Abbe
Noam Chomsky | Albert Einstein, Nobelpreisträger, Max Planck, American Psychological Association, Hans Bethe
Oprah Winfrey | Anja Kling, "Forsthaus Falkenau", Uschi Glas, "Saturday Night Live"., Anke Engelke
Orville Wright | Kawaishi, Rjabuschinski, Monistenbund, Dethmann, Leo Baeck Instituts
Richard Nixon | Helmut Schmidt, Konrad Adenauer, Willy Brandt, Helmut Kohl, Gerhard Schröder
Rosa Parks | Sophie Scholl, Die letzten Tage, Emil Jannings., Ruth Wilson, Monica Bleibtreu
Serena Williams | Max Schmeling, Wilfried Dietrich, Gottfried von Cramm, Henry Maske, László Kubala
Steve Jobs | DaimlerChrysler, Volkswagen, Siemens, Sanyo, Fujitsu
Steven Spielberg | Til Schweiger, Ethan Hawke, Matthias Schweighöfer, Samuel L. Jackson, Ryan Reynolds
Superman | Jabberwocky, Freaks, Scarface, Leatherface, Krabat
Tiger Woods | Dirk Nowitzki, deutschen U21-Nationalmannschaft, MTV Gießen, Mats Hummels, Franz Beckenbauer
Walt Disney | Helmut Dietl, Peter Ustinov, David Mamet, Rainer Werner Fassbinder, Sönke Wortmann

Table A.7: We show top-5 predictions out of the top-100 for American→German adaptations on the Veale NOC subset with our Learned Adaptation approach. These are compared to our human annotations in our results.

Appendix B: Diplomacy

Our Diplomacy (Chapter 7) appendix contains:
1. examples of game summaries written by players (Table B.1);
2. the game engine view of the board (Figure B.1);
3. examples of persuasion techniques (Table B.2);
4. Harbingers word lists that are used as features in the logistic regression model (Table B.3); a minimal sketch of how such lists become classifier features follows Table B.2; and
5. a full transcript between two players, Germany and Italy (Table B.4). Messages are long and carefully composed. This transcript is from the game described in Section 7.2.1 (Warning: it is dozens of pages).

B.1 Further Details

User | Summary
Italy | This was an interesting game, with some quality play all around, but I felt like I was playing harder than most of the others. I felt early on that I could count on Austria remaining loyal, which worked to my benefit, as it allowed me freedom to stab and defeat a very strong French player before he got his legs under him. At the same time, Austria was a little too generous in granting me centers and inviting me to come help him against Russia, which allowed me to take advantage once I was established in the Middle Atlantic.
Russia | Definitely a good game by Italy - which is interesting to me, because his initial press struck me as erratic and aggressive, making me not want to work with him. I'm curious if the same negotiating approach was taken with the other players who did work with him early on, or if he used a different negotiating approach with closer neighbors.

Table B.1: Users optionally provide free response descriptions of the game. This can be used for qualitative analysis or potentially for algorithmic summarization.

Figure B.1: The board game as implemented by Backstabbr. Players place moves on the board and the interface is scraped.

Principle | Example
Authority | Sent to Germany, England, Austria, Russia: So, England, Germany, Russia, y'all played a great turn last turn. You got me to stab my long-time ally and you ended our pretty excellent 7-year run as an alliance. Russia told me he was with me if I stab Austria. England told me he wanted me to solo so long as I would "teach him" and help his along to second place. Then y'all pulled the rug out from under me. It was clever and effective. At this stage, my excitement about the game has diminished quite a bit. And of course I'm happy to play on and take my lumps for falling for "Hey, I really want you to solo, just help me place second," but if you guys just want to call it a five-way draw among us and grab a beer together, while reviewing the statistics, that's really my preference. I am outnumbered and I obviously can't solo. And I'm sure some of you in the north are eager to send everyone else flying my way, but I expect Russia and England to be careful, and so I'm not sure there is much room to move forward without simply tipping the board to Germany's favor. I propose that we draw and hug it out.
Reciprocity | 1) You've been straight with me all game. 2) You have a much better ability to read the board than she does. 3) You're on the other side, so you can't really stab me, but I could totally see her moving to Tyrolia some time soon. 4) You're not in France's pocket.
Likability | Maine is beautiful! I used to go to scout camp there.
Scarcity | I'd like to have your final thoughts on A/R as quickly as possible so that I have time to execute a plan. But I understand if you want time to think about it.

Table B.2: Examples of persuasion from the games annotated with tactics from Cialdini and Goldstein (2004).
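The word lists in Table B.3 enter the model as simple phrase counts per message. The snippet below is a minimal illustrative sketch of that kind of featurization feeding a logistic regression; the abbreviated lexicons, toy messages, and labels are placeholders rather than the exact lists or pipeline used in Chapter 7.

```python
# Sketch: lexicon-count features (cf. Table B.3) for a message-level
# logistic regression over deception labels. All contents are illustrative.
from sklearn.linear_model import LogisticRegression
import numpy as np

# Abbreviated stand-ins for the full Harbingers-style lists in Table B.3.
LEXICONS = {
    "claim": ["as a result", "i believe", "therefore", "it follows that"],
    "premise": ["because", "for example", "given that", "since"],
    "contingency": ["if and when", "so that", "unless", "until"],
}

def lexicon_features(message):
    """Count how often each lexicon's phrases appear in one message."""
    text = message.lower()
    return [sum(text.count(phrase) for phrase in phrases)
            for phrases in LEXICONS.values()]

# Toy messages with toy binary labels (1 = actual lie, 0 = truthful).
messages = [
    "I believe we should ally, because it follows that we both gain.",
    "Sure, sounds good.",
]
labels = [1, 0]

X = np.array([lexicon_features(m) for m in messages])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))
```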
Feature | Key Words
claim | accordingly, as a result, consequently, conclude that, clearly, demonstrates that, entails, follows that, hence, however, implies, in fact, in my opinion, in short, in conclusion, indicates that, it follows that, it is highly probable that, it is my contention, it should be clear that, I believe, I mean, I think, must be that, on the contrary, points to the conclusions, proves that, shows that, so, suggests that, the most obvious explanation, "the point I'm trying to make", therefore, thus, the truth of the matter, to sum up, we may deduce
subjectivity | abandoned, abandonment, abandon, abase, abasement, abash, abate, abdicate, aberration, aberration, abhor, abhor, abhorred, abhorrence, abhorrent, abhorrently, abhors, abhors, abidance, abidance, abide, abject, abjectly, abjure, abilities, ability, able, abnormal, abolish, abominable, abominably, abominate, abomination, above, above-average, abound, abrade, abrasive, abrupt, abscond, absence, absentee, absent-minded, absolve, absolute, absolutely, absorbed, absurd, absurdity, absurdly, absurdness, abundant, abundance, abuse, abuse, abuse, abuses, abuses, abusive, abysmal, abysmally, abyss, accede, accentuate, accept, acceptance, acceptable, accessible, accidental, acclaim, acclaim, acclaimed, acclamation, accolade, accolades, accommodative, accomplish, accomplishment, accomplishments, accord, accordance, accordantly, accost, accountable, accurate, accurately, accursed, accusation, accusation, accusations, accusations, accuse, accuses, accusing, accusingly, acerbate, acerbic, acerbically, ache, achievable, achieve, achievement, achievements, acknowledge, acknowledgement, acquit, acrid, acridly, acridness, acrimonious, acrimoniously, acrimony, active, activist, activist, actual, actuality, actually, acumen, adamant, adamantly, adaptable, adaptability, adaptive, addict, addiction, adept, adeptly, adequate, adherence, adherent, adhesion, admirable, admirer, admirable, admirably, admiration, admire, admiring, admiringly, admission, admission, admit, admittedly, admonish, admonisher, admonishingly, admonishment, admonition . . .
expansion | additionally, also, alternatively, although, as an alternative, as if, as though, as well, besides, either or, else, except, finally, for example, for instance, further, furthermore, however, in addition, in fact, in other words, in particular, in short, in sum, in the end, in turn, indeed, instead, later, lest, likewise, meantime, meanwhile, moreover, much as, neither nor, next, nonetheless, nor, on the other hand, otherwise, overall, plus, rather, separately, similarly, specifically, then, ultimately, unless, until, when, while, yet
contingency | accordingly, as a result, as long as, because, consequently, hence, if and when, if then, in the end, in turn, indeed, insofar as, lest, now that, once, since, so that, then, thereby, therefore, thus, unless, until, when
premise | after all, assuming that, as, as indicated by, as shown, besides, because, deduced, derived from, due to, firstly, follows from, for, for example, for instance, for one thing, for the reason that, furthermore, given that, in addition, in light of, in that, in view of, in view of the fact that, indicated by, is supported by, may be inferred, moreover, owing to, researchers found that, secondly, this can be seen from, since, since the evidence is, what's more, whereas
temporal-future | after, afterward, as soon as, by then, finally, in the end, later, next, once, then, thereafter, till, ultimately, until
temporal-other | also, as long as, before, before and after, earlier, in turn, meantime, meanwhile, now that, previously, simultaneously, since, still, when, when and if, while
comparisons | after, although, as if, as though, besides, by comparison, by contrast, conversely, earlier, however, in contrast, in fact, in the end, indeed, instead, meanwhile, much as, neither nor, nevertheless, nonetheless, nor, on the contrary, on the one hand on the other hand, on the other hand, previously, rather, regardless, still, then, though, when, whereas, while, yet

Table B.3: The word lists used for our Harbingers (Niculae et al., 2015) logistic regression models.

B.2 A Full Game Example

# | Speaker | Actual Lie | Suspected Lie | Message
0 | Italy | Truth | Truth | Germany! Just the person I want to speak with. I have a somewhat crazy idea that I've always wanted to try with I/G, but I've never actually convinced the other guy to try it. And, what's worse, it might make you suspicious of me. So...do I suggest it? I'm thinking that this is a low stakes game, not a tournament or anything, and an interesting and unusual move set might make it more fun? That's my hope anyway. What is your appetite like for unusual and crazy?
1 | Germany | Truth | Truth | You've whet my appetite, Italy. What's the suggestion?
2 | Italy | Truth | None | Okay, don't hate me! Key West (Just thought of the name lol) Basic point is that I move to Tyr in Spring and into Mun in the Fall, while I take Tun with my fleet. I build A Ven/F Nap. You open to Ruh/Hol/Kie, and force Belgium. You wind up with 2 builds, and the sympathy and concern of your neighbors who are astonished at the crazy Italian. "What a stupid move, he can't hold Munich!" Trap is set to obliterate France in the Spring of 02. Bel S Mun - Bur, Ven - Pie, Tun - WMed. France won't see it coming, He will see that attack on Munich and think that both you and I will be occupied for a while. So Spring 02 should be a serious surprise.
Now, you?re taking risk here, because you?re giving up a home center for a turn hence the ?Key?), but I think you can see pretty clearly that I derive no benefit from trying to double-cross you. After all, Italy trying to hold Munich is just dumb. I?m from a school of thought that even trying to move to Munich is just dumb. But this would be the one exception. I can?t hold Munich, and even if I wanted to, it would give me an awkward snake formation in the middle of the board that is a great way to be first eliminated. So I think this works because you know (even more than Austria in a traditional Key Lepanto), that I?m not going to stab you. And doing it this way allows us to take Burgundy by surprise, it ensures you get Belgium, and it crushes your biggest rival and toughest border (France). Plus, it?ll be fun. The Key West! Thoughts? *This message crashed the Beta version of the bot due to its length and is not in the dataset. It is manually extracted and added here the purposes of readability* 187 3 Germany It seems like there are a lot of ways that could go Truth Truth wrong...I don?t see why France would see you ap- proaching/taking Munich?while I do nothing about it? and not immediately feel skittish 4 Italy Yeah, I can?t say I?ve tried it and it works, cause I?ve Truth None never tried it or seen it. But how I think it would work is (a) my Spring move looks like an attack on Austria, so it would not be surprising if you did not cover Munich. Then (b) you build two armies, which looks like we?re really at war and you?re going to eject me. Then we launch the attack in Spring. So there is really no part of this that would raise alarm bells with France. All that said, I?ve literally never done it before, and it does involve risk for you, so I?m not offended or concerned if it?s just not for you. I?m happy to play more conventionally too. Up to you. 5 Italy I am just sensing that you don?t like this idea, so shall Truth None we talk about something else? That was just a crazy idea I?ve always wanted to try. I?m happy to play more conservatively. 6 Italy Any thoughts? Truth None 7 Germany Sorry Italy I?ve been away doing, um, German things. Truth Truth Brewing Lagers? 8 Germany I don?t think I?m ready to go for that idea, however I?d Truth Lie be down for some good ol?-fashioned Austria-kicking? 9 Italy I am pretty conflicted about whether to guess that Truth Truth you were telling the truth or lying about the ?brewing lagers? thing. I am going to take it literally and say thumbs down even though I don?t think you meant it deceptively. 10 Italy But I think I can get over ?Lagergate? and we can still Truth Truth be friends. As of right now, I think Austria may be my most re- liable ally. I?m thinking I?d like to play as a Central Trio if you have any interest in that. Thoughts? 11 Germany We haven?t even passed a season yet and you have a Truth Truth ?most reliable ally?? I?ll consider this proposal but, basically, I?m not going to expose myself to risk from either of you until I?ve seen a bit of your behavior 188 12 Italy Well, at least I have an idea of who to trust. Obviously, Truth Truth my ideas are subject to change. I understand your desire to watch behavior before com- mitting to anything. I, personally, am a partner player. I look carefully early in the game for a small group to work with, and then I value loyalty and collaboration. I like to work closely with a tight-knit alliance. 
If you prefer to hop and back and forth, or play more of an individual game, then we might not be a good match. I?m looking for a loyal ally or two that I can coordinate with and make awesome moves with. Makes the game easier and a lot more fun. 13 Italy Just an FYI: I?ve now had both England and France Truth Truth suggest to me that I should move to Tyrolia and France will support me to Munich in the Fall. One saying that to me is not a big deal, but with both mentioning it, my alarm bells are going off. I am concerned about an E/F. I?m certainly not moving to Tyrolia. But I just want you to be cautious here. I feel like England and France are working together. 14 Germany I appreciate the tip, but I?m wondering why you?re so Truth Truth against ousting me from Munich if I haven?t explicitly agreed to be your ally? 15 Italy Because it is terrible, terrible play for Italy to attack Truth None Germany, in my view. If I were to attack you in Mu- nich, I could never hold Munich. So, all I would be doing is weakening you, and helping France, England, or both to get really big. I don?t have any long-term path going north. Helping France to take you down is a sucker?s play, whether you are working with me or not. 16 Italy Did France tell you he was moving to Burgundy, or Truth Truth was that a stab? 17 Germany I was not informed of it, no. And England is leading Truth Truth me to believe it?s part of a play for Belgium, so if they?re working together this might be a trick... Italy, you seem like a straight shooter, and Austria has confirmed with me about your two?s alliance. So I?ll confide in you?this is my first ever game of diplomacy, and I think that teaming up with the two of you could help me learn more and have more fun. So, if you?re still interested in a central powers alliance, I?m in. 18 Germany Okay full disclosure: I?m not very smart, and I acci- Truth Truth dentally let slip to England that you told me France was plotting to take Munich. I?m sorry for the error but I figured it was better to admit it so you know that England/France may not trust you. 19 Italy Okay, thanks for telling me. Truth Truth 20 Germany So, um, no alliance then? Truth Truth 189 21 Italy I do want to be allies. Sorry, busy weekend here run- Truth Truth ning around with bambinos. More to come. 22 Germany What would you think of helping me take Marseilles Truth Truth in two turns? 23 Italy Hi Germany, I?ll certainly consider that. Though, I?ll Truth Truth note: traditionally, Germany would help Italy to Mar- seilles if the two of them work together there. The reason is that: if I help you to Marseilles, I?m basically cut off from going west and getting anything myself. So, usually, Germany would help Italy into Marseilles to encourage Italy to come west and Germany would plan to take Paris, Belgium and Brest. 24 Germany Fair enough?I?ll help you take it, then, but I?ll need to Truth Truth deal with Belgium first. 25 Italy How are things going with England? I think that get- Truth Truth ting him to work with you is the main key here. 26 Germany I?m trying?I just offered to assist with taking Sweden Truth Truth in exchange for some assistance into Belgium...not sure if they?ll go for it... 27 Italy I?ll check with England and try to see where his head Truth Truth is at. 28 Germany I?ve actually been thinking about this game all day Truth None and have come up with a plan I like a bit better... but England still hasn?t responded to my initial offer. 29 Italy That?s the worst! 
Truth Truth And I?m glad to see you?re so focused on this in your first game. It?s a really great game if you put in the time and effort! 30 Germany You?re definitely telling the truth on that one. So can Truth Truth I count on you to move to piedmont this season? 31 Italy I don?t think I can afford to move to Piedmont this Truth Truth season. I don?t really trust Austria to avoid walking through that door if I leave it wide open. I think you need to get England on board to attack France. 32 Germany That?s valid. And actually I was conferring with Eng- Truth Truth land and we concluded that it?s not really gonna be possible for me to help you take Marseilles this year anyway. ...what are you and Austria planning for this year, then? I?m willing to tell you my plans in exchange as a gesture of trust. Have you communicated at all with England or France? 33 Italy Hi, are you there? Truth Truth Just woke up. England did return my message, but he did not tell me anything substantive so I really don?t know what he?s doing. I?m planning to move towards Turkey. 190 34 Italy Well, you?re in trouble. That England move is trouble. Truth Truth I?m going to try to convince him to change course. I suggest you be very kind to him, and don?t burn that bridge. I think your game hinges on turning England around. 35 Italy Hi Germany, Truth Truth I?m working hard on turning England. And I?m also trying to get Russia to come to your aid. Doing the best I can! I?ll keep you posted. 36 Germany England just told me that Russia is helping them to Truth Truth take Denmark so that may be a lost cause. Granted, the source for that intel is a serpentine jackal-spawn 37 Italy Okay, I?m reasonably sure that England wants to take Truth Truth the Channel and attack France now. I believe that you should basically do whatever Eng- land asks to help make this happen. As long as E attacks F, you will be in a much better position, and you?ll gain back centers quickly. What are you hearing? 38 Germany What are your plans for this turn? I can?t help but Truth Truth notice that Munich is surrounded by foreign armies on three sides... I wish I could be more helpful but I?m pretty much just treading water right now trying not to lose anything else 39 Italy Hey ? sorry, just getting back into this now. Truth Truth 40 Italy I have good news! (1) I am finally attacking France Truth Truth this turn. (2) I will be supporting Munich to hold from Tyrolia. Let?s turn this game around, yes? 41 Italy I am pretty sure that England is not attacking you Truth Truth this turn. And I am committed to supporting Munich holding. Make sure you don?t move Munich so that it can take my support. 42 Germany Okay, can do. Thanks! Truth Truth 43 Italy I suggest that you order: Kiel Support Berlin hold- Truth Truth ing Berlin Support Munich holding Helg to Holland Munich Support Berlin holding 44 Germany I agree completely?although I didn?t know that a Truth Truth country could hold *and* support at the same time! Thanks! 45 Germany Thanks Italy. Hope you?re enjoying the weather on Truth Truth the Anatolian 46 Italy I will be supporting Munich to hold again. And I?ll be Truth None trying to get Russia to back off your flank and protect himself against an Austrian stab that is coming. 191 47 Italy Two bits of advice: #1 I suggest you tell Russia that Truth None Austria is coming for him. You really want Russia to move Sil back to Gal. You might also suggest to Russia that is he supports you to Denmark, you will then support Russia back to Sweden. 
I don?t know yet if it actually makes sense to do that, but you want Russia thinking that you are eager to work with him. He?ll be hoping for a reason to break off his attack on you at this point. 48 Italy #2 Here is the move set I would suggest right now: Truth None Kiel Support Holland holding Holland Support Wales to Belgium (tell England you are going to order this support and he can take it or leave it) Munich Support Berlin holding Berlin Support Munich holding I think that both France and Russia are about to back off you, as they are both under fire at home. Just hold still, and soon you should be able to break out of this holding pattern. 49 Germany God, I hope so! I?m attempting to make that deal with Truth Truth russia now...and I?m talking with England re: Belgium 50 Italy It?s none of my business, but if you do plan to take Truth Truth Denmark, I strongly recommend you wait until Fall. I think the most important thing for you right now is getting England fully committed against France. If that happens, taking Denmark later will be easy. 51 Germany I think me and England are really on the same page at Truth Truth this point regarding France. I?m actually sort of run- ning counter-intelligence for England (and my friends to the south, of course!) with Russia right now. England and I talked about Denmark too...and it seems like one or the other of Denmark or Belgium should work out for me this year and I?m fine with that 52 Italy Great to hear. Thank you. Truth Truth 53 Germany Do you need me to disrupt Bur this year? I?ll need to Truth Truth seriously trust Russia if I?m going to risk not holding my eastern front, I think... 54 Italy I do think a move to Burgundy makes sense for you Truth Truth this turn, and I can?t imagine Russia attacking you here. He has a serious Austria problem. I suggest this: Mun - Bur Ruh - Bel Hol Support Ruh - Bel Ber - Kie Tell Russia that the last thing in the world you want to see is Austria run him over, and you?re willing to help keep Russia viable if necessary (you?re angling for Russia to disband his northern holdings this turn). 55 Italy And ask England nicely to support Ruh - Hol, with the Truth Truth explanation that you don?t plan to ask for Denmark back, but you think it would help you both to diminish France. (You?ll get Den back eventually, but you want England to think you don?t care about it). 192 56 Germany Thanks, I?ll work on these. ...Why didn?t you scooch Truth Truth into the Aegean behind Austria? You could have de- fended or even held Bulgaria this turn? 57 Germany England and I were talking about your moves for this Truth Truth season?what do you think of convoying Pie into Spa, supporting this with Wes, and then moving Tyr into Pie? 58 Germany This leaves Marseilles open for Bur to fall into if France Truth Truth goes that route, which gives me an opening into Bur 59 Italy That?s not bad. Truth Truth 60 Italy I was kind of thinking I should pick one or the other of Truth Truth Marseilles or Spain to attack and not tell a soul which one I?m going after. 61 Italy Do you really think it?s important to coordinate? Truth Truth 62 Italy I do think you?re best off moving to Burgundy. And Truth Truth there is some chance that we fail this turn. But I think we just take a guess and hope for the best. We?ll get him next turn if not this one. 63 Germany Okay?sorry for being nosy! I will try for bur on the Truth Truth off chance it shakes out that way 64 Italy Nah, you?re not being nosy at all. 
I mean, come on, Truth Truth we both know that I have no problem sticking my nose where it doesn?t belong. 65 Germany Marked as true Truth Truth 66 Italy I like to coordinate, but on these sort of 50/50 guesses, Truth Truth I kind of like to keep it secret so that if it doesn?t go well, I have nobody to blame but myself. 67 Italy Ha! Truth Truth 68 Germany Well, are you willing to humor my question about the Truth Truth Aegean, anyway? 69 Italy Sure. I was thinking of moving that fleet to Ionian. Truth Truth You think a move to Aegean is better? I?m not really sure, but let?s talk it through. 70 Germany No sorry I meant in hindsight?like this past turn you Truth Truth should have moved to Aeg so that this current turn, when Austria takes Rumania (from Bulgaria), you?d be there to cover Bulgaria so it couldn?t get scooped by the Black sea, and potentially you?d just get to take it. 71 Italy Not a bad point. I agree. Truth Truth 72 Italy Hmmmm, kind of a pointless lie if you ask me, but I Truth Truth won?t hold it against you. You?re in a tough spot. 73 Germany um what lie? I did exactly the moves you suggested! Truth Truth 74 Italy Ha! So sorry!! I meant that for France! Truth Truth 75 Italy You are my favorite. Truth Lie 76 Germany Marked as lie because clearly austria is your favorite. Truth Truth Speaking of, I assume that your seizing Trieste was mutually agreed upon? 77 Italy Yes ? agreed upon. Truth Truth 78 Germany That?s not what Austria said to England... Truth Truth 79 Italy Hmmmm, okay. Well, let?s just keep that between you Truth Truth and me then. 193 80 Germany You know Italy, I think we *do* need to coordinate Truth Truth your move this time?England and I have a shot at either Bur or Mao if one of Marseilles or Spain can be left open for France to fall into. This will improve all of our chances of crushing France quickly. 81 Italy Okay, I can dig it. What do you want me to do? Truth Truth 82 Germany Let me confer with England and get back to you. Glad Truth Truth to hear that though! 83 Italy So...any thoughts on how to approach this? Truth Truth 84 Germany It looks like England?s not willing to try for MAO if Truth Truth it means possibly losing the channel. However, they?ll bring the NWG fleet around to try for MAO next year. So if you could keep Marseilles open, it will help me to try and take Burgundy this turn. 85 Italy If I leave Marseilles open, would you kindly use Bur- Truth Truth gundy in the Fall to help me take Marseilles? (Likely that means ordering Burgundy to Gascony to cut sup- port) 86 Germany Will do. Truth Truth 87 Germany Okay, so I still have a teensy little bone to pick with Truth Truth you: on the off-chance that Austria wasn?t lying and you *did* take Trieste unexpectedly, I sort of worry that I might be next. Are you willing to tell me what your plans are for the Tri unit, or at least to warn me before any move into Tyrolia? 88 Italy Sure. But, you?ll see from my moves this turn that Truth Truth Austria is lying to you. 89 Italy I currently have Tri - Tyrolia. I like the unit there Truth Truth because it sets up an attack on Austria if I ever want to go that route (build A Ven and go east). Do you want me to keep Tyrolia clear? 90 Italy I?ll add ? I would never attack Germany as Italy. Set- Truth Truth ting myself as a giant column like that is just not de- fensible. It would be a terrible move. 91 Germany Not when that column is not-so-giant and in a turf war Truth Truth with France. 
92 Germany oh you mean setting *yourself* Truth Truth 93 Germany But you could easily pick off, say, Munich and not be Truth Truth a "giant column" 94 Italy I mean this sincerely: any Germany who does that is Truth Truth a terrible player. Why would I do that? I would need 2-3 units to hold one center. That is a net negative. And all of your units are doing things that are good for me in contain- ing your neighbors. I?ve been working hard in this game for you to succeed and knock back France and England. I can say with 100% certainty: I?m not going to attack you. I?m going to keep helping you as much as I can. 95 Italy That said, if you want me NOT to move to Tyrolia, I Truth Truth won?t move there. 194 96 Germany Nah, I just needed some reassurance :) Your logic is Truth Truth undenyable? enjoy your stay in tyr! 97 Germany *undeniable? That looks better Truth Truth 98 Italy I mean it sincerely. I think that England will want to Truth Truth coax me to attack you with him after France falls, but I?d much rather work with you against England. But first thing?s first ? let?s get rid of France. 99 Germany Agreed Truth Truth 100 Germany (On the france part) Truth Truth 101 Germany Sorry I won?t be able to cut off Gascony this turn...I Truth Truth probably should have just told you my moves; you could have advised me that supporting Mun-Bur was more important than Kie-Ruh 102 Italy No worries. We?ll crack this but eventually. Truth Truth Here is my suggestion for this turn: Kie - Den Hol S Bel holding Bel S Ruh - Bur Mun S Ruh - Bur Ruh - Bur 103 Italy I think you should suggest to England that he gets Truth Truth Sweden and St Petersburg, while you get Denmark back. That?s only fair, as you have been a loyal ally in the fight against France and you plan to continue to do that. 104 Germany The moves I had already planned differ in one respect: Truth Truth I thought it would be worth the risk to try moving Hol-Bel and therefore move Bel-Bur. Even if me and France are high-fiving in Bel for a few seasons it?s still mine, and it?s not like Holland has anything better to do while I?m still allies with England. ...The only reason I?m reluctant to make that agree- ment with England is that?while I think *you* and I have a good relationship?I really have not talked with Austria much at all, and I?m the next logical tar- get for them when Russia?s gone. And anything that?s bad for Russia right now is good for Austria. 105 Italy Hmmmm, I?m just not sure you should trust England Truth Truth enough right now to leave Holland open and Belgium essentially unguarded. France is a really good player, and he is no doubt work- ing hard to get England to turn on you. My personal take is that you are better off being a bit more conser- vative until you have Denmark back and England has moved another fleet towards France. But I can see it either way. 106 Italy With regard to Russia, talk it through with England. Truth Truth What you don?t want is England taking out Russia and giving you nothing. So, if England agrees to let Russia be for a while, then your plan sounds good. But if England is going to take Sweden, you really should get Denmark back. (I?m my view) 195 107 Germany Okay you?ve convinced me: it?s worth figuring out Truth Truth what E?s plans are for Russia at least. And you?re almost certainly right, from a rational per- spective, about leaving Holland/Belgium vulnerable to England. 
But I think England really is counting on my assistance in taking France, and because of that and other non-quantifiable reasons I trust them. 108 Italy Excellent. Obviously you have a much better feel for Truth Truth your relationship with England than I do. Just know that France is persuasive, and I?m sure that?s what he?s working on. He stopped talking to me, so I bet he?s trying to turn England. Just keep reassuring Eng- land that you want to work with him long-term so he doesn?t succumb to the Dark Side. 109 Italy Hi Germany ? well, I think we?re getting to a critical Truth Truth point in the game here. France held out a long time, but he?s much less of a threat now. I think the critical issue, for you, is England. I have some thoughts on the matter, and some infor- mation, but I?d like to feel confident that you and I will keep anything we say between us. I think of you as the one person who has been honest with me on every turn. You even tell me the truth when it?s bad news, or when you don?t completely trust me, and I like that. 110 Germany Okay, Italy. I won?t share any of this conversation. Truth Truth But in the interest of continued full disclosure, here?s what I think: England is a greater threat to *me* on the map, but *you* have a greater chance of soloing this game quickly, or pair-winning with Austria even sooner. And if I continue to collaborate with England, we at least have a chance of slowing that down. So I?m in sort of a conflicted spot 111 Italy This is why I like you. The full disclosure part. You Truth Truth tell me the truth even when the news isn?t great. 112 Italy My thoughts on the ?Germany/England forever so that Truth Truth at least we can stop the solo? strategy: (1) It?s quite early to be talking about solos. I am at 8, and Austria could take 3 from me any time, quite easily. (2) I don?t think England is thinking that way. I think he?s thinking that a dominant power will emerge in the north, and one will emerge in the south. And he?s like to be that dominant power. 113 Italy England?s pieces are not positioned well if he?s trying Truth Truth to attack France or contain Italy. He keeps Denmark guarded, and North Sea filled. He is not playing like he intends to stick with you, even though I?m sure he?s telling you that. 114 Italy You?re right that you don?t want to start a war with Truth Truth England right now. But, you must stick up for your- self, because nobody else will do that if you don?t. 196 115 Italy If I were you, this is what I would do: (1) keep warn- Truth Truth ing England about the dangers of Italy getting too big and insist that England moves his fleets towards MAO (Channel to Irish, Norwegian to NAO, North - Chan- nel), (2) insist on taking Denmark back. 116 Italy I would say something like this: Truth Truth England, I?m with you my friend, but we?re passed the stage of you needing to keep me under lock and key. I need to take Denmark back. I?m happy to support you to Brest to keep you growing, or you can grab Sweden. You have plenty of options other than keeping your ally?s center, but if you really want to be my ally long- term, you?ve got to show me that. 117 Italy I am hearing from England signs that he may be think- Truth Truth ing of attacking you soon. And I think you actually avoid that better by being strong and sticking up for yourself rather than being accommodating and letting him do whatever he wants to do. 
118 Germany Well, both you and France have now pointed out that Truth Truth England is strategically not in a good place to be my ally right now, and you are correct. I?ll be more cau- tious with my northern border, but I made a pretty strong argument for denmark this past turn and it fell on deaf ears 119 Germany ...which probably also should have been a sign for me Truth Truth 120 Italy Well, if you want, you could just take Denmark this Truth Truth next year and I don?t think England is in a position to retaliate. 121 Germany Probably not...has France been talking with you at all Truth Truth about their sunsetting strategy? They?ve indicated a willingness to work with you and me and a desire to see England get as few dots as possible 122 Italy He did say that to me too. Though, France has a long Truth Truth history of lying to me, so I really don?t trust him. 123 Germany Well France has actually been pretty honest with me, Truth Truth and I at least am certain that they wouldn?t betray me to England. So, I?m considering working with F to sabotage (or potentially full-on backstab) England this turn, which would have the side-effect of maybe taking some attention away from the south for you anyway. 124 Germany (and I?d be interested to hear your thoughts on this if Truth Truth you?re in the mood to give out free advice) 125 Italy Hi Germany ? sorry for the delay. Well...I think it?s Truth Truth really important that you get a build this turn either way. I don?t think England will get a build this turn, so if I were you I?d probably take Paris, build a fleet, and move on England after that. 126 Italy But it likely depends on how communication is going Truth Truth with England. If he?ll give you back Denmark, that might change the equation. 197 127 Germany I am waiting on England to make a decision about Truth Truth that?they claim to be thinking about it. 128 Germany England told me you said I was plotting with France. Truth Truth It makes sense you?d want to pit us against each other. 129 Italy Hey ? tried to send you a message earlier but not was Truth Truth down. England was telling me that you?re saying that I told you that England is plotting against you. The problem with telling England that is that he will stop giving me useful info. 130 Italy Truly, I don?t want you and England to fight. I am Truth Truth not trying to break you up. I suggested that you take Paris! I want you guys to work together with me against France. 131 Germany You don?t want us to fight, yet you betrayed both of Truth Truth our confidence with you in a way that makes us dis- trust each other? 132 Italy I really don?t think that?s a fair description. You guys Lie Truth both wanted to attack each other. I encouraged you both to keep working together. 133 Germany Just as long as it suits you. Are you going to give Truth Truth England Mao? 134 Italy Hmmm, should I be reading that as angry sarcastic Truth Truth with dagger eyes? (I?m not sure if I?m getting your tone right) 135 Italy We?re friends, right? I believe that every single mes- Lie Truth sage I?ve sent you all game has been truth, and I?ve gone out of my way to give you candid advice. Are we still friends? 136 Italy Regarding MAO ? I don?t know. What do you want Truth Truth me to do? I don?t have any set plan. 137 Germany Yep, there?s some sarcasm there. Looking back at your Truth Truth messages, I still don?t read them as encouraging collab- oration. 
And if you wanted us to be friends, you could have done that without betraying me to England by simply saying in your candid way "I don?t think you should do that for such and such reason". But you chose to increase E?s distrust of me. So I think you might be full of gnocchi and crap. My trust in you is a bit shaken but I still think we can have a working partnership with a bit more caution on my end. It would be my preference that you hold Mao, on the assumption that if it came down to a choice between partnering with me or England, you?d choose me. If that?s not the case, then as the filling of an England-Italy sandwich I?m in no position to retaliate anyway. 138 Italy Well, again, I like that you?re honest with me, even Truth Truth when the news is bad. 198 139 Italy I have to say that I?m surprised that you feel that I?ve Lie Truth betrayed your trust. I have been feeling like maybe I?ve been TOO helpful to you, and been a bit over the top in offering advice, etc., but it seems like I?ve misread the situation. 140 Germany No, it?s completely true that you?ve been too helpful, Truth Truth and I?m really really grateful for it because I?ve been able to learn so much from this game. But it?s also true that you didn?t have to tell England what you did, and all you stood to gain from it was that it shook my and E?s trust in each other. 141 Italy But I understand what you?re saying, and I much pre- Truth Truth fer to have a heart to heart like this, a frank airing of grievances, rather than being surprised by unkind moves on the board. https://youtu.be/xoirV6BbjOg 142 Germany Was not expecting seinfeld today and it was a pleasant Truth Truth surprise 143 Italy :) Truth Truth 144 Italy Here?s the deal: I like you better than England. Lie Truth 145 Italy I?m not sure how the next couple of turns are going Truth Truth to shake out. But I like that you tell me when you?re angry with me. I know that may seem like a small thing, but it?s just rare in Diplomacy. You get so many fake smiles. 146 Italy So, if it comes down to you or him, I?m choosing you. Truth Truth And I?ll work to do a better job of keeping your confi- dence. I certainly understand how important that is, as I hate it when people o that same thing to me. 147 Italy So no more playing mediator for me. Truth Truth 148 Germany Okay. Is it true that you want the channel? Truth Truth 149 Germany And are you planning to keep Vienna? Truth Truth 150 Italy I am not planning to keep Vienna. And yeah I?ve asked Truth Truth France for support to the Channel. Do you think he?s on board? 151 Germany I?m not sure. Is *England* on board? Is this some- Truth None thing England can know about? 152 Italy No, do you think France will Support me to the Chan- Truth Truth nel? 153 Germany France has asked my opinion on it, and I haven?t given Truth Truth it yet. To my estimation things look a lot better for me if you don?t end up there: I don?t want to see England in Mao, and I don?t want to see you snagging pieces of the north. 154 Italy Okay, well, here is my thinking. Tell France whatever Truth Truth you want to make him happy. Then tell me how you really feel. And if you don?t want me to go there, I won?t go there. 155 Germany If I hadn?t asked you about it, would that have just Truth Truth been another surprise, too? 199 156 Italy Absolutely. Truth Truth You and I have discussed our moves and been honest with each other every turn. But we have not been sharing all our moves or pre-clearing all of our moves. 
So that would have Ben a surprise in the same way that your moves are a surprise to me. (I never tell you what to do or insist on knowing). 157 Italy I kind of thought that you would have wanted me in Truth Truth the Channel because it commits me further against England, but I can understand what you?re saying now about wanting me to hang back. 158 Italy But I don?t think there is anything wrong with me Lie Truth contemplating moves without telling you all of them. You asked me about it, and I told you the truth. 159 Germany I do think that this move is a breach of general ex- Truth Truth pectation, which is the kind of thing I?d like to know about. And it?s also the kind of thing I?ve shared with you: case in point, my desire to stab England. 160 Italy Okay. Understood. Truth Truth 161 Germany Is there anything I could gain from seeing you in the Truth Truth channel? Would you support me taking Nth, and po- tentially seizing the island? 162 Germany Here?s what I?m thinking: I would be on board with Truth Truth you taking the channel (and I?d give France the green light to go ahead with it) if you would agree to bump Nao out of Mao using Wes, and if you?d be open to supporting some anti-English aggression while holding the channel so that I can get on equal footing with you, dot-wise. If you don?t want to agree to those terms, that?s okay, but I would strongly prefer not to see you in the chan- nel in that case. 163 Italy I?m going to be out of pocket this weekend, so let?s Truth Truth talk this through more on Monday. Generally, I agree that I?ll either stay out of the Channel or agree to your terms for entering there. 164 Germany If you decide to stay out of the channel, I have a deal Truth Truth that I like with England in the works. For that deal to go through, you?d have to agree to move Mao into Portugal to let England take Mao. Would you be amenable to that? 165 Germany (If this second offer is more to think about than a no- Truth Truth brainer, you can just mull it over and let me know monday) 166 Italy So, here is my concern with the England offer: If I?m Truth Truth taking Portugal, why do we want England in MAO? And why would he want to go to MAO? I?m not sure I understand that one. Can you explain? 200 167 Germany Well, when I initially proposed the deal I had forgot- Truth Truth ten that Portugal was promised to England. Then England agreed to it on the condition that you would confirm that move, so I figured E thought you would just move out of there next year? But now that I think about it, it?s probably worth asking England why they?d agree to that. 168 Italy I?d prefer that you not tell England I am considering Truth Truth moving to the Channel. I don?t think he would like that. 169 Italy I don?t really want to discuss this stuff with England Truth Truth at all. 170 Germany Well, England changed their mind about the plan I Truth Truth offered anyway. So, are you taking the channel? 171 Italy No, I?m not taking the Channel. Truth Truth 172 Germany Okay was that a recent decision? Because like an Truth Truth hour ago France said they were supporting you into the channel 173 Italy Well, when I tell you what I plan to do, do you turn Truth Truth around and tell France? This makes me uncomfortable speaking with you. 174 Germany I haven?t spoken to France since then. I didn?t realize Truth Truth you were giving the two of us different information on this particular subject. But I don?t think I?ve revealed anything to them about what you plan to do. 
Mostly because you haven?t told me. 175 Italy Well, I have been honest with both you and France. Lie Truth You told me that I need to promise you a set of things in order to take the Channel. I felt like it was more than I could be sure of doing, so I am not entering the Channel. I won?t go there without your permission. 176 Germany I appreciate that. And I?ll keep the remainder of this Truth Truth conversation between us unless I hear otherwise. Have you just recently made an agreement with England? 177 Germany I heard as much but I want to verify the contents of Truth Truth that agreement with you 178 Germany Btw, France just said that they submitted the orders Truth Truth to support you into the channel. 179 Italy I don?t have an agreement with England, but he is Truth Truth asking me about my moves and trying to get my help. 180 Germany Okay?then England is lying to me, saying that you?re Truth Truth helping support Eng-Brest. 181 Italy Ha! Yeah, fat chance. Lie Truth 182 Germany ...but did you lie to England about that? Or can I say Truth Truth to England that I don?t think you?ll actually provide that support? 183 Italy What is Paris up to? Truth Truth 184 Italy I suggest you just not tell England anything about my Truth Truth moves. 185 Italy Do you want me to support England to Brest? Truth None 186 Italy I guess I?m not sure what your goals are here. Truth Truth 201 187 Italy I just kind of feel like you?re grilling me with a lot Truth Truth of questions, but not telling me what you?re doing or what you want from me. 188 Germany *If* you support Eng-Brest, England has agreed to Truth Truth vacate denmark for me. If you don?t, I won?t get in the way of your channel thing. Any other questions? 189 Germany I have no sense of what you want or what your plan is, Truth Truth but I thought I?d been pretty clear: I want Denmark. I am reluctant to see you in the Channel if England remains powerful, but happy to see you there if they are weakened. 190 Italy Can?t you just force Denmark? Truth Truth 191 Germany Not without risking a swipe of Belgium Truth Truth 192 Germany And why force when you don?t have to Truth None 193 Italy Okay, I?ll support England to Brest. You take Den- Truth Truth mark. 194 Italy And you and I should be in position to take out Eng- Truth Truth land next year. 195 Germany Splendid! Truth Truth 196 Germany Glad everything worked out Truth Truth 197 Italy Thumbs up! Truth Truth 198 Italy Congratulations on retaking Denmark and getting two Truth Truth builds. You are playing really well right now. Respect. 199 Germany Congrats on having double-digit dots! I have some Truth Truth thoughts about taking out England, if you want to go full-stab this season... 200 Italy I think I do! Truth Truth 201 Italy What are you thinking? Truth Truth 202 Germany One option is to take the channel, another is to take Truth Truth Brest. Between you, me, and Picardy we can manage either, but it?s a question of which takes priority. If we chose Brest, I could also take a stab at seizing Nth this season, then we could try for the channel in fall. Or we could do channel first, Brest second. 203 Italy Yeah, that is all along the lines of what I?m thinking. Truth Truth How demanding does France sound right now? Does he want to be the one who takes Brest? 204 Germany Haven?t asked. But in general not demanding. Truth Truth 205 Italy Good! Lie Truth Still, I think we should show him some good faith by supporting him to Brest in Spring. 
We can decide in Fall whether it makes more sense for you to take it, but I think we want to keep France hungry. 206 Italy I would suggest something like this to ensure the En- Lie Truth glish fleet is disbanded: Pic - Bre MAO - Channel Par S Pic - Bre 207 Italy And Spa - Gas to cut off that retreat. Truth Truth 208 Italy You can take the North Sea on the same move and set Truth Truth up a convoy to the English mainland. Checkmate. 209 Germany Okay, I like the plan! I?ve asked France if they?re will- Truth Truth ing to move to Brest supported by me. 202 210 Germany Aren?t you concerned about England taking Mao? I?d Truth Truth sooner just have you pile on support into Bre so that Wes can support Mao holding 211 Italy That?s a good point, but the problem with that ap- Truth Truth proach is that Brest is not guaranteed. If England cute MAO and supports with the Channel, the attack fails. I think we are better off ensuring that the Brest fleet is disbanded. If we disband that fleet and take North Sea, an English fleet in MAO really just spreads him out and allows you to take the island faster. It?s not like he can get Portugal or Spain. 212 Germany Okay, but that means I?d prefer to take Brest myself Truth Truth this Spring, if France is okay with it. 213 Italy I think that we should offer France Brest in Spring. Lie Truth That ensures that he is with us. Then, if conditions are right in the Fall, I can support you into Brest. But...England can offer France Belgium, and I think he is sure to take that if we?re not even offering him a center, right? 214 Italy Better to keep France feeling like we?re going to keep Lie Truth him in the game. If you need the build in Fall, it?s easy for me to support you there. 215 Germany I guess I?m just wondering from France?s perspective Truth Truth why they?d *want* to stay in the game. Isn?t it possi- ble they?d rather move on with their life? That?s not rhetorical, I?m wondering what your perspective is as a veteran player 216 Italy Here is my take: If France just wanted to go down in a Lie Truth blaze of glory and say ?eff you? to England, he would have kept Irish Sea. He kept Pic, which is next to his home center, and gives him a chance to negotiate with both you and England. 217 Italy I think that means he is motivated to keep trying. And Truth Truth if he believes he can get Brest, he could legitimately get back to his feet. I know that?s what I?d be trying to do in his position. 218 Italy As the poker saying goes: ?a chip and a chair.? So Truth Truth long as you have one chip left, and you?re still in the tournament, you can always come back to win. 219 Italy Thoughts? Truth Truth 220 Germany I think that makes sense. Are you talking with Eng- Truth Truth land at all? 221 Italy I?m pretty wary of England right now. He asked me Lie Truth what I want to do, but I feel like he?s trying to get me to leave MAO open. That?s not terrible news, as it suggests that he won?t expect your move to North Sea. 222 Italy As long as he doesn?t move NAO to Norwegian, you?ve Truth Truth got a guaranteed supply center. 203 223 Germany Well E?d have to be a right dolt not to retreat to NWG. Truth Truth And right now they?re talking to me about supporting a move from Bre to Gas (the better for the two of us to stab you). 224 Germany What i mean is, there?s a good chance that Mao is safe Truth Truth if I "agree" to that deal 225 Germany Oh nevermind?they?re not going to convoy into Brest. Truth Truth So actually this pretty much guarantees that Eng and Nao will try for Mao. 
226 Italy Ahhhh, sneaky Devil! Thank you for letting me know. Lie Truth 227 Italy I still like our plan. Lie Truth 228 Italy I need to run for a bit. I?ll be around in a few hours. Lie Truth 229 Germany I think that knowing this, you should do as I suggest Truth Truth and not poke Eng. Just hold and let Wes support. I am 94% sure I can trust England to do as they say on this one. 230 Italy Okay. Should I support Pic to Bre? Lie Truth 231 Germany yes please. It?ll do us good with France too if we both Truth Truth support. 232 Italy Thumbs up! Truth Truth 233 Germany Actually, you should use Mao to support Spa-Gas, Truth Truth since we know that Brest is moving there. It will be beneficial to have you there if we decide to oust France from Bre in fall 234 Italy Consider it done. Lie Truth 235 Italy Hmmmm, heading anything from England? Truth Truth 236 Italy I?d love to talk if you?re there. I?m getting the impres- Lie Truth sion that England may actually be moving on you, and I think I have a good counter, but I also still think we should support the attack on Brest and take North Sea. 237 Italy I definitely think you should keep your moves the same. Truth Truth 238 Italy Nice! Get?em! He WAS moving on you. But we should Lie Truth be able to take about 3 off of him now. Very nice turn. 239 Germany Sorry; I was asleep by 9 last night Truth Truth why the move to Nao? Wouldn?t IRI be the more anti- England choice? With the move to Picardy and assuming a retreat to SKA, it looks like England has me pretty powerless this turn. 240 Germany So do you, it seems, if you have some kind of deal with Truth Truth Russia about Munich. 241 Italy Good morning. Truth Truth Just responding to your messages above. I think NAO and Irish are equally anti-English. They both give me a clear lane to attack Liverpool. I wasn?t sure if either one would be left open, but I took a gamble and it paid off. 242 Italy Re your move this turn, I don?t think you?re powerless. Lie Truth You should get a build I think and if not, you should be in position to smash England. 204 243 Italy I don?t have a deal regarding Munich, Germany. Lie Truth Frankly, I thought you would be a bit more joyful to- wards me. By attacking England, I have committed completely to working as your partner. 244 Germany I suppose you?re right. Initially I was thinking IRI also Truth Truth gives you channel access, but NWG access may be just as useful. Well when you control half a continent (and even more when you consider your influence over me, austria, and who knows who else!), there?s no such thing as com- plete commitment. I?m not so naive as to think your allegiance with me is going to last beyond its useful- ness, and with two fleets on the British isle that time is fast approaching. To be clear, I?m still giving you the truth and I still want to work with you. But you should really stop acting surprised when I?m slightly paranoid that a soon-to-be-dozen-dot-holder is gearing up to stab me 245 Italy Well, I dunno, it sounds like I should stab you. Is that Lie Truth what you?re trying to tell me? I like you. I like how hard you?ve worked in this game to rebound from a difficult start. I like that you e told me the truth, even when the news was bad. I like that you tell me when you don?t trust me. I have literally never told you a lie in this game, and I don?t intend to start now. Last turn I burned my bridge with England beyond repair. If you don?t want to work with me now, that?s really disappointing. 
246 Germany like I said, I *do* want to work with you. However, Truth Truth remember that thing I said about general expectations and being warned when they?re broken? Tyrolia is one of them and I think you knew that. And England *also* told me they?ve never told me a lie; I?m starting to think that?s Diplomacy-speak for "when convenient, I?ve used careful wording and half-truths to deceive you even when everything I said was technically true". It would help me to know that you see me being a benefit to you beyond taking out England. A natural next move for us would be to take out russia, and in that arena I have a positional advantage over you. Especially if I get two builds this turn, I?ll be able to sneak behind the troops in bohemia/galicia and help you break through. 247 Italy Yes ? here is how I expect and hope the game will Lie Truth play out: the two of us finish off England and France, while drifting towards the east a bit. With the builds we get this year, we essentially blitzkrieg the East. I have more units than you, but you have no opposition at all in the north, and can take Scandinavia, War and Mos without any trouble. 205 248 Italy I think that, in about two years, you and I will both be Lie Truth on about 14 centers, with the remnants of Russia and Austria between us, and we can decide how we want to resolve it. I?d be happy to agree to a small draw, or to shoot for a 17-17 two-way draw position, whichever you prefer. 249 Germany Well, I like the sound of all of that. In fact, it sounds Truth Truth ideal: there?s something poetic about the complete be- ginner and the expert (you?ve probably heard by now that you got doxxed) sharing a victory. I ask for a concession: As a show of good will, would you be willing to take only one of Liverpool or Por- tugal this year? (I know the Portugal request seems weird, but I like keeping France around and unless I?m mistaken they like me better than you ) 250 Italy Yes. I wasn?t planning to take Portugal anyway. Truth Truth 251 Italy I think it makes sense here for you to land an army in Lie Truth the English island while you can. Now that his army is off the island, he?s toast as soon as you do that. 252 Germany England?s just vindictive enough to try and stab Bel- Truth Truth gium with England and Picardy, though. I was plan- ning on keeping holland around as support. 253 Germany *by England I of course mean Eng Truth Truth 254 Italy I suggest the following: Lie Truth Gas - Liv (via convoy) Spa S MAO holding Mar hold Tyr - Tri Hol - Yor (via convoy) Bur S Bel Bel S North HEL S North Mun - Boh Par - Pic (to cut any potential support) 255 Italy England cannot take Belgium with those moves. Lie Truth 256 Italy Or I could move my fleet into Liverpool and use Gas Lie Truth to support Bre. I?m happy either way. 257 Germany I tried a double convoy in the sandbox once and it Truth Truth didn?t work! What is this witchcraft?!? 258 Germany At any rate, I prefer the fleet move to liverpool and Truth None Gascony?s support into Brest. And could Mao sup- port Bre into the Channel? No sense forcing France to disband. Bel will support it, too. 259 Italy Here are the orders needed to do a convoy! Holland Lie Truth move to Yorkshire North Sea convoy Holland to York- shire It is not a ?double convoy? as you only need one fleet to make it happen. But if your fleet in North Sea is dislodged, the con- voy will not go through. That is why I would suggest that HELG supports North Sea holding and Belgium supports North Sea holding. 
260 Germany No?I mean the one *you* were planning: Gascony to Truth Truth Liverpool 261 Germany It?s a double convoy because you?re convoying through Truth Truth Mao *and* Nao 206 262 Italy Ah, the orders there would be: Gascony - Liv MAO Truth Truth Convoy Gas - Liv NAO Convoy Gas - Liv 263 Italy So, I?ll move the fleet to Liverpool. And you want Lie Truth MAO to support Paris to Brest? 264 Italy Or wait, MAO supports Brest to Channel, and Gas Lie Truth supports Paris - Brest, right? 265 Germany yeah. I tried that once in the sandbox (or the equiva- Truth Truth lent: back when you had fleets in Lyo and Wes I tried a convoy from Pie to Naf). But I think I messed up the commands to the fleets. And yes the most recent message is correct. Those two things and Nao-Lvp 266 Italy Okay, confirmed. Lie Truth So I?ve got in: NAO - Liv MAO S Bre - Channel Gas S Par - Bre Spa - WES Mar S Gas holding Tyrolia - Trieste Sound right? 267 Germany It does. But If Tyr was bound for trieste anyway, why Truth Truth did you detour through Tyr at all? Why not just move to trieste last turn?? 268 Italy Austria would not have liked it. Truth Truth 269 Italy And he doesn?t know that it?s headed back there now Truth Truth (please don?t tell) 270 Germany Understood. Me and Austria don?t talk anyway. Also, Truth Truth do you have any sense of what England is planning to do? 271 Italy Ha! No I don?t. I?d imagine he is coming for me. But Lie Truth I don?t know that. 272 Italy If I were him, I?d defend Edi and London. Lie Truth 273 Germany So you haven?t been talking to England at all? I was Truth Truth sort of hoping you would know more, maybe help us take better advantage of their plans. 274 Germany Anyway, my moves are: Truth Truth Par-Bre Bel s Bre-Eng Hol s Bel holding And the rest within expected parameters. Correct? 275 Italy England has not said anything of substance to me. He Lie Truth was gracious about my move, but he won?t trust me again, and I would not trust anything he might say at this point. I haven?t asked him about his moves and he hasn?t told me. 276 Italy I thought you would Convoy Holland to Yorkshire and Truth Truth support Belgium from Burgundy. Also, can you please order Mun to Boh to cut support and allow me to hold Vienna while moving Tyrolia to Trieste? 277 Germany I *told* you I?m not risking that convoy *and* that in- Truth Truth stead Bel is supporting France into the Channel (which will heretofore be called the French Channel). And could I persuade you to move to IRI instead of taking Liverpool in exchange for the requested cut? 207 278 Italy Sorry, what is the requested cut? I understand that Truth Truth you don?t want me to take Liverpool or Portugal. What are you offering to me? (I don?t mean to be difficult, I just want to be sure I understand). 279 Italy Ah, you must mean Munich to Boh. Truth Truth 280 Italy Asking me to avoid taking Por and Liv is asking a lot. Truth Truth I want France to survive here, but I also want England taking units off the board, and I feel like you should too, right? 281 Germany I do. But I also want those dots for myself, of course. Truth Truth And there?s still the nonzero chance that you?ve ar- ranged with Boh to take Munich for yourself, so I?m taking a serious risk 282 Italy I will avoid taking Portugal, vacate Tyrolia, and sup- Lie Truth port you to Brest. I feel like I?m offering quite a lot in exchange for one cut support. And cutting that support does not put you in greater peril. 
If I had a deal with Russia for Munich (I don?t) I could cut Burgundy from Marseilles and support Rus- sia to Munich. Moving Mun to Boh to cut support is costless. 283 Germany You?re right. I just thought I?d put my best argument Truth Truth forward. I?ll do the cut. But I ask for something cost- less in exchange, and I really, really want it to stay just between us, ok? 284 Italy Understood and agreed. Truth Truth 285 Italy And I have no problem with you asking for more than Truth Truth you?re willing to settle for. That?s smart, and I do the same thing sometimes. If you don?t stick up for yourself, nobody else will. 286 Germany I *know* there?s more to your relationship with Eng- Truth Truth land than you?re telling me. The last message Eng- land sent to me hinted that if *I* wasn?t willing to work with them?and I haven?t said anything to them since?that maybe *you* would. And if England were to reach out to you, you?re too smart to just snub them. There?s advantage to be gained?either for both of us or just for yourself?from talking to them. The only reason I stopped was because I knew my word would be mud to them anyway. Earlier I was hoping you?d give me the truth about what you knew, and about what they might know. But you didn?t and that both disappoints and scares me. So I?m asking that you give me just a modicum of honesty here: what do you know? what does England know? 287 Italy I give you my word: I don?t know what England is Lie Lie going to do and I haven?t asked. 208 288 Italy He is still jovial with me and respectful. He has asked Lie Truth me to critique his play and to give him advice. But I do not know his moves, and I really don?t think he would tell me them if I asked. It certainly would not be info I could trust free I just lied to him about mine. 289 Germany But England?s desperate. Better to talk with *some- Truth Truth one* than just go in blind. And I doubt they?d turn to Russia or France because neither is really close enough/powerful enough to give real help. And there?s precedent for you negotiating with someone even as you stab them: France. ...and here?s the real accusation: for all your pretty words about a shared victory between you and me, you?ve been sneaky and you?ve always pitted me and England against each other to your benefit. My real fear here is that knowing my moves, and with a des- perate, jovial England seeking your advice, it would be so *easy* to just feed England enough info to keep me weak while you chow down on the Island. I know this from experience: back when you were do- ing 50/50 shots in the south of France, I did every- thing I could to find out what you were planning and feed it to France. This was merely a time-buying mea- sure, since France was outmatched and I would even- tually run out of pretenses to extract your move. But I wanted to gain more dots before you took over. And I assume others are like me, hence I suspect you now. I?m offering this confession in hopes that you?ll do the same. So just come clean and let?s approach this thing as equals? 290 Italy I am in my car, off to pick-up my kids from school. Truth Truth This deserves a proper response, so please give me some time. 291 Germany Abandon the children this is important Truth Truth 292 Italy So, I?m going to speak frankly here. I am rarely of- Lie Truth fended in a Diplomacy game, and I rarely say so when I am, but this message offends me. I?m trying to think about why I?m having such a strong reaction to it. 
I think it?s because you?re painting a picture of the game (both your actions and mine) which are totally different than my own perspective. (Continuing) 209 293 Italy From my perspective, you were on the ropes early. Lie Truth France and England were teaming up on you. You lost Denmark and France had Holland and Munich surrounded. You were in serious peril. I seriously went to extreme effort to keep you in the game. I spent hours talking with England and encour- aging him to turn around and go the other way. I completely ended my eastern campaign and spent two seasons just making the voyage over to France so that he didn?t have the bandwidth to continue his attack. I have vouched for you with Austria and Russia many times. I have supported Munich. And I have NEVER attacked you, even when people have asked me to do so and pledged to support me. 294 Italy I have been honest with you, I have worked hard for Lie Truth your success, and I?ve made a lot of proposals to you in which you gain centers; not me. Maybe I am just a bad ally, but I?m not sure I remem- ber an alliance in which I have done more to help my ally. Truly. 295 Italy And to hear that (1) You think I?ve been selfish and (2) Lie Truth You?ve been sabotaging me all along, that just doesn?t sit well with me. 296 Italy I have rarely asked for your help, and I?ve offered my Lie Truth help freely. I?ve provided my sincere best efforts to help you with tactics, and I have never sabotaged you. Not once. 297 Italy And if I?m totally honest with you, I could solo this Lie Truth game if I felt like lying to everyone and grabbing dots. I think I?ve got you all beat tactically. I just have more experience. But that?s not been my intent. 298 Italy I?ve spent hours today talking with England about how Lie Truth best to play Diplomacy. I?ve tried to give him some honest advice because he asked for it. But I don?t know his moves, I haven?t asked for them, and I?m not going to take advantage of that relationship to try to stab you. It legitimately did not cross my mind until you accused me of doing it. 299 Italy So, I?m frustrated by this accusation. Lie Truth 210 300 Germany And I appreciate all you?ve done for me, really I Truth Truth do. But ?completely ending your eastern campaign? is *not* something you did for me; your alliance with Austria dictated that. I felt bad for betraying you while I was doing it, but even then I knew it was the only way to keep the game going in the face of your and Austria?s might. And it *wasn?t* ?all along?, it was a few turns at best so that the rest of us would have a shot at you and Austria not pair-winning right out of the gate. And the only thing that keeps me from thinking you?re not gonna do just that on the next move anyway is my belief that you really do want the victory all to yourself, which is still consistent with everything you?ve done for me. Propping up a weak player at the expense of stronger ones is a classic tac- tic. (Continuing) 301 Germany And so, by the way, is trying to shame someone for Truth Truth raising extremely legitimate concerns. Whenever I bring up suspicion of you, you?re quick to remind me how much you?ve done for me to put me on the de- fensive and make me feel indebted. Well frankly that reeks of dishonesty. I never asked you to do those things. 302 Germany If you no longer trust me, so be it. I knew that was a Truth Truth risk when I made my confession. But i?d rather have a partnership based on mutual honesty. 
That?s another reason I confessed?you ought to know that my game philosophy (new as it is) is to trust the map and to trust history first and foremost. The parts of your history that I?ve seen indicate that you?re no saint, no matter what you may have done for me. And when the map shows that one player is clearly dominating and that player is you, you are being deeply naive if you think everyone else is just going to roll over and let you get away with it 303 Italy No, all thumbs up from me. If I were lying to you, Lie Truth I?d smile and say ?that sounds great.? I?m honest with you because I sincerely thought of us as partners. 211 304 Germany Oh but you?re *not*! You agreed to warn me of un- Truth Truth expected moves, then didn?t. When I brought this up you ignored it and misdirected me in hopes I?d for- get. You?ve revealed things to England without my permission, and then made up a story about it after the fact! And you can?t be a real partner with someone who is depending on your good graces to survive. That?s not a partnership. We could never be real partners unless we had some notion of equality, and I?m outmatched in both skill and numbers. You and Austria, however, were until recently a per- fect example of a true partnership. Dot-parity, coor- dinated attacks, really beautiful work. So don?t act as if you don?t know this to be true. We were never a partnership of that kind. 305 Italy Well, this is very disappointing to me, and I obviously Lie None disagree with the way you are characterizing me and this game. I have a reputation in this hobby for being sincere. Not for being duplicitous. It has always served me well. If you feel that way, then me continuing to explain myself isn?t going to change your mind. If you don?t want to work with me, then I can understand that. Let?s consider our deals and commitments to be void, and let?s play our games separately. If you have any deal you?d like to propose, I?ll consider them, but I won?t continue to try to help your game if you think I?m not sincerely trying to be helpful. 306 Italy Well, this game just got less fun. Truth Truth 307 Germany for you, maybe. Truth Truth 308 Italy Sent to Germany, England, Austria, Russia: So, Eng- Truth Truth land, Germany, Russia, y?all played a great turn last turn. You got me to stab my long-time ally and you ended our pretty excellent 7-year run as an alliance. Russia told me he was with me if I stab Austria. Eng- land told me he wanted me to solo so long as I would ?teach him? and help his along to second place. Then y?all pulled the rug out from under me. It was clever and effective. (End Part 1) 212 309 Italy (Part 2) At this stage, my excitement about the game Lie Truth has diminished quite a bit. And of course I?m happy to play on and take my lumps for falling for ?Hey, I really want you to solo, just help me place second,? but if you guys just want to call it a five-way draw among us and grab a beer together, while reviewing the statistics, that?s really my preference. I am outnumbered and I obviously can?t solo. And I?m sure some of you in the north are eager to send everyone else flying my way, but I expect Russia and England to be careful, and so I?m not sure there is much room to move forward without simply tipping the board to Germany?s favor. I propose that we draw and hug it out. 310 Germany I?m down for a five-way draw. Truth Truth ...and by the way, England was copy-pasting to me the most incriminating messages you sent them. So I knew you were giving England my moves. 
I do have a certain begrudging respect for your ability to deny, though.
311 Italy (Truth / Truth): Well, England is telling me he is happy to see me solo and wants second place... so, should I say "no"? I guess I should have. I was happy the way the game was going before all that.
312 Germany (Truth / Truth): Don't try and pin *your* greed and deceit on England! At least *own* it when you're ruthless.
313 Italy (Lie / Truth): You have been given an apple laced with poison. England's only move there was to make you hate me, and he did his job well. You should not let your view of me be defined by someone who has an incentive to make you never speak to me again. We can talk about it more after the game, but I had every intention of continuing to work with you, and I would have done that until England made his play.
314 Germany (Truth / Truth): I have no doubt you would have continued to work with me, but I take issue with someone who can be asked point-blank if they're sharing moves with another player and lie to my face. If you'd come clean, and explained how what you were doing actually *helped* me, somehow, we could have worked together. But you would rather have had me in the dark, and that's not sustainable in a partnership.
315 Italy (Truth / Truth): I was trying to play both sides, and England was lying to me, and forwarding my press to try to incriminate me. So, yes, I lied, and so did England. I apologize.
316 Italy (Lie / Truth): Will you please either vote to draw, or let us know that you'd like to play this one out? I am finding it difficult to motivate myself to speak with anyone if the game is just going to draw shortly. Thoughts?
317 Germany (Truth / Truth): I did vote to 5-way draw! And I did so again for this season. So it's not me who's keeping this game alive.
318 Italy (Truth / Truth): Well, as we approach the end of the academic study portion of the game, let me say once, with the truth detector activated, that I really enjoyed playing with you and thought you played really well.
319 Italy (Truth / Truth): Was it really your first game? You definitely played like a seasoned vet.
320 Germany (Truth / Truth): I really enjoyed playing with you, too! And yes, it really was my first game. Thanks for all your help and advice.
Table B.4: This is a full game transcript of a game between Germany and Italy. Each message carries two annotations: the sender's own label (Truth or Lie) followed by the receiver's perceived label. Occasional messages that did not receive a Suspected Lie annotation by the receiver are annotated as None.
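Each message above pairs the sender's self-annotation with the receiver's perceived annotation. As a minimal sketch of how one such annotated message could be represented in code (the class and field names below are illustrative assumptions, not the dataset's release format):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical container for one annotated Diplomacy message; the field names
# are illustrative, not the dataset's actual schema.
@dataclass
class DiplomacyMessage:
    index: int                     # position in the game transcript
    sender: str                    # e.g., "Italy"
    receiver: str                  # e.g., "Germany"
    text: str                      # free-form message text
    sender_label: str              # "Truth" or "Lie" (sender's self-annotation)
    receiver_label: Optional[str]  # "Truth", "Lie", or None if not annotated

    def is_actual_lie(self) -> bool:
        """The sender annotated the message as a lie."""
        return self.sender_label == "Lie"

    def is_caught_lie(self) -> bool:
        """A lie that the receiver also marked as suspicious."""
        return self.is_actual_lie() and self.receiver_label == "Lie"


# Message 287 from Table B.4: a lie that the receiver also suspected.
message_287 = DiplomacyMessage(
    index=287,
    sender="Italy",
    receiver="Germany",
    text="I give you my word: I don't know what England is going to do "
         "and I haven't asked.",
    sender_label="Lie",
    receiver_label="Lie",
)
assert message_287.is_actual_lie() and message_287.is_caught_lie()
```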
Appendix C: MultiDoGO

We provide the complete schemata for all tasks and domains. We enumerate the conversational biases, Agent dialogue acts, customer intent classes, and slot labels present in the data. For each item, we list the bias, act, intent, or slot name as well as a description and an example. Where relevant, we identify if the item is domain specific or generic. Slot value token(s) are identified within the slot label examples. Domains are given in all capital letters. A minimal illustrative sketch of how these annotation layers attach to a single customer turn follows Table C.2.

C.1 Conversational Biases

IntentChange: When a user starts a conversation with a particular intent in mind, but later changes their overall goal. Example: I'd like to check my balance. No wait, I mean I need to find out the routing number for the bank.
MultiIntent: When a user has multiple intents for a particular conversation. Example: I'd like to cancel my service and start new service in my new house.
MultiValue: When a user lists multiple slot values. Example: Can I have a pizza with pepperoni, sausage and mushrooms?
None: When there is no explicit bias given for a conversation. Example: N/A
OverFill: When a user over-fills or fills multiple slots while answering one prompt. Example: I'd like pineapple on a large pizza.
SlotChange: When a user changes their mind about a slot value that they've provided. Example: I'd like a large. Wait, actually can you make it a small?
SlotDeletion: When a user provides a value for a given slot, but later changes their mind and wants it to be removed. Example: I'd like pepperoni. Actually, wait - cancel that.

Table C.1: Conversational biases

C.2 Agent Dialogue Acts Schema

ElicitSlot: The agent is asking the customer questions to try and elicit a particular slot from the user. Many of these are domain specific, such as "FoodType" for the Fast Food domain or "CarBrand" for Insurance. Example: Can I get the make of your car?
ConfirmGoal: The agent is trying to elicit a "confirmation" response from the user to confirm the user's overall goal. Example: You want to order a pizza, right?
ConfirmSlot: The agent is trying to confirm a particular slot. Example: You said a large pizza, not a small, correct?
ElicitGoal: The agent is trying to elicit a particular goal (intent) from the customer. The goals will likely be particular to the domain/prompt that you are working on. It is possible for a conversation to have more than one goal, so this can appear more than once per conversation. Example: How can I help you today?
Pleasantries: Used for any human-to-human connection, discourse, or chit-chat that the agent might be engaging in with the customer for the purposes of politeness, friendliness, or to keep the conversation flowing in a normal, human way. In most of the other dialog acts, the agent is trying to help the user achieve their goal; in this act, they are not actively saying anything that contributes towards achieving the goal. Examples: Thanks for waiting. // You've been a great customer! // Sure, I can help you with that.
Other: Used only rarely, when the agent is completely outside of the realm of a normal human conversation. Example: Are we still connected?

Table C.2: Agent dialogue act schema
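As referenced above, the sketch below shows how the annotation layers can attach to a single exchange: the agent turn carries a dialogue act from Table C.2, and the customer turn carries an intent from Table C.3 plus token-level slot labels from Table C.4. The container and field names are assumptions for illustration, not the MultiDoGO release format.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical containers for one MultiDoGO exchange; names and types are
# illustrative only, not the corpus's release format.
@dataclass
class AgentTurn:
    text: str
    dialogue_act: str        # an act from Table C.2, e.g., "ElicitSlot"

@dataclass
class CustomerTurn:
    tokens: List[str]
    intent: str              # an intent class from Table C.3
    slot_labels: List[str]   # one label per token; "O" marks non-slot tokens

agent = AgentTurn(
    text="What size pizza would you like?",
    dialogue_act="ElicitSlot",
)

customer = CustomerTurn(
    tokens=["I'd", "like", "a", "large", "pepperoni", "pizza"],
    intent="OrderPizzaIntent",
    slot_labels=["O", "O", "O", "Size", "Ingredient", "FoodItem"],
)

# Intent labels apply to the whole turn; slot labels stay aligned token by token.
assert len(customer.tokens) == len(customer.slot_labels)
```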
C.3 Customer Intent Classes Schema

AIRLINES

BookFlight (domain-specific): Use when a customer tries to book a flight. Note: this intent should only be used when the customer asks to purchase and book, NOT when they are just searching for available flights. Example: I'd like to book a flight from New York City to San Francisco leaving Monday, Oct 29 and returning Friday November 9.
ChangeSeatAssignment (domain-specific): Use when a customer asks to change their seat assignment. Example: Can I change my seat from 40D to 30A?
ClosingGreeting (generic): Use when the customer says good-bye/have a nice day. Example: Bye // See ya // Have a good one
Confirmation (generic): Use when a customer confirms or agrees to something. Example: Yes // Ok
ContentOnly (generic): Use when the user is providing details to achieve their overall goal, usually in response to a question from the agent. Note: a conversation can never start with a ContentOnly intent; it is always a subgoal of a larger goal. Example: Agent: What is your phone number? Customer: 456-7890
GetBoardingPass (domain-specific): Use when the customer asks to get their boarding pass for their flight. Example: Can I get my boarding pass for flight 4675?
GetSeatInfo (domain-specific): Use when a customer asks what their seat number is for their flight. Example: Can you let me know what seat I have for my flight from Dallas?
OpeningGreeting (generic): Use when the customer says hello. Note: this intent only occurs at the beginning of a conversation. If the customer is saying "hello?" "hello?" in the middle of the conversation to try and get the agent's attention, that should be marked as OutOfDomain. Example: Hai // hi // hello // what's up?
OutofDomain (generic): Use when the customer has an unrelated request that is not covered by any of the Airlines intents. Example: Are you listening? // I wish I was Beyoncé
ThankYou (generic): Use when the customer says thank you to the agent. Example: Thank you // thanks
Rejection (generic): Use when the customer rejects or says no to something. Example: No // Nope

FAST FOOD

ClosingGreeting (generic): Use when the customer says good-bye/have a nice day. Example: Bye // See ya // Have a good one
Confirmation (generic): Use when a customer confirms or agrees to something. Example: Yes // Ok
ContentOnly (generic): Use when the user is providing details to achieve their overall goal, usually in response to a question from the agent. Note: a conversation can never start with a ContentOnly intent; it is always a subgoal of a larger goal. Example: Agent: What is your phone number? Customer: 456-7890
OpeningGreeting (generic): Use when the customer says hello. Note: this intent only occurs at the beginning of a conversation. If the customer is saying "hello?" "hello?" in the middle of the conversation to try and get the agent's attention, that should be marked as OutOfDomain. Example: Hai // hi // hello // what's up?
OrderBreakfastIntent (domain-specific): When you want to order breakfast. Example: Can I please have the pancakes
OrderBurgerIntent (domain-specific): When you want to order a burger. Example: Can I please have a Big Mac
OrderDessertIntent (domain-specific): When you want to order dessert. Example: I'd like an ice cream sundae please
OrderDrinkIntent (domain-specific): When you order a drink. Example: I'd like to order a small Coke
OrderPizzaIntent (domain-specific): When you want to order a pizza. Example: I'd like to order a pizza
OrderSaladIntent (domain-specific): When you want to order a salad. Example: I'd like to order a chicken salad
OrderSideIntent (domain-specific): When you want to order a side to your main meal. Example: I would like to order fries
OutofDomain (generic): Use when the customer has an unrelated request that is not covered by any of the Fast Food intents. Example: hello? Are you listening? // I wish I was Beyoncé
ThankYou (generic): Use when the customer says thank you to the agent. Example: Thank you // thanks
Rejection (generic): Use when the customer rejects or says no to something. Example: No // Nope

FINANCE

CheckBalance (domain-specific): Use when a customer wants to check their balance on a bank account or credit card. Example: How much money do I have on my checking account?
CheckOfferEligibility (domain-specific): Use when a customer asks to see if they qualify for a special offer they heard/saw in an advertisement. Example: I saw an ad about new, lower rates for your credit cards. As an old customer, do I qualify for these rates?
CloseAccount (domain-specific): Use when a customer wants to close their bank account or credit card. Example: I want to close my account ending in 1234.
ContentOnly (generic): Use when the user is providing details to achieve their overall goal, usually in response to a question from the agent. Note: a conversation can never start with a ContentOnly intent; it is always a subgoal of a larger goal. Example: Agent: What is your phone number? Customer: 456-7890
ClosingGreeting (generic): Use when the customer says goodbye. Example: Goodbye.
Confirmation (generic): Use when a customer confirms or agrees to something. Example: Yes. // OK.
DisputeCharge (domain-specific): Use when the customer complains about a charge on their bank account or credit card they didn't make, and wants to have it removed. Example: There's a charge on my card I don't recognize.
GetRoutingNumber (domain-specific): Use when the customer wants to find out the correct routing number for their bank account. Example: Can you tell me what the routing number is for my account?
OpenAccount (domain-specific): Use when a customer wants to open a new bank account or credit card. Example: I'd like to open a new savings account.
OpeningGreeting (generic): Use when the customer says hello. Note: this intent only occurs at the beginning of a conversation. If the customer is saying "hello?" "hello?" in the middle of the conversation to try and get the agent's attention, that should be marked as OutOfDomain. Example: Hai // hi // hello // what's up?
OrderChecks (domain-specific): Use when the customer wants to order checks.
OutOfDomain (generic): Use when the customer has a non-finance request that is not covered by any of the Finance intents. Example: Can I please have a Big Mac // I wish I was Beyoncé
Rejection (generic): Use when the customer rejects or says no to something. Example: No.
ReplaceCard (domain-specific): Use when the customer needs to replace a damaged or expired card.
ReportLostCard (domain-specific): Use when the customer lost their card or had it stolen. Example: I can't find my credit card.
RequestCreditLimitIncrease (domain-specific): Use when the customer wants to increase the credit limit on their card. Example: I would like to increase my credit limit.
ThankYou (generic): Use when the customer says thank you to the agent. Example: Thanks.
TransferMoney (domain-specific): Use when the customer wants to transfer money from one account to another. Example: I want to move some money from my checking account to my savings account.
UpdateAddress (domain-specific): Use when the customer wants to change their address because of a recent or upcoming move. Do not use this intent when the customer is correcting themselves after giving the incorrect address earlier in the same conversation. Example: I moved last week, so I'd like to update my address.

INSURANCE

ContentOnly (generic): Use when the user is providing details to achieve their overall goal, usually in response to a question from the agent. Note: a conversation can never start with a ContentOnly intent; it is always a subgoal of a larger goal. Example: Agent: What is your phone number? Customer: 456-7890
CheckClaimStatus (domain-specific): Use when the customer asks about the status of an insurance claim they filed. Example: I filed an insurance claim two weeks ago, but I still haven't got paid.
ClosingGreeting (generic): Use when the customer says goodbye. Example: Goodbye.
Confirmation (generic): Use when a customer confirms or agrees to something. Example: Yes. // OK.
GetProofOfInsurance (domain-specific): Use when a customer asks for proof of insurance documents. Example: I need a copy of my insurance documents for my car.
OpeningGreeting (generic): Use when the customer says hello. Note: this intent only occurs at the beginning of a conversation. If the customer is saying "hello?" "hello?" in the middle of the conversation to try and get the agent's attention, that should be marked as OutOfDomain. Example: Hai // hi // hello // what's up?
OutofDomain (generic): Use when the customer has an unrelated request that is not covered by any of the other Insurance intents. Example: Are you listening? // I wish I was Beyoncé
Rejection (generic): Use when the customer rejects or says no to something. Example: No.
ReportBrokenPhone (domain-specific): Use when the customer calls about a broken phone.
ThankYou (generic): Use when the customer says thank you to the agent. Example: Thanks.

MEDIA

CancelServiceIntent (domain-specific): Use this ONLY when a user wants to cancel their service. Example: I'd like to cancel my service
ClosingGreeting (generic): Use when the customer says good-bye/have a nice day. Example: Bye // See ya // Have a good one
Confirmation (generic): Use when a customer confirms or agrees to something. Example: Yes // Ok
ContentOnly (generic): Use when the user is providing details to achieve their overall goal, usually in response to a question from the agent. Note: a conversation can never start with a ContentOnly intent; it is always a subgoal of a larger goal. Example: Agent: What is your phone number? Customer: 456-7890
GetChannelPackageIntent (domain-specific): Use this intent when a user asks about getting a particular channel package. Example: I'd like to add the sports package to my current service.
GetInformationIntent (domain-specific): Use this intent when a user asks for more information about a product or a service. Example: Can you tell me more about the 15% off promotion for a 100 new channels?
OpeningGreeting (generic): Use when the customer says hello. Note: this intent only occurs at the beginning of a conversation. If the customer is saying "hello?" "hello?" in the middle of the conversation to try and get the agent's attention, that should be marked as OutOfDomain. Example: Hai // hi // hello // what's up?
OutofDomain (generic): Use when the customer has an unrelated request that is not covered by any of the Media intents. Example: hello? Are you listening? // I wish I was Beyoncé
StartServiceIntent (domain-specific): Use this intent when the user would like to sign up for a new service. Example: I'd like to start new cable service.
ThankYou (generic): Use when the customer says thank you to the agent. Example: Thank you // thanks
TransferServiceIntent (domain-specific): Use this intent when the user is interested in moving their service from where they currently live to a new address. Example: I'm moving and I'd like to move my service.
Rejection (generic): Use when the customer rejects or says no to something. Example: No // Nope
ViewBillsIntent (domain-specific): Use this when the user is interested in just viewing their bills. Example: I'd like to view the bill for my account please
ViewDataUsageIntent (domain-specific): Use this when the user is interested in finding out how much data they are using on their account. Example: I'd like to know how much data I'm using for my account
UpgradeServiceIntent (domain-specific): Use this intent when a user asks to upgrade their service. Example: I'd like to upgrade my service
UpdateAccountInfo (domain-specific): When the user wants to update their account info. Example: I'd like to update my account information

SOFTWARE

ChangeOrder (domain-specific): Use to make changes to a recurring order that has been previously set up. This is used only for making changes to an order, not for Customers to correct errors they made. Example: I need to increase my order for the PSR-E263 model Yamaha keyboards by 2 per month.
CheckServerStatus (domain-specific): Use for inquiries about the condition of the server; e.g., whether it's down or not. Example: Is the server down?
ClosingGreeting (generic): Use for any closing greeting. Example: Bye. // Goodbye. // Later. // Have a good day. // Good night. // Etc.
Confirmation (generic): Use when a Customer says yes, or otherwise agrees to an offer. Example: Yes. // Yeah. // Sounds good. // I'll take it. // Okay. // Etc.
ContentOnly (generic): Use when the user is providing details to achieve their overall goal, usually in response to a question from the agent. Note: a conversation can never start with a ContentOnly intent; it is always a subgoal of a larger goal. Example: Agent: What is your phone number? Customer: 456-7890
ExpenseReport (domain-specific): Use to begin writing a report for business expenses. Example: I want to update my expenses.
GetPromotions (domain-specific): Use when a Customer asks about any promotions or discounts the company might have on offer. Example: If I purchase a large quantity, will there be any discount on the price?
StartOrder (domain-specific): Use either to make a one-time order, or to set up a recurring order. Example: I'd like to order a Casio keyboard model No. 5601-V.
StopOrder (domain-specific): Use to cancel a recurring order that has previously been set up. Example: I need to cancel my monthly order for HDMI cables.
ProvideReceipt (domain-specific): Requests for a receipt for expenses or purchases. Example: I need a receipt for my January order of 20 computer monitors.
OpeningGreeting (generic): Use when the customer says hello. Note: this intent only occurs at the beginning of a conversation. If the customer is saying "hello?" "hello?" in the middle of the conversation to try and get the agent's attention, that should be marked as OutOfDomain. Example: Hai // hi // hello // what's up?
OutOfDomain (generic): Use for any comment not related to these categories. Example: Are you listening? // Are we still connected? // Can I get 3 large Cokes?
ReportBrokenSoftware (domain-specific): Use to cover reports that an app/software isn't working. Example: I can't log in to Skype.
SoftwareUpdate (domain-specific): Use whenever a Customer starts a conversation by asking what software updates are available. Example: What version of WhatsApp do I need to be using?
Rejection (generic): Use when a Customer says no, or otherwise turns down an offer. Example: No. // I don't want that. // That's all. // Nope. // Etc.
ThankYou (generic): Use when a Customer says thanks, or makes any expression of gratitude. Example: Thanks. // Thank you. // I appreciate it. // Etc.

Table C.3: Customer intent class schema, by domain
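Table C.3 mixes generic intents, which recur in every domain, with domain-specific ones. A small sketch of how such a schema could be encoded and used to check an annotation follows; the dictionary covers only a subset of the intents above, and its layout is an assumption for illustration rather than the project's actual tooling.

```python
# Hypothetical encoding of part of the Table C.3 schema: generic intents are
# shared across domains, while domain-specific intents are listed per domain.
GENERIC_INTENTS = {
    "OpeningGreeting", "ClosingGreeting", "Confirmation", "Rejection",
    "ContentOnly", "ThankYou", "OutOfDomain",
}

DOMAIN_INTENTS = {
    "airlines": {"BookFlight", "ChangeSeatAssignment", "GetBoardingPass", "GetSeatInfo"},
    "fastfood": {"OrderPizzaIntent", "OrderBurgerIntent", "OrderDrinkIntent", "OrderSideIntent"},
    "finance":  {"CheckBalance", "OpenAccount", "TransferMoney", "ReportLostCard"},
}

def is_valid_intent(domain: str, intent: str) -> bool:
    """An intent annotation is valid if it is generic or belongs to the domain."""
    return intent in GENERIC_INTENTS or intent in DOMAIN_INTENTS.get(domain, set())

assert is_valid_intent("airlines", "BookFlight")
assert is_valid_intent("finance", "ThankYou")          # generic intents apply everywhere
assert not is_valid_intent("finance", "OrderPizzaIntent")
```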
C.4 Slot Labels

AIRLINES

ArrivalCity: Used when a customer gives a city name for their intended arrival location. Example: Arrive in Boston on Monday
BookingConfirmationNumber: Used when a customer gives a booking number. Example: Booking #: 234925782
DepartureCity: Used when a customer gives a city name for their intended departure location. Example: Depart from London on Friday
Email: Used when a customer gives their email address. Example: bob@amazon.com
EndDate: Used when a customer provides the date of their return flight. If the customer only provides ONE date, mark it as StartDate. Example: Returning on 11-9-2018 // Nov 9 // Friday, November 9
FlightNumber: Used when a customer gives their flight number. Example: United 4567
Name: Used when a customer provides their name. Example: My name is Peter Parker
NewSeatNumber: Used when a customer is trying to change seat assignment. This tag should be applied to the new assignment. Example: Can I change my seat from 40D to 30A?
OldSeatNumber: Used when a customer is trying to change seat assignment. This tag should be applied to the old seat assignment. Example: Can I change my seat from 40D to 30A?
PhoneNumber: Used when a customer provides their phone number. Example: Phone number is 800-555-1234
Price: Used when a customer says the price of the flight/baggage/seat change etc. Example: I'd like to purchase the flight for $500.
SeatType: Used when a customer asks about a certain type of seat (aisle, middle, window). Example: Do you have any aisle seats available?
StartDate: Used when a customer provides the date of their first flight. If the customer only provides ONE date, mark it as StartDate. Example: Departing on 10-29-2018 // Oct 29 // Monday, October 29
TimeofArrival: Used when a customer provides the time of arrival of their flight. Example: Flight arriving at midnight // 1:30 PM // 13:00
TimeofDeparture: Used when a customer provides the time of departure of their flight. Example: Flight departing at midnight // 1:30 PM // 13:00

FAST FOOD

Size: Size of the food item. Example: medium // small // large
Quantity: Quantity of the food item. Example: I'd like 3 burgers // 2 large pizzas
Ingredient: Also applies to pizza toppings and burger toppings. Example: I'd like a large pizza with pepperoni and mushrooms
ExcludedIngredient: Refers to an ingredient that you would like to be removed from a food item. Example: I'd like a burger with no lettuce
FoodItem: The food item in the intent. Example: I'd like to order a large pizza
DrinkItem: The drink item in the intent. Example: I'd like an iced coffee

FINANCE

AccountNumber: Use on full or partial account numbers, but not on card numbers. (Use context to decide.) For transfers, use this for the origin of the money (see also TargetAccountNumber). Example: 123498765
Address: Use on any and all parts of addresses, including street names, street numbers, zip codes, states, etc. Example: 2982 Rose Ave, Seattle, WA
CardNumber: Use on full or partial card numbers, but not on account numbers. (Use context to decide.) Example: 1812 2245 3373 4567
ChargeAmount: Use on a sum of money that was charged, including the currency, if it is present. Example: $500
ChargeDate: Use on the date the account was charged on. It doesn't have to be an exact date expressed with number values. Example: today // last week // 06/19 // June 30th // 2018-04-18
ChargeTime: Use on the time the account was charged at. It doesn't have to be an exact time expressed with number values. Example: 8pm // morning // 4:18
CustomerName: Use on the name of the customer. Example: Jane Doe
LastUsedDate: Use on the date the card was last used. It doesn't have to be an exact date expressed with number values. Example: today // last week // 06/19 // June 30th // 2018-04-18
LastUsedTime: Use on the time the card was last used. It doesn't have to be an exact time expressed with number values. Example: 8pm // morning // 4:18
Offer: Use on the special offer the customer is trying to get. Example: lower rates
PoliceNotified: Use if the customer tells the agent they notified the police about a lost credit card without prompting; i.e., not responding to a yes/no question. Example: My credit card was stolen. I filed a police report, and now I'm calling you
ReplacementReason: Use on the word(s) indicating the reason the customer wants a replacement card. Example: expired // broken // doesn't work
SSN: Use on a full or partial social security number. Example: 1234
TargetAccountNumber: Use on the account number the customer wants to transfer money to. (See also AccountNumber.) Example: 123498765
TransferAmount: Use on a sum of money that the customer wants to transfer, including the currency, if it is present. Example: 100,000

INSURANCE

CarBrand: Use on the brand/make of the car. Don't include the model or year; those are different slot labels. Example: Ford
CarModel: Use on the model of the car. Don't include the brand or year; those are different slot labels. Example: Focus
CarYear: Use on the year the car was released. Don't include the make or model; those are different slot labels. Example: 2017
ClaimID: Use on the insurance claim ID (combination of letters and numbers). Use the context to differentiate from PolicyID. Example: ABC123
Name: Use on the name of the customer. Example: Jane Doe
EmailAddress: Use on full email addresses. Example: jane.doe@gmail.com
PhoneNumber: Use on phone numbers. If area codes or extensions are used, include those as well. Example: (999) 555-3434 // 123-9999 // 1-800-CALLME
PolicyID: Use on the insurance policy ID (combination of letters and numbers). Use the context to differentiate from ClaimID. Example: DEF345345345
SSN: Use on a full or partial social security number. Example: 1234

MEDIA

NewCity: Used for the city that the user is moving to. Example: I'd like to transfer service from Missoula, Montana to New York, New York
CurrentCity: Used for the city that the user is moving from. If the user only provides one city, use this slot. Example: I'd like to transfer service from Missoula, Montana to New York, New York
CurrentZipCode: Used for the zip code where the user is moving from. If the user only provides one zip code, use this slot. Example: I live at 02210.
NewZipCode: Used for the zip code where the user is moving to. Example: I'm moving to 90210
ServiceType: Used for all services provided by the cable company, such as phone, internet, TV, cable. Example: I'd like to purchase a cable bundle.
DataCategoryValues: Used for instances where the user asks about an amount of data or data usage. Example: I'd like to purchase the 5GB data plan for my phone.
UserName: Used for any name that the user gives; could be their name or a family member's name, or an online username. Example: Can you tell me about Jon's usage for the month? // My name is Nancy.
Date: Used for any and all dates given by the customer. Example: 12/25/2012 // March // last week
AccountID: The fake account ID that the user provided to the agent. Example: My account number is 123456
Price: Used for any intent where the user asks for a price or gives a price. Example: I'd like the cable package for $50 per month
Address: Used for slotting the entire address. Example: I live at 555 Washington St.
PhoneNumber: User's phone number. Example: My number is 456-7890
SSN: Use on a full or partial social security number. Example: 1234
Email: User's email address. Example: bradpitt@email.com
ChannelPackage: When the user is trying to order a cable package. Example: I'd like the sports package
Promotion: Used when the customer is asking about or ordering a promotion or discount. Example: I'd like the 15% off for three months premium cable package

SOFTWARE

Name: Use when a Customer gives a name, including first name, last name, or both. Example: My name is John Waters. // This is John from Downbeat Music.
AccountNumber: Used when a Customer provides a numeric or alphanumeric account number. Example: My account number is UFO5440.
CompanyName: Used when a Customer provides the name of their company. Example: I'm placing an order for Harlowe Instruments.
SoftwareName: Used when a Customer gives the name of the app they're calling about. Example: I'm trying to use Skype.
Password: Used when the Customer gives their individual or their company's numeric or alphanumeric password. Example: My company's password is 404NF.
ExpenseType: Used when the Customer identifies the kind of travel expense they're reporting. Example: I spent $632 on flights from Boston to Vancouver.
Cost: Used to identify any kind of cost in any currency. Example: I spent $632 on flights from Boston to Vancouver.
ApproverName: Used to identify the name of the manager of the department, or of the person placing the order, if they're different. Example: My manager's name is Karl Zinka // I'm Nera Vivaldi, and I have the authority to approve this transaction.
OrderNumber: Used to mark the order number that the conversation is about. Example: This is order #TPE29.
Quantity: Used to identify the quantity of item(s) in a particular order. Example: Please ship 3 laptops to our New Orleans office.
Date: Used to identify any date given by the Customer. Example: Please record my IT expenses of 189 on 11/26/18.
ItemCode: Used to note the catalog code for a particular item. Example: I'd like to order a Dell keyboard model No. 5601-V.
Frequency: Used to note how frequently the Customer wants this order to deliver. Example: Please send me 4 fewer HDMI cables per month.
Item: Used to state what particular item the Customer is looking for. Example: Do you have any Dell keyboards in stock?
Address: Used for when the customer provides an address. Example: 555 Washington St.

Table C.4: Customer slot label schema, by domain
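The slot labels above mark value tokens inside a customer utterance. The following sketch shows one way such span annotations can be expanded into per-token tags; the helper function is hypothetical and is not part of the MultiDoGO release.

```python
from typing import Dict, List, Tuple

def tag_slots(tokens: List[str], spans: Dict[str, Tuple[int, int]]) -> List[str]:
    """Expand half-open (start, end) token spans into one slot label per token.

    Hypothetical helper for illustration only: `spans` maps a slot label from
    Table C.4 to the token range it covers; every other token is tagged "O".
    """
    labels = ["O"] * len(tokens)
    for slot, (start, end) in spans.items():
        for position in range(start, end):
            labels[position] = slot
    return labels

# "Can I change my seat from 40D to 30A?" with the Airlines slots
# OldSeatNumber and NewSeatNumber from Table C.4.
tokens = ["Can", "I", "change", "my", "seat", "from", "40D", "to", "30A", "?"]
labels = tag_slots(tokens, {"OldSeatNumber": (6, 7), "NewSeatNumber": (8, 9)})
assert labels[6] == "OldSeatNumber" and labels[8] == "NewSeatNumber"
```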
Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, et al. 2016. Ms marco: A human generated machine reading comprehension dataset. In CoCo at Advances in Neural Information Processing Systems. Anton Bakhtin, David Wu, Adam Lerer, and Noam Brown. 2021. No-press diplo- macy from scratch. In Advances in Neural Information Processing Systems. David Bamman and Noah A. Smith. 2015. Contextualized Sarcasm Detection on Twitter. In Proceedings of ICWSM. Juan M. Banda, Ramya Tekumalla, Guanyu Wang, Jingyuan Yu, Tuo Liu, Yun- ing Ding, Ekaterina Artemova, Elena Tutubalina, and Gerardo Chowell. 2021. A large-scale covid-19 twitter chatter dataset for open scientific research?an inter- national collaboration. Epidemiologia, 2(3):315?324. Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the ACL workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65?72. Eric PS Baumer, David Mimno, Shion Guha, Emily Quan, and Geri K Gay. 2017. Comparing grounded theory and topic modeling: Extreme divergence or unlikely convergence? Journal of the Association for Information Science and Technology, 68(6):1397?1410. Janet Beavin Bavelas, Alex Black, Nicole Chovil, and Jennifer Mullett. 1990. Truths, lies, and equivocations: The effects of conflicting goals on discourse. Journal of Language and Social Psychology, 9(1-2):135?161. 231 Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Eval- uating discourse phenomena in neural machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 1304?1313, New Orleans, Louisiana. Association for Computational Linguistics. Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In Proceedings of the International Conference on Learning Representations. Yonatan Belinkov and James Glass. 2019. Analysis methods in neural language pro- cessing: A survey. Transactions of the Association for Computational Linguistics. Kathy L Bell and Bella M DePaulo. 1996. Liking and lying. Basic and Applied Social Psychology, 18(3):243?266. Adam Berger, Peter F Brown, Stephen A Della Pietra, Vincent J Della Pietra, John R Gillett, John Lafferty, Robert L Mercer, Harry Printz, and Lubos Ures. 1994. The candide system for machine translation. In Human Language Tech- nology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994. William E. Bogner, Margaret Edwards, Leon Zelechowski, Kevin J. Egan, William J. Rogers, Eloy Burciaga, and John Scott Arthur. 1974. Perjury: The forgotten offense. The Journal of Criminal Law and Criminology, 65(3):361?372. Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. En- riching word vectors with subword information. Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. In Proceedings of the International Conference on Learning Representations. L?on Bottou. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT?2010, pages 177?186. Springer. L. E. Bourne, J. Kole, and A. Healy. 2014. Expertise: defined, described, explained. Frontiers in Psychology, 5. Jordan Boyd-Graber. 2020. What question answering can learn from trivia nerds. 
In Proceedings of the Association for Computational Linguistics. Jordan Boyd-Graber, Christiane Fellbaum, Daniel Osherson, and Robert Schapire. 2006. Adding dense, weighted, connections to WordNet. In Proc. Global WordNet Conference 2006. Global WordNet Association. Jordan Boyd-Graber, Shi Feng, and Pedro Rodriguez. 2018. Human-Computer Question Answering: The Case for Quizbowl. Springer Verlag. 232 Michael T. Braun and Lyn M. Van Swol. 2016. Justifications offered, questions asked, and linguistic patterns in deceptive and truthful monetary interactions. Group Decision and Negotiation, 25(3):641?661. Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Ben- jamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Proceedings of Advances in Neural Information Processing Systems. Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. 2014. Concreteness ratings for 40 thousand generally known english word lemmas. Behavior Research Methods, 46:904?911. Pawe? Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, I?igo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Ga?i?. 2018. Multiwoz-a large-scale multi- domain wizard-of-oz dataset for task-oriented dialogue modelling. In Proceedings of Empirical Methods in Natural Language Processing. Michael Buhrmester, Tracy Kwang, and Samuel D Gosling. 2011. Amazon?s me- chanical Turk: A new source of inexpensive, yet high-quality data? Perspectives on psychological science: a journal of the Association for Psychological Science, 6 1:3?5. David B. Buller, Judee K. Burgoon, Aileen Buslig, and James Roiger. 1996. Testing interpersonal deception theory: The language of interpersonal deception. Com- munication Theory, 6(3):268?289. Marc Busch and Krzysztof Pelc. 2019. Words matter: How wto rulings handle controversy. International Studies Quarterly, 63. Bill Byrne, Karthik Krishnamoorthi, Chinnadhurai Sankar, Arvind Neelakantan, Daniel Duckworth, Semih Yavuz, Ben Goodrich, Amit Dubey, Kyu-Young Kim, and Andy Cedilnik. 2019. Taskmaster-1: Toward a realistic and diverse dialog dataset. In Proceedings of Empirical Methods in Natural Language Processing. Chris Callison-Burch, Lyle Ungar, and Ellie Pavlick. 2015. Crowdsourcing for nlp. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorial Abstracts, pages 2?3. Lewis Carroll and Lauro Maia Amorim. 2003. Translation and adaptation: differ- ences, intercrossings and conflicts in ana maria machado?s translation of alice in wonderland by lewis carroll. Cadernos de Tradu??o. 233 Jesse J Chandler and Gabriele Paolacci. 2017. Lie for a dime: When most pre- screening responses are honest but most study participants are impostors. Social Psychological and Personality Science, 8(5):500?508. Jonathan P. Chang, Justin Cheng, and Cristian Danescu-Niculescu-Mizil. 2020. Don?t let me be misunderstood: Comparing intentions and perceptions in on- line discussions. In Proceedings of the World Wide Web Conference. WendyWChapman, Prakash M Nadkarni, Lynette Hirschman, LeonardWD?avolio, Guergana K Savova, and Ozlem Uzuner. 2011. 
Overcoming barriers to nlp for clinical text: the role of shared tasks and the need for additional creative solutions. James Cheng, Monisha Manoharan, Yan Zhang, and Matthew Lease. 2015. Is there a doctor in the crowd? diagnosis needed! (for less than $5). iConference 2015 Proceedings. Johnny Chiodini. 2020. Playing Diplomacy online transformed the infamously brutal board game from unbearable to brilliant. Dicebreaker. Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. Quac: Question answering in context. In Proceedings of Empirical Methods in Natural Language Processing. Robert B Cialdini and Noah J Goldstein. 2004. Social influence: Compliance and conformity. Annual Review of Psychology, 55:591?621. Christopher Cieri, David Miller, and Kevin Walker. 2004. The fisher corpus: a resource for the next generations of speech-to-text. In Proceedings of the Language Resources and Evaluation Conference. Jonathan H Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020a. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages. In Transactions of the Association for Computational Linguistics. Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020b. Tydi qa: A benchmark for information-seeking question answering in typologically diverse languages. In Transactions of the Association for Computational Linguistics. Kevin Clark and Christopher D. Manning. 2016a. Deep reinforcement learning for mention-ranking coreference models. In Empirical Methods on Natural Language Processing. Kevin Clark and Christopher D. Manning. 2016b. Improving coreference resolu- tion by learning entity-level distributed representations. In Proceedings of the Association for Computational Linguistics, pages 643?653, Berlin, Germany. Cohen Coberly. 2019. Discord has surpassed 250 million registered users. Techspot. 234 Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37?46. Alexis Conneau, Guillaume Lample, Marc?Aurelio Ranzato, Ludovic Denoyer, and Herv? J?gou. 2017. Word translation without parallel data. arXiv preprint arXiv:1710.04087. B. Cornwell and D. C. Lundgren. 2001. Love on the internet: involvement and misrepresentation in romantic relationships in cyberspace vs. realspace. Compu- tational Human Behavior, 17:197?211. Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon Kleinberg. 2012. Echoes of power: Language effects and power differences in social interaction. In Proceedings of the World Wide Web Conference. Cristian Danescu-Niculescu-Mizil, Moritz Sudhof, Dan Jurafsky, Jure Leskovec, and Christopher Potts. 2013. A computational approach to politeness with application to social factors. In Proceedings of the Association for Computational Linguistics. Louise Del?ger, Magnus Merkel, and Pierre Zweigenbaum. 2009. Translating med- ical terminologies through word alignment in parallel text corpora. Journal of Biomedical Informatics, 42(4):692?701. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Ima- geNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition. Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric Xing, and Zhiting Hu. 2021. Compression, transduction, and creation: A unified framework for evaluating natural language generation. 
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7580?7605, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. Bella M DePaulo, James J Lindsay, Brian E Malone, Laura Muhlenbruck, Kelly Charlton, and Harris Cooper. 2003. Cues to deception. Psychological bulletin, 129(1):74. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019a. Bert: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019b. BERT: Pre-training of deep bidirectional transformers for language understanding. In Conference of the North American Chapter of the Association for Computational Linguistics. Rachna Dhamija, J. Doug Tygar, and Marti A. Hearst. 2006. Why phishing works. In International Conference on Human Factors in Computing Systems. 235 Djellel Difallah, Elena Filatova, and Panos Ipeirotis. 2018. Demographics and dy- namics of mechanical turk workers. In Proceedings of the eleventh ACM interna- tional conference on web search and data mining, pages 135?143. Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A Smith. 2019. Show your work: Improved reporting of experimental results. In Proceedings of Empirical Methods in Natural Language Processing. Chris Donahue, Bo Li, and Rohit Prabhavalkar. 2018. Exploring speech enhance- ment with generative adversarial networks for robust speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing. Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. 2015. Multi- task learning for multiple language translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Interna- tional Joint Conference on Natural Language Processing (Volume 1: Long Pa- pers), pages 1723?1732. Ellen Douglas-Cowie, Nick Campbell, Roddy Cowie, and Peter Roach. 2003. Emo- tional speech: Towards a new generation of databases. Speech communication, 40(1-2):33?60. Matthew Dunn, Levent Sagun, Mike Higgins, V. Ugur G?ney, Volkan Cirik, and Kyunghyun Cho. 2017. Searchqa: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179. Alfred D?rr. 2005. The cantatas of JS Bach: with their librettos in German-English parallel text. OUP Oxford. Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179?211. Allyson Ettinger, Ahmed Elgohary, and Philip Resnik. 2016. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pages 134?139, Berlin, Germany. Association for Computational Linguistics. Andrea Lepos Ferrari, Patricia Campos Pavan Baptista, Vanda Elisa Andres Felli, and David Coggon. 2010. Translation, adaptation and validation of the" cultural and psychosocial influences on disability (cupid) questionnaire" for use in brazil. Revista latino-americana de enfermagem, 18:1092?1098. David A. Ferrucci. 2010. Build Watson: an overview of DeepQA for the Jeopardy! challenge. In 19th International Conference on Parallel Architecture and Compi- lation Techniques, pages 1?2. Elena Filatova, Vasileios Hatzivassiloglou, and Kathleen McKeown. 2006. Automatic creation of domain templates. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 207?214. 
236 Timothy W. Finin, William Murnane, Anand Karandikar, Nicholas Keller, Justin Martineau, and Mark Dredze. 2010. Annotating named entities in twitter data with crowdsourcing. In Conference of the North American Chapter of the Asso- ciation for Computational Linguistics. Ailbhe Finnerty, Pavel Kucherbaev, Stefano Tranquillini, and Gregorio Convertino. 2013. Keep it simple: Reward and task design in crowdsourcing. In Proceedings of the Biannual Conference of the Italian Chapter of SIGCHI, pages 1?4. Tommaso Fornaciari and Massimo Poesio. 2013. Automatic deception detection in Italian court cases. Artificial intelligence and law, 21(3):303?340. Margalit Fox. 2013. Allan Calhamer dies at 81; invented Diplomacy game. New York Times. Roy Freedle. 2003. Correcting the sat?s ethnic and social-class bias: A method for reestimating sat scores. Harvard Educational Review, 73(1):1?43. Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nel- son F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. 2017. Allennlp: A deep semantic natural language processing platform. Hu Gengshen. 2003. Translation as adaptation and selection. Perspectives: Studies in Translatology, 11(4):283?291. Felix A Gers, J?rgen Schmidhuber, and Fred Cummins. 2000. Learning to forget: Continual prediction with lstm. Neural computation, 12(10):2451?2471. Edmund Gettier. 1963. Is justified true belief knowledge? Analysis, 23(6):121?123. Daniel Gigone and R. Hastie. 1993. The common knowledge effect: Information shar- ing and group judgment. Journal of Personality and Social Psychology, 65:959? 974. Codruta Girlea, Roxana Girju, and Eyal Amir. 2016. Psycholinguistic features for deceptive role detection in Werewolf. In Conference of the North American Chapter of the Association for Computational Linguistics. Anna Gladkova, Aleksandr Drozd, and Satoshi Matsuoka. 2016. Analogy-based detection of morphological and semantic relations with word embeddings: what works and what doesn?t. In Proceedings of the NAACL Student Research Work- shop, pages 8?15, San Diego, California. Association for Computational Linguis- tics. Barney G Glaser and Anselm L Strauss. 2017. Discovery of grounded theory: Strate- gies for qualitative research. Routledge. 237 Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th An- nual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 650?655, Melbourne, Australia. Association for Computational Linguistics. Stephanie Gokhman, Jeff Hancock, Poornima Prabhu, Myle Ott, and Claire Cardie. 2012. In search of a gold standard in studies of deception. In Proceedings of the Workshop on Computational Approaches to Deception Detection. Yoav Goldberg. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies, 10(1):1?309. Roberto Gonz?lez-Ib??ez, Smaranda Muresan, and Nina Wacholder. 2011. Identi- fying sarcasm in Twitter: A closer look. In Proceedings of the Association for Computational Linguistics. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. Advances in neural information processing systems, 27:2672?2680. Rob van der Goot. 2021. We need to talk about train-dev-test splits. 
In Proceedings of Empirical Methods in Natural Language Processing, pages 4485?4494, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics. Maharshi Gor, Kellie Webster, and Jordan Boyd-Graber. 2021a. Toward decon- founding the influence of subject?s demographic characteristics in question an- swering. In Proceedings of Empirical Methods in Natural Language Processing, page 6. Maharshi Gor, Kellie Webster, and Jordan Boyd-Graber. 2021b. Towards decon- founding the influence of subject?s demographic characteristics in question an- swering. arXiv preprint arXiv:2104.07571. Peter C Gordon and Randall Hendrick. 1998. The representation and processing of coreference in discourse. Cognitive science, 22(4):389?424. Abigail Green. 2003. Representing germany? the zollverein at the world exhibitions, 1851?1862. The Journal of Modern History, 75(4):836?863. Stephan Greene and Philip Resnik. 2009. More than words: Syntactic packaging and implicit sentiment. In Conference of the North American Chapter of the Association for Computational Linguistics. Justin Grimmer. 2010. A bayesian hierarchical topic model for political texts: Mea- suring expressed agendas in senate press releases. Political Analysis, 18(1):1?35. 238 Justin Grimmer and Brandon M. Stewart. 2013. Text as data: The promise and pit- falls of automatic content analysis methods for political texts. Political Analysis, 21(3):267?297. Liane Guillou and Christian Hardmeier. 2016. PROTEST: A test suite for evalu- ating pronouns in machine translation. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC?16), pages 636?643, Portoro?, Slovenia. European Language Resources Association (ELRA). Liane Guillou and Christian Hardmeier. 2018. Automatic reference-based evaluation of pronoun translation misses the point. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4797?4802, Brussels, Belgium. Association for Computational Linguistics. Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation artifacts in natural language inference data. In Conference of the North American Chapter of the Association for Computational Linguistics. Birgit Hamp and Helmut Feldweg. 1997. GermaNet - a lexical-semantic net for German. In Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications. Viktor Hangya and Alexander Fraser. 2019. Unsupervised parallel sentence extrac- tion with parallel segment detection helps machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1224?1234, Florence, Italy. Association for Computational Linguistics. Christian Hardmeier. 2012. Discourse in statistical machine translation. a survey and a case study. Discours. Revue de linguistique, psycholinguistique et informatique. A journal of linguistics, psycholinguistics and computational linguistics, (11). Christian Hardmeier and Marcello Federico. 2010. Modelling pronominal anaphora in statistical machine translation. In IWSLT (International Workshop on Spoken Language Translation); Paris, France; December 2nd and 3rd, 2010., pages 283? 289. Charles T Hemphill, John J Godfrey, and George R Doddington. 1990. The atis spo- ken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. 
Matthew Henderson, Pawel Budzianowski, I?igo Casanueva, Sam Coope, Daniela Gerz, Girish Kumar, Nikola Mrksic, Georgios Spithourakis, Pei-Hao Su, Ivan Vulic, and Tsung-HsienWen. 2019. A repository of conversational datasets. CoRR, abs/1904.06472. Matthew Henderson, Blaise Thomson, and Jason D. Williams. 2014. The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the 239 Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263?272, Philadelphia, PA, U.S.A. Association for Computational Linguistics. David Hill. 2014. Got your back. This American Life Podcast. Shuyuan Mary Ho, Jeffrey T Hancock, and Cheryl Booth. 2017. Ethical dilemma: Deception dynamics in computer-mediated group communication. Journal of the Association for Information Science and Technology. Sepp Hochreiter and J?rgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735?1780. Na Hong, Andrew Wen, Majid Rastegar Mojarad, Sunghwan Sohn, Hongfang Liu, and Guoqian Jiang. 2018. Standardizing heterogeneous annotation corpora using hl7 fhir for facilitating their reuse and integration in clinical nlp. In AMIA Annual Symposium Proceedings, volume 2018, page 574. American Medical Informatics Association. Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empir- ical Methods in Natural Language Processing, pages 1373?1378, Lisbon, Portugal. Association for Computational Linguistics. Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python. Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the Association for Computational Lin- guistics. Jeff Howe et al. 2006. The rise of crowdsourcing. Wired. Elle Hunt. 2016. Tay, microsoft?s ai chatbot, gets a crash course in racism from twitter. Jumayel Islam, Lu Xiao, and Robert E. Mercer. 2020. A lexicon-based approach for detecting hedges in informal text. In Proceedings of the 12th Language Re- sources and Evaluation Conference, pages 3109?3113, Marseille, France. European Language Resources Association. Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daum? III. 2016. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1534?1544. Alankar Jain, Bhargavi Paranjape, and Zachary C Lipton. 2019. Entity projection via machine translation for cross-lingual ner. arXiv preprint arXiv:1909.05356. 240 Nathalie Japkowicz and Shaju Stephen. 2002. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429?449. Carlos Jensen and Colin Potts. 2004. Privacy policies as decision-making tools: an evaluation of online privacy notices. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 471?478. Robin Jia and Percy Liang. 2017. Adversarial examples for evaluating reading com- prehension systems. In Proceedings of Empirical Methods in Natural Language Processing. Jeff Johnson, Matthijs Douze, and Herv? J?gou. 2017. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Karen Sp?rck Jones. 1994. Towards better nlp system evaluation. 
In Human Lan- guage Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994. Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke S. Zettlemoyer. 2017. Trivi- aQA: A large scale distantly supervised challenge dataset for reading comprehen- sion. In Proceedings of the Association for Computational Linguistics. Daniel Jurafsky and James H Martin. 2000. Speech and Language Processing: An In- troduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall PTR. Daniel Jurafsky, Elizabeth Shriberg, and Debra Biasca. 1997. Switchboard SWBD- DAMSL shallow-discourse-function annotation coders manual, draft 13. Technical Report 97-02, University of Colorado, Boulder Institute of Cognitive Science, Boulder, CO. David Jurgens and Roberto Navigli. 2014. It?s all fun and games until someone annotates: Video games with a purpose for linguistic annotation. In Transactions of the Association for Computational Linguistics. David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. 2014. SemEval- 2014 task 3: Cross-level semantic similarity. In Proceedings of the 8th Interna- tional Workshop on Semantic Evaluation (SemEval 2014), pages 17?26, Dublin, Ireland. Association for Computational Linguistics. Prathyusha Jwalapuram, Shafiq Joty, Irina Temnikova, and Preslav Nakov. 2019. Evaluating pronominal anaphora in machine translation: An evaluation measure and a test suite. In Proceedings of Empirical Methods in Natural Language Pro- cessing, pages 2964?2975, Hong Kong, China. Association for Computational Lin- guistics. Kushal Kafle, Mohammed Yousefhussien, and Christopher Kanan. 2017. Data aug- mentation for visual question answering. In Proceedings of the 10th International Conference on Natural Language Generation, pages 198?202. 241 Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1700?1709. Mary E Kaplar and Anne K Gordon. 2004. The enigma of altruistic lying: Per- spective differences in what motivates and justifies lie telling within romantic relationships. Personal Relationships, 11(4):489?507. Marzena Karpinska, Nader Akoury, and Mohit Iyyer. 2021. The perils of using me- chanical turk to evaluate open-ended text generation. In Proceedings of Empirical Methods in Natural Language Processing. Jungo Kasai, Kun Qian, Sairam Gurajada, Yunyao Li, and Lucian Popa. 2019. Low- resource deep entity resolution with transfer and active learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5851?5861, Florence, Italy. Association for Computational Linguistics. David Katan and Mustapha Taibi. 2004. Translating cultures: An introduction for translators, interpreters and mediators. Routledge. John F Kelley. 1984. An iterative design methodology for user-friendly natural lan- guage office information applications. ACM Transactions on Information Systems (TOIS), 2(1):26?41. Yunsu Kim, Yingbo Gao, and Hermann Ney. 2019. Effective cross-lingual transfer of neural machine translation models without shared vocabularies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1246?1257, Florence, Italy. Association for Computational Linguistics. Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimiza- tion. arXiv preprint arXiv:1412.6980. Bryan Klimt and Yiming Yang. 2004. 
The enron corpus: A new dataset for email classification research. In European Conference on Machine Learning, pages 217? 226. Springer. Shailesh Kochhar, Stefano Mazzocchi, and Praveen Paritosh. 2010. The anatomy of a large-scale human computation engine. In Proceedings of the acm sigkdd workshop on human computation, pages 10?17. Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79?86. Citeseer. Philipp Koehn. 2009. Statistical machine translation. Cambridge University Press. Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Fed- erico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and 242 demonstration sessions, pages 177?180. Association for Computational Linguis- tics. Maximilian K?per, Sabine Schulte im Walde, Max Kisselew, and Sebastian Pad?. 2016. Improving zero-shot-learning for german particle verbs by using training- space restrictions and local scaling. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 91?96. Ana Kozomara and Sam Griffiths-Jones. 2014. mirbase: annotating high confidence micrornas using deep sequencing data. Nucleic acids research, 42(D1):D68?D73. Klaus Krippendorff. 2018. Content analysis: An introduction to its methodology. Sage publications. Abhimanu Kumar and Matthew Lease. 2011. Learning to rank from a noisy crowd. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, pages 1221?1222. Srijan Kumar, Justin Cheng, Jure Leskovec, and V.S. Subrahmanian. 2017. An Army of Me: Sockpuppets in Online Discussion Communities. In Proceedings of the World Wide Web Conference, Republic and Canton of Geneva, Switzerland. Srijan Kumar and Neil Shah. 2018. False information on web and social media: A survey. In Social Media Analytics: Advances and Applications. CRC. Jonathan K. Kummerfeld. 2021. Quantifying and avoiding unfair qualification labour in crowdsourcing. In Proceedings of the Association for Computational Linguistics, pages 343?349, Online. Association for Computational Linguistics. Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453?466. Chia-Hsuan Lee, Shang-Ming Wang, Huan-Cheng Chang, and Hung-Yi Lee. 2018. Odsqa: Open-domain spoken question answering dataset. In 2018 IEEE Spoken Language Technology Workshop (SLT). Heeyoung Lee, Yves Peirsman, Angel Chang, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2011. Stanford?s multi-pass sieve coreference resolution system at the conll-2011 shared task. In Proceedings of the 15th conference on com- putational natural language learning: Shared task, pages 28?34. Association for Computational Linguistics. Honglak Lee, Peter Pham, Yan Largman, and Andrew Y Ng. 2009. Unsupervised fea- ture learning for audio classification using convolutional deep belief networks. In Proceedings of Advances in Neural Information Processing Systems, pages 1096? 1104. 243 Oliver Lemon, Kallirroi Georgila, James Henderson, and Matthew Stuttle. 2006. 
An isu dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the talk in-car system. In Demonstrations. Gondy Leroy and James E Endicott. 2012. Combining nlp with evidence-based methods to find text metrics related to perceived and actual text difficulty. In Proceedings of the 2nd ACM SIGHIT International Health Informatics Sympo- sium, pages 749?754. Jure Leskovec, Lars Backstrom, and Jon Kleinberg. 2009. Meme-tracking and the dynamics of the news cycle. In KDD. Anton Leuski, Ronakkumar Patel, David Traum, and Brandon Kennedy. 2009. Building effective question answering characters. In Proceedings of the Annual SIGDIAL Meeting on Discourse and Dialogue. Timothy R. Levine, Hee Sun Park, and Steven A. McCornack. 1999. Accuracy in detecting truths and lies: Documenting the ?veracity effect?. Communication Monographs, 66(2):125?144. Sarah Ita Levitan, Angel Maredia, and Julia Hirschberg. 2018. Linguistic cues to deception and perceived deception in interview dialogues. In Conference of the North American Chapter of the Association for Computational Linguistics. Omer Levy and Yoav Goldberg. 2014. Linguistic regularities in sparse and explicit word representations. In Proceedings of the eighteenth conference on computational natural language learning, pages 171?180. David D Lewis, Yiming Yang, Tony G Rose, and Fan Li. 2004. Rcv1: A new benchmark collection for text categorization research. Journal of machine learning research, 5(Apr):361?397. Patrick Lewis, Barlas O?uz, Ruty Rinott, Sebastian Riedel, and Holger Schwenk. 2019. Mlqa: Evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475. Patrick Lewis, Pontus Stenetorp, and Sebastian Riedel. 2020. Question and answer test-train overlap in open-domain question answering datasets. arXiv preprint arXiv:2008.02637. Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Juraf- sky. 2016. Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541. Jiwei Li, Will Monroe, Tianlin Shi, S?bastien Jean, Alan Ritter, and Dan Juraf- sky. 2017. Adversarial learning for neural dialogue generation. arXiv preprint arXiv:1701.06547. 244 Kaixuan Li, Xiujuan Xian, Jiafu Wang, and Niannian Yu. 2019. First-principle study on honeycomb fluorated-inte monolayer with large rashba spin splitting and direct bandgap. Applied Surface Science, 471:18?22. Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, and Jimmy Lin. 2020. Conversational question reformulation via sequence- to-sequence architectures and pretrained language models. arXiv preprint arXiv:2004.01909. Tal Linzen. 2020. How can we accelerate progress towards human-like linguistic generalization? In Proceedings of the Association for Computational Linguistics. Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2017. Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1?10. Sharid Lo?iciga, Liane Guillou, and Christian Hardmeier. 2017. What is it? dis- ambiguating the different readings of the pronoun ?it?. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1325? 1331, Copenhagen, Denmark. Association for Computational Linguistics. Tim Loh. 2020. Germany has its own Dr. Fauci?and actually follows his advice. Bloomberg. An?lia R Lopes and Celita S Trelha. 2013. 
Translation, cultural adaptation and evaluation of the psychometric properties of the falls risk awareness questionnaire (fraq): Fraq-brazil. Brazilian journal of physical therapy, 17:593?605. Max Louwerse, David Lin, Amanda Drescher, and Gun Semin. 2010. Linguistic cues predict fraudulent events in a corporate social network. In Proceedings of the Annual Meeting of the Cognitive Science Society. Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The ubuntu di- alogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909. Ryan Mac. 2021. Facebook apologizes after a.i. puts ?primates? label on video of black men. The New York Times. Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017. Towards deep learning models resistant to adversarial at- tacks. James Edwin Mahon. 2016. The definition of lying and deception. In The Stan- ford Encyclopedia of Philosophy, winter 2016 edition. Metaphysics Research Lab, Stanford University. Mitchell Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of english: The penn treebank. Computational Linguistics. 245 Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Con- ference on Language Resources and Evaluation (LREC?14), pages 216?223, Reyk- javik, Iceland. European Language Resources Association (ELRA). Sameen Maruf, Andr? F. T. Martins, and Gholamreza Haffari. 2019. Selective At- tention for Context-aware Neural Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 3092?3102, Minneapolis, Minnesota. Association for Computational Lin- guistics. Winter Mason and Siddharth Suri. 2012. Conducting behavioral research on ama- zon?s mechanical turk. Behavior research methods, 44(1):1?23. Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pages 43?52. R Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the Association for Computational Linguistics. Paul Michel and Graham Neubig. 2018. Mtnt: A testbed for machine translation of noisy text. In Proceedings of Empirical Methods in Natural Language Processing. Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson. 2018. Document-Level Neural Machine Translation with Hierarchical Attention Net- works. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2947?2954. Association for Computational Linguis- tics. Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017. Validation of an automatic metric for the accuracy of pronoun translation (APT). In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 17?25, Copenhagen, Denmark. Association for Computational Linguistics. Eric Mihail, Krishnan Lakshmi, Charette Francois, and Manning Christopher. 2017. Key-value retrieval networks for task-oriented dialogue. 
In Proceedings of the Special Interest Group on Discourse and Dialogue. Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013a. Exploiting Similarities among Languages for Machine Translation. CoRR, abs/1309.4. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In 246 Proceedings of Advances in Neural Information Processing Systems, pages 3111? 3119. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013c. Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems. Tom?? Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013d. Linguistic regularities in continuous space word representations. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 746?751. George A. Miller. 1995a. Wordnet: A lexical database for english. Communications of the ACM, 38:39?41. George A Miller. 1995b. Wordnet: a lexical database for english. Communications of the ACM, 38(11):39?41. Taniya Mishra and Srinivas Bangalore. 2010. Qme!: A speech-based question- answering system on mobile devices. In Conference of the North American Chapter of the Association for Computational Linguistics. Tom Mitchell. 1997. Introduction to machine learning. Machine Learning, 7:2?5. Saif Mohammad. 2018. Word affect intensities. In Proceedings of the Language Resources and Evaluation Conference, Miyazaki, Japan. European Language Re- sources Association (ELRA). Ethan Mollick and Ramana Nanda. 2016. Wisdom or madness? comparing crowds with expert evaluation in funding the arts. Manag. Sci., 62:1533?1553. Barzan Mozafari, Purnamrita Sarkar, Michael J. Franklin, Michael I. Jordan, and Samuel Madden. 2014. Scaling up crowd-sourcing to very large datasets: A case for active learning. Proc. VLDB Endow., 8:125?136. Mathias M?ller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A Large-Scale Test Set for the Evaluation of Context-Aware Pronoun Translation in Neural Machine Translation. In WMT 2018, Brussels, Belgium. Association for Compu- tational Linguistics. Kimberly A Neuendorf. 2017. The content analysis guidebook. Sage Publications. Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. 2019. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, pages 319?327, Florence, Italy. Association for Computational Linguistics. Matthew L Newman, James W Pennebaker, Diane S Berry, and Jane M Richards. 2003. Lying words: Predicting deception from linguistic styles. Personality and social psychology bulletin, 29(5):665?675. 247 Andrew Y Ng and Michael I Jordan. 2002. On discriminative vs. generative clas- sifiers: A comparison of logistic regression and naive bayes. In Proceedings of Advances in Neural Information Processing Systems, pages 841?848. An T Nguyen, Matthew Lease, and Byron C Wallace. 2019. Explainable modeling of annotations in crowdsourcing. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pages 575?579. Vlad Niculae, Srijan Kumar, Jordan Boyd-Graber, and Cristian Danescu-Niculescu- Mizil. 2015. Linguistic harbingers of betrayal: A case study on an online strategy game. In Proceedings of the Association for Computational Linguistics. Stefanie Nowak and Stefan R?ger. 2010. 
How reliable are annotations via crowd- sourcing: a study about inter-annotator agreement for multi-label image annota- tion. In Proceedings of the international conference on Multimedia information retrieval, pages 557?566. Sarah Oates. 2014. Russian state narrative in the digital age: Rewired propaganda in russian television news framing of malaysia airlines flight 17. In American Political Science Association Annual Meeting. Daniela Oliveira, Harold Rocha, Huizi Yang, Donovan Ellis, Sandeep Dommaraju, Melis Muradoglu, Devon Weir, Adam Soliman, Tian Lin, and Natalie Ebner. 2017. Dissecting spear phishing emails for older vs young adults: On the interplay of weapons of influence and life domains in predicting susceptibility to phishing. In International Conference on Human Factors in Computing Systems. Constantin Ora?san. 2003. Palinka: A highly customisable tool for discourse annota- tion. In Proceedings of the Fourth SIGdial Workshop of Discourse and Dialogue, pages 39?43. Myle Ott, Claire Cardie, and Jeff Hancock. 2012. Estimating the prevalence of deception in online review communities. In Proceedings of the World Wide Web Conference. Myle Ott, Yejin Choi, Claire Cardie, and Jeffrey T Hancock. 2011. Finding deceptive opinion spam by any stretch of the imagination. In Proceedings of the Association for Computational Linguistics. Paul Over. 2003. An introduction to duc 2003: Intrinsic evaluation of generic news text summarization systems. In Proceedings of Document Understanding Confer- ence 2003. Bo Pang, Lillian Lee, et al. 2008. Opinion mining and sentiment analysis. Founda- tions and Trends in Information Retrieval, 2(1?2):1?135. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Association for Computational Linguistics, pages 311?318. 248 Philip Paquette, Yuchen Lu, Seton S. Bocco, Max Smith, Satya Ortiz-Gagne, Jonathan K. Kummerfeld, Joelle Pineau, Satinder Singh, and Aaron C Courville. 2019. No-press diplomacy: Modeling multi-agent gameplay. In Advances in Neu- ral Information Processing Systems, volume 32, pages 4476?4487. Silvia Pareti and Tatiana Lando. 2018. Dialog intent structure: A hierarchical schema of linked dialog acts. In Proceedings of the Eleventh International Con- ference on Language Resources and Evaluation (LREC-2018), Miyazaki, Japan. European Languages Resources Association (ELRA). Rebecca J. Passonneau and Bob Carpenter. 2014. The benefits of a model of anno- tation. Transactions of the Association for Computational Linguistics, 2. Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In Conference on Neural Information Processing Systems: Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques. Vijayaditya Peddinti, Guoguo Chen, Vimal Manohar, Tom Ko, Daniel Povey, and Sanjeev Khudanpur. 2015. Jhu aspire system: Robust lvcsr with tdnns, ivector adaptation and rnn-lms. In Automatic Speech Recognition and Understanding (ASRU), IEEE Workshop on, pages 539?546. Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing. Ver?nica P?rez-Rosas, Mohamed Abouelenien, Rada Mihalcea, Yao Xiao, C. J. Lin- ton, and Mihai Burzo. 2016. 
Verbal and nonverbal clues for real-life deception detection. In Proceedings of Empirical Methods in Natural Language Processing. Ver?nica P?rez-Rosas, Bennett Kleinberg, Alexandra Lefevre, and Rada Mihalcea. 2017. Automatic detection of fake news. Proceedings of International Conference on Computational Linguistics. Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word represen- tations. In Conference of the North American Chapter of the Association for Computational Linguistics. Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, Sean Welleck, Yejin Choi, and Zaid Harchaoui. 2021. MAUVE: Measuring the gap between neural text and human text using divergence frontiers. In Proceedings of Advances in Neural Information Processing Systems. 249 Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Nagendra Goel, Mirko Hannemann, Yanmin Qian, Petr Schwarz, and Georg Stemmer. 2011. The Kaldi speech recog- nition toolkit. In IEEE Workshop on Automatic Speech Recognition and Under- standing. Vinodkumar Prabhakaran, Ajita John, and Dor?e D Seligmann. 2013. Power dy- namics in spoken interactions: a case study on 2012 Republican primary debates. In Proceedings of the World Wide Web Conference. Raimon H. R. Pruim, Maarten Mennes, Daan van Rooij, Alberto Llera, Jan K. Buitelaar, and Christian F. Beckmann. 2015. Ica-aroma: A robust ica-based strategy for removing motion artifacts from fmri data. NeuroImage, 112:267?277. Junfei Qiu, Qihui Wu, Guoru Ding, Yuhua Xu, and Shuo Feng. 2016. A survey of machine learning for big data processing. EURASIP Journal on Advances in Signal Processing, 2016(1):67. Chen Qu, Liu Yang, Cen Chen, Minghui Qiu, W Bruce Croft, and Mohit Iyyer. 2020. Open-retrieval conversational question answering. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, pages 539?548. Karthik Raghunathan, Heeyoung Lee, Sudarshan Rangarajan, Nathanael Cham- bers, Mihai Surdeanu, Dan Jurafsky, and Christopher D Manning. 2010. A multi- pass sieve for coreference resolution. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 492?501. Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don?t know: Unanswerable questions for SQuAD. In Proceedings of the Association for Com- putational Linguistics. Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Empirical Methods in Natural Language Processing. Juan Ramos. 2003. Using tf-idf to determine word relevance in document queries. In Proceedings of the International Conference of Machine Learning. Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249?266. Philip Resnik. 2022. What is an NLP task? Philip Resnik and Noah A Smith. 2003. The web as a parallel corpus. Computational Linguistics, 29(3):349?380. 250 Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should i trust you?" explaining the predictions of any classifier. In Knowledge Discovery and Data Mining. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2018. Semantically equiva- lent adversarial rules for debugging NLP models. In Proceedings of the Association for Computational Linguistics, pages 856?865, Melbourne, Australia. 
Association for Computational Linguistics. Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of nlp models with checklist. In Proceedings of the Association for Computational Linguistics. Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance frame- work: Bm25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333?389. Pedro Rodriguez, Joe Barrow, Alexander Miserlis Hoyle, John P. Lalor, Robin Jia, and Jordan Boyd-Graber. 2021. Evaluation examples are not equally informative: How should that change NLP leaderboards? In Proceedings of the Association for Computational Linguistics, pages 4486?4503, Online. Association for Compu- tational Linguistics. Pedro Rodriguez, Shi Feng, Mohit Iyyer, He He, and Jordan Boyd-Graber. 2019. Quizbowl: The case for incremental question answering. arXiv preprint arXiv:1904.04792. Anna Rogers, Aleksandr Drozd, and Bofang Li. 2017. The (too many) problems of analogical reasoning with word vectors. In Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (* SEM 2017), pages 135?148. Sebastian Ruder. 2016. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747. Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Conference of the North American Chapter of the Association for Computational Linguistics. Betsy Rymes and Andrea R Leone. 2014. Citizen sociolinguistics: A new media methodology for understanding language and social life. Working Papers in Ed- ucational Linguistics (WPEL), 29(2):4. Marta Sabou, Kalina Bontcheva, Leon Derczynski, and Arno Scharl. 2014. Corpus annotation through crowdsourcing: Towards best practice guidelines. In Proceed- ings of the Language Resources and Evaluation Conference. Claude Sammut and Geoffrey I. Webb, editors. 2010. Mean Squared Error , pages 653?653. Springer US, Boston, MA. 251 Hagen Schulze. 1991. The Course of German Nationalism: From Frederick the Great to Bismarck 1763?1867 . Cambridge University Press. Sebastian Schuster, Sonal Gupta, Rushin Shah, and Mike Lewis. 2018. Cross- lingual transfer learning for multilingual task oriented dialog. arXiv preprint arXiv:1810.13327. Jo?o Sedoc, Sven Buechel, Yehonathan Nachmany, Anneke Buffone, and Lyle Ungar. 2020. Learning word ratings for empathy and distress from document-level user responses. In Proceedings of the Language Resources and Evaluation Conference, pages 1664?1673, Marseille, France. European Language Resources Association. Rico Sennrich. 2017. How grammatical is character-level neural machine translation? assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 376?382, Valencia, Spain. Association for Computational Linguistics. Kathryn Sharpe Wessling, Joel Huber, and Oded Netzer. 2017. MTurk Character Misrepresentation: Assessment and Solutions. Journal of Consumer Research, 44(1):211?230. Elben Shira and Matthew Lease. 2011. Expert search on code repositories. Tech- nical Report TR-11-42, Department of Computer Science, University of Texas at Austin. Ben Shneiderman. 2000. Designing trust into online experiences. Communications of the ACM, 43(12):57?59. Frederick A Siegler. 1966. Lying. American Philosophical Quarterly, 3(2):128?136. Jonathan Silvertown. 2009. A new dawn for citizen science. 
Trends in ecology & evolution, 24(9):467?471. Jason Smith, Herve Saint-Amand, Magdalena Plamad?, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel text from the common crawl. In Proceedings of the Association for Computational Linguis- tics, pages 1374?1383. Matthew G Snover, Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2009. Ter- plus: paraphrase, semantic, and alignment enhancements to translation edit rate. Machine Translation, 23(2):117?127. Rion Snow, Brendan O?Connor, Daniel Jurafsky, and Andrew Y Ng. 2008. Cheap and fast?but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of Empirical Methods in Natural Language Processing. 252 Felix Soldner, Ver?nica P?rez-Rosas, and Rada Mihalcea. 2019. Box of lies: Mul- timodal deception detection in dialogues. In Conference of the North American Chapter of the Association for Computational Linguistics. Matthias Sperber, Graham Neubig, Jan Niehues, and Alex Waibel. 2017. Neural lattice-to-sequence models for uncertain inputs. In Proceedings of the Association for Computational Linguistics. Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer. 2019. Evaluating gen- der bias in machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1679?1684, Florence, Italy. Association for Computational Linguistics. Dario Stojanovski and Alexander Fraser. 2018. Coreference and coherence in neu- ral machine translation: A study using oracle experiments. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 49?60, Brussels, Belgium. Association for Computational Linguistics. Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer. 2000. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational linguistics, 26(3):339?373. Siddharth Suri, Daniel G. Goldstein, and Winter A. Mason. 2011. Honesty in an online labor market. In Proceedings of the 11th AAAI Conference on Human Computation, AAAIWS?11-11, page 61?66. AAAI Press. Lyn M. Van Swol, Deepak Malhotra, and Michael T. Braun. 2012. Deception and its detection: Effects of monetary incentives and personal relationship history. Communication Research, 39(2):217?238. Jennifer EF Teitcher, Walter O Bockting, Jos? A Bauermeister, Chris J Hoefer, Michael H Miner, and Robert L Klitzman. 2015. Detecting, preventing, and responding to ?fraudsters? in internet research: ethics and tradeoffs. Journal of Law, Medicine & Ethics, 43(1):116?133. Simone Teufel and Marc Moens. 2002. Summarizing scientific articles: experiments with relevance and rhetorical status. Computational linguistics, 28(4):409?445. Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. arXiv preprint cs/0607062. James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018a. Fever: a large-scale dataset for fact extraction and verification. In Confer- ence of the North American Chapter of the Association for Computational Lin- guistics. 253 James Thorne, Andreas Vlachos, Oana Cocarascu, Christos Christodoulopoulos, and Arpit Mittal, editors. 2018b. Proceedings of the First Workshop on Fact Ex- traction and VERification (FEVER). Association for Computational Linguistics. J?rg Tiedemann and Yves Scherrer. 2017. 
Neural Machine Translation with Ex- tended Context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82?92. Catalina L Toma and Jeffrey T Hancock. 2012. What lies beneath: The linguistic traces of deception in online dating profiles. Journal of Communication, 62(1):78? 97. Isabelle Torrance. 2015. Distorted oaths in Aeschylus. Illinois Classical Studies, 40(2):281?295. A. M. Turing. 1950. Computing Machinery and Intelligence. Mind, LIX(236):433? 460. Peter D Turney. 2008. A uniform approach to analogies, synonyms, antonyms, and associations. arXiv preprint arXiv:0809.0124. Maria Tymoczko. 2006. Translation: Ethics, ideology, action. The Massachusetts Review, 47(3):442?461. Shyam Upadhyay, Manaal Faruqui, Gokhan T?r, Hakkani-T?r Dilek, and Larry Heck. 2018. (almost) zero-shot cross-lingual spoken language understanding. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6034?6038. IEEE. Donna Vakharia and Matthew Lease. 2015. Beyond mechanical turk: An analysis of paid crowd work platforms. In iConference. Svitlana Vakulenko, Shayne Longpre, Zhucheng Tu, and Raviteja Anantha. 2020. A wrong answer or a wrong question? an intricate relationship between question reformulation and answer selection in conversational question answering. In Pro- ceedings of the 5th International Workshop on Search-Oriented Conversational AI (SCAI), pages 7?16, Online. Association for Computational Linguistics. D?niel Varga, Andr?s Kornai, Viktor Nagy, L?szl? N?meth, and Viktor Tr?n. 2007. Parallel corpora for medium density languages. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of Advances in Neural Information Processing Systems. Tony Veale. 2016. Round up the usual suspects: Knowledge-based metaphor gener- ation. In Proceedings of the Fourth Workshop on Metaphor in NLP, pages 34?41. 254 Jean-Paul Vinay and Jean Darbelnet. 1995. Comparative stylistics of French and English: A methodology for translation, volume 11. John Benjamins Publishing. Elena Voita, Rico Sennrich, and Ivan Titov. 2019. When a Good Translation is Wrong in Context: Context-Aware Machine Translation Improves on Deixis, El- lipsis, and Lexical Cohesion. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1198?1212, Florence, Italy. As- sociation for Computational Linguistics. Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-Aware Neural Machine Translation Learns Anaphora Resolution. In Proceedings of the Association for Computational Linguistics, pages 1264?1274, Melbourne, Aus- tralia. Ellen M Voorhees et al. 1999. The trec-8 question answering track report. In Trec, volume 99, pages 77?82. Denny Vrande?i? and Markus Kr?tzsch. 2014. Wikidata: a free collaborative knowl- edgebase. Communications of the ACM, 57(10):78?85. Maja Vukovic and Claudio Bartolini. 2010. Towards a research agenda for enterprise crowdsourcing. In International Symposium On Leveraging Applications of Formal Methods, Verification and Validation, pages 425?434. Springer. Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019a. Universal adversarial triggers for attacking and analyzing nlp. In Proceedings of Empirical Methods in Natural Language Processing. Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019b. 
Eric Wallace, Pedro Rodriguez, Shi Feng, Ikuya Yamada, and Jordan Boyd-Graber. 2019b. Trick me if you can: Human-in-the-loop generation of adversarial examples for question answering. Transactions of the Association for Computational Linguistics, 7:387–401.
Eric Wallace, Tony Z. Zhao, Shi Feng, and Sameer Singh. 2021. Concealed data poisoning attacks on nlp models. In Conference of the North American Chapter of the Association for Computational Linguistics.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019a. Superglue: A stickier benchmark for general-purpose language understanding systems. In Proceedings of Advances in Neural Information Processing Systems.
Xiaosen Wang, Hao Jin, and Kun He. 2019b. Natural language adversarial attacks and defenses in word level. arXiv preprint arXiv:1909.06723.
Zhuoran Wang and Oliver Lemon. 2013. A simple and generic belief tracking mechanism for the dialog state tracking challenge: On the believability of observed information. In Proceedings of the SIGDIAL 2013 Conference, pages 423–432.
Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on twitter. In NLP+CSS@EMNLP.
Wei Wei, Quoc Le, Andrew Dai, and Jia Li. 2018. Airdialogue: An environment for goal-oriented dialogue research. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3844–3854.
Bruce D Weinstein. 1993. What is an expert? Theoretical Medicine, 14(1):57–73.
Joseph Weizenbaum. 1966. Eliza – a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45.
Frank Wessel and Hermann Ney. 2004. Unsupervised training of acoustic models for large vocabulary continuous speech recognition. IEEE Transactions on Speech and Audio Processing.
Mark E Whiting, Grant Hugh, and Michael S Bernstein. 2019. Fair work: Crowd work minimum wage with one line of code. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 197–206.
Jason Williams, Antoine Raux, and Matthew Henderson. 2016. The dialog state tracking challenge series: A review. Dialogue & Discourse, 7(3):4–33.
Terry Winograd. 1972. Understanding natural language. Cognitive Psychology, 3(1):1–191.
Marty J Wolf, Keith W Miller, and Frances S Grodzinsky. 2017. Why we should have seen that coming: comments on Microsoft's Tay "experiment," and wider implications. The ORBIT Journal, 1(2):1–12.
Stephen M Wolfson and Matthew Lease. 2011. Look before you leap: Legal pitfalls of crowdsourcing. Proceedings of the American Society for Information Science and Technology, 48(1):1–10.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Gregory S. Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Ikuya Yamada, Ryuji Tamaki, Hiroyuki Shindo, and Yoshiyasu Takefuji. 2018. Studio Ousia's quiz bowl question answering system. In NIPS Competition: Building Intelligent Systems, pages 181–194.
Kaiyu Yang, Klint Qinami, Li Fei-Fei, Jia Deng, and Olga Russakovsky. 2020. Towards fairer datasets: Filtering and balancing the distribution of the people subtree in the imagenet hierarchy. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 547–558.
Dani Yogatama, Cyprien de Masson d'Autume, Jerome Connor, Tomáš Kočiský, Mike Chrzanowski, Lingpeng Kong, Angeliki Lazaridou, Wang Ling, Lei Yu, Chris Dyer, and Phil Blunsom. 2019. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373.
Omar Zaidan and Chris Callison-Burch. 2011. Crowdsourcing translation: Professional quality from non-professionals. In Proceedings of the Association for Computational Linguistics, pages 1220–1229.
Justine Zhang, Ravi Kumar, Sujith Ravi, and Cristian Danescu-Niculescu-Mizil. 2016. Conversational flow in oxford-style debates. In Conference of the North American Chapter of the Association for Computational Linguistics.
Ruiqiang Zhang, Genichiro Kikui, Hirofumi Yamamoto, Frank K Soong, Taro Watanabe, and Wai-Kit Lo. 2004. A unified approach in speech-to-speech translation: integrating features of speech recognition and machine translation. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 1168–1174.
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
Zhi-Hua Zhou and Xu-Ying Liu. 2005. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Transactions on Knowledge and Data Engineering, 18(1):63–77.
Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A benchmarking platform for text generation models. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1097–1100.
Geoffrey Zweig, Olivier Siohan, George Saon, Bhuvana Ramabhadran, Daniel Povey, Lidia Mangu, and Brian Kingsbury. 2006. Automated quality monitoring for call centers using speech and nlp technologies. In Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Demonstrations, pages 292–295.