ABSTRACT

Title of thesis: Discourse-Level Language Understanding with Deep Learning
Mohit Iyyer, Doctor of Philosophy, 2017
Thesis directed by: Professor Jordan Boyd-Graber, Computer Science, University of Maryland, and Professor Hal Daumé III, Computer Science, University of Maryland

Designing computational models that can understand language at a human level is a foundational goal in the field of natural language processing (NLP). Given a sentence, machines are capable of translating it into many different languages, generating a corresponding syntactic parse tree, marking words that refer to people or places, and much more. These tasks are solved by statistical machine learning algorithms, which leverage patterns in large datasets to build predictive models. Many recent advances in NLP are due to deep learning models (parameterized as neural networks), which bypass user-specified features in favor of building representations of language directly from the text. Despite many deep learning-fueled advances at the word and sentence level, however, computers still struggle to understand high-level discourse structure in language, or the way in which authors combine and order different units of text (e.g., sentences, paragraphs, chapters) to express a coherent message or narrative. Part of the reason is data-related, as there are few existing datasets for contextual language-based problems, and some tasks are too complex to be framed as supervised learning problems; for the latter type, we must either resort to unsupervised learning or devise training objectives that simulate the supervised setting. Another reason is architectural: neural networks designed for sentence-level tasks require additional functionality, interpretability, and efficiency to operate at the discourse level. In this thesis, I design deep learning architectures for three NLP tasks that require integrating information across high-level linguistic context: question answering, fictional relationship understanding, and comic book narrative modeling. While these tasks are very different from each other on the surface, I show that similar neural network modules can be used in each case to form contextual representations. I conclude by discussing potential avenues for future research that seeks to understand increasingly large and complex context.

DISCOURSE-LEVEL LANGUAGE UNDERSTANDING WITH DEEP LEARNING

by Mohit Iyyer

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2017

Advisory Committee: Professor Jordan Boyd-Graber (Chair, Co-Advisor), Professor Hal Daumé III (Co-Advisor), Professor Marine Carpuat, Professor Kyunghyun Cho, Professor Philip Resnik (Dean's Representative)

© Copyright by Mohit Iyyer 2017

Acknowledgments

First and foremost, I would like to thank my two advisors, Jordan Boyd-Graber and Hal Daumé III, for all of their guidance throughout my time at UMD. In my first and second year when I didn't know anything, I really appreciated Jordan's patience in answering my constant stream of naïve questions. For my first conference submission, he stayed in the lab until the 3AM deadline to help me edit and polish it; the final product would have been an incoherent mess if it weren't for him. He also wrote thousands of lines of code for our initial quiz bowl project, and his dedication to constantly improving the system motivated me to work harder as well.
Similarly, Hal significantly improved all of my papers and talks with his comments and edits, and I greatly enjoyed our long meetings (always scheduled for "a few minutes") that would invariably go off into tangents. Most of all, I want to thank both Jordan and Hal for the freedom they allowed me to pursue my own ideas; even when it became clear that a particular research direction was hopeless (which happened quite often!), they'd let me explore it until I was content; despite the inevitable failures, these experiences were invaluable in building both my scientific knowledge and my understanding of how to do research. I can only hope to become as great of an advisor to my own students as they've been to me.

I'd also like to thank the other members of my committee, Philip Resnik, Marine Carpuat, and Kyunghyun Cho, for their insightful comments and questions that helped me improve this document. Philip deserves special mention for his timely criticism on many of my papers and presentations, which would have otherwise been mind-numbingly dull, as does Naomi Feldman (and other members of the UMD Language Science community) for our many educational conversations.

I'd also like to thank my supervisors from two internships: Richard Socher (MetaMind), Scott Yih (Microsoft Research), and Ming-Wei Chang (Microsoft Research). In addition to being a great supervisor and collaborator, Richard's first paper on recursive neural networks inspired me to start working on deep learning, and his prompt answers to my numerous nagging questions were instrumental in helping me learn, especially since no one else at UMD was working on deep learning at that time. Similarly, Scott and Ming-Wei taught me to look more critically at research problems: behind the flashy method proposed in the latest paper there is often a simple hack that is responsible for much of the method's performance.

Of course, I couldn't have completed this journey without the other students in the CLIP lab: Jordan's senior students, Viet-An Nguyen, Yuening Hu, and Ke Zhai, who helped me tremendously when I first joined the lab (I will always remember our inadvertent hike up the Green Mountain in Boulder wearing jeans); Peter Enns and Andy Garron (our antics made coursework and day-to-day life in the lab tolerable); my collaborators, co-authors and friends Snigdha Chaturvedi, Yogarshi Vyas, He He, and Leonardo Claudino; Alvin Grissom II and our millions of lunches at Japanese restaurants; Thang Nguyen and our futile efforts to understand deep learning theory; Anupam Guha and countless rants about politics, television, and modern art; Varun Manjunatha, my main student collaborator over the years, and the countless late nights we wasted in Board & Brew working on projects going nowhere (it was worth it for the few that succeeded!); and other students in the lab and elsewhere: Sudha Rao, Bharat Singh, John Wieting, Kun Wang (and the rest of our basketball team!), Kiante Brantley, Weiwei Yang, Ahmed Elgohary, Pedro Rodriguez, Fenfei Guo, Austin Myers, Rashmi Sankepally, John Morgan, and Jyothi Krishnamurthy.

Thanks to all of my childhood and college friends, who have helped me keep my sanity during the grind of graduate school, and finally, to my family: my loving parents, who continue to encourage me to pursue my dreams, and my sister, who will always remain a beacon of reason in the sea of my ignorance (her words, not mine).

Table of Contents

Acknowledgments
List of Figures
1 Introduction
1.1 Overview
1.2 Roadmap
1.2.1 Question Answering from Paragraphs and Conversations
1.2.2 Understanding Dynamic Fictional Relationships in Novels
1.2.3 Making Connective Inferences in Comic Books
2 Background: Deep Learning for Discourse-Level NLP
2.1 Representing Words with Vectors
2.2 Simple Composition Functions: Neural Bag-of-Words
2.2.1 Training Models for Text Classification
2.3 Deep NBOW Models
2.4 Recurrent Neural Networks
2.4.1 Tree-Structured Variants
2.4.1.1 Dependency Tree RecNNs
2.5 Discourse-Level Representation Learning
2.5.1 Discourse-Level Understanding
2.5.2 Deep Discourse-Level Representation Learning
3 Deep Learning for Multi-Sentence Factoid Question Answering
3.1 Quiz Bowl: Factoid QA
3.1.1 Matching Text to Entities: Quiz Bowl
3.1.2 Negative Sampling Instead of Softmax
3.1.3 Experiments
3.1.3.1 Datasets
3.1.4 Baselines
3.1.5 DTreeNN Configurations
3.1.6 Human Comparison
3.1.7 Discussion
3.1.7.1 Experimental Results
3.1.7.2 Where the Attribute Space Helps Answer Questions
3.1.7.3 Where all Models Struggle
3.1.7.4 Examining the Attribute Space
3.2 Simpler Quiz Bowl Models
3.2.1 DANs for Quiz Bowl
3.2.1.1 Word Dropout Improves Robustness
3.3 Conclusion
4 Sequential Semantic Parsing: Integrating Context with Structured Prediction
4.1 Motivating Contextual Understanding for Semantic Parsing
4.2 A Dataset of Question Sequences
4.2.1 Properties of SQA
4.3 Dynamic Neural Semantic Parsing
4.3.1 Semantic parse language
4.3.2 Model formulation
4.3.3 Policy function
4.3.4 Model learning
4.4 Experiments
4.4.1 DynSP implementation details
4.4.2 Results & Analysis
4.5 Related Work
4.6 Conclusion & Future Work
5 Dynamically Modeling Fictional Relationships
5.1 Motivation
5.2 A Dataset of Character Interactions
5.3 Relationship Modeling Networks
5.3.1 Formalizing the Problem
5.3.2 Model Description
5.3.2.1 Modeling Spans with Vector Averages
5.3.2.2 Approximating Spans with Relationship Descriptors
5.3.2.3 Computing Weights over Descriptors
5.3.2.4 Interpreting Descriptors and Enforcing Uniqueness
5.4 Evaluating Descriptors and Trajectories
5.4.1 Topic Model Baselines
5.4.2 Experimental Settings
5.4.3 Do Descriptors Make Sense?
5.4.4 Do Trajectories Make Sense?
5.4.5 What Makes a Relationship Positive?
5.5 Qualitative Analysis
5.6 Related Work
5.7 Conclusion
6 Understanding Panel-to-Panel Inferences in Comic Books
6.1 Motivation
6.2 Creating a dataset of comic books
6.2.1 Where do our comics come from?
6.2.2 Breaking comics into their basic elements
6.2.3 OCR
6.3 OCR Post-Processing and Advertisement Removal
6.4 Examples from Dataset Creation
6.5 Data Analysis
6.6 Tasks that test closure
6.6.1 Task Difficulty
6.7 Models & Experiments
6.7.1 Model definitions
6.8 Error Analysis
6.9 Related Work
6.10 Conclusion & Future Work
7 Conclusion
7.1 Contextual Question Answering
7.1.1 Future Directions for QA
7.2 Comprehending Novels and Comics
7.2.1 Future Directions in Creative Understanding
Bibliography

List of Figures

2.1 Two dimensional visualization of word embeddings. On the left side, words associated with numbers or counting are clustered together, while on the right side we see an occupation cluster. (Credit: Christopher Olah)
2.2 Three different neural network architectures used as composition functions for the phrase "so-called climate change". Top: neural bag-of-words; Middle: recurrent neural network; Bottom: tree-structured neural network.
2.3 Dependency parse tree of the opening sentence of Virginia Woolf's Mrs. Dalloway.
3.1 An example quiz bowl question about the Holy Roman Empire. The first sentence contains no words or named entities that by themselves are indicative of the answer, while subsequent sentences contain more and more obvious clues.
3.2 A question on the play "No Exit" with human buzz position marked as 3. Since the buzz occurs in the middle of the second sentence, our model is only allowed to see the first sentence.
3.3 Comparisons of QANTA+IR-WIKI to human quiz bowl players. Each bar represents an individual human, and the bar height corresponds to the difference between the model score and the human score. Bars are ordered by human skill. Red bars indicate that the human is winning, while blue bars indicate that the model is winning. QANTA+IR-WIKI outperforms most humans on history questions but fails to defeat the "average" human on literature questions.
3.4 t-SNE 2-D projections of 451 answer vectors divided into six major clusters. The blue cluster is predominantly populated by U.S. presidents. The zoomed plot reveals temporal clustering among the presidents based on the years they spent in office.
3.5 A question on the German novelist Thomas Mann that contains no named entities, along with the five top answers as scored by QANTA. Each cell in the heatmap corresponds to the score (inner product) between a node in the parse tree and the given answer, and the dependency parse of the sentence is shown on the left. All of our baselines, including IR-WIKI, are wrong, while QANTA uses the plot description to make a correct guess.
3.6 An extremely misleading question about John Cabot, at least to computer models. The words muslim and mecca lead to three Mughal emperors in the top five guesses from QANTA; other models are similarly led awry.
3.7 Randomly dropping out 30% of words from the vector average is optimal for the quiz bowl task, yielding a gain in absolute accuracy of almost 3% on the quiz bowl question dataset compared to the same model trained with no word dropout.
4.1 An example question sequence created from a compositional question intent. Workers must write questions whose answers are subsets of cells in the table.
4.2 Possible action transitions based on their types (see Table 4.3.2). Shaded circles are end states.
4.3 Parses computed by DynSP for three test sequences (actions in blue boxes, values from table in white boxes). Top: all three questions are parsed correctly. Middle: semantic matching errors cause the model to select incorrect columns and conditions. Bottom: The final question is unanswerable due to limitations of our parse language.
5.1 An example trajectory depicting the dynamic relationship between Lucy and Arthur in Bram Stoker's Dracula, which starts with love and ends with Arthur killing the vampiric Lucy. Each column describes the relationship state at a particular time by weights over a set of descriptors (larger weights shown as bigger boxes). Our goal is to learn—without supervision—both the descriptors and the trajectories from raw fictional texts.
5.2 An example of the RMN's computations at a single time step. The model approximates the vector average of an input span (vst) as a linear combination of descriptors from R. The descriptor weights dt define the relationship state at each time step and—when viewed as a sequence—form a relationship trajectory.
5.3 Model precision results from our word intrusion task. The RMN learns more interpretable descriptors than three topic model baselines.
5.4 An example from the Crowdflower summary matching task; workers are asked to choose the trajectory (here, "A" is generated by the RMN and "B" by the HTMM) that best matches a provided summary that describes the relationship between Siddartha and Govinda (from Siddartha by Hesse).
5.5 Left: the RMN is able to model Arthur and Lucy's trajectory reasonably well compared to our manually-created version in Figure 5.1. Middle: both models agree on event-based descriptors such as food and sex. Right: a failure case for the RMN in which it is unable to learn that Lucie Manette and Charles Darnay are in love.
5.6 Clusters from PCA visualizations of the RMN's learned book (left) and character (right) embeddings. We see a cluster of books about war and violence (many of which are authored by Tom Clancy) as well as a cluster of lead female characters from primarily romance novels. These visualizations show that the RMN can recover useful static representations of characters and books in addition to the dynamic relationship trajectories.
6.1 Where did the snake in the last panel come from? Why is it biting the man? Is the man in the second panel the same as the man in the first panel? To answer these questions, readers form a larger meaning out of the narration boxes, speech bubbles, and artwork by applying closure across panels.
6.2 Different artistic renderings of lions taken from the COMICS dataset. The left-facing lions are more cartoonish (and humorous) than the ones facing right, which come from action and adventure comics that rely on realism to provide thrills.
6.3 An advertisement from the dataset. The juxtaposition of text and image causes it to slightly resemble a comics page.
6.4 A minor OCR error. Mistakes such as predicting "BG" for "BIG" are understandable, since the 'I' in "BIG" is barely visible. Similarly, the "IC" in "QUICKLY" looks a lot like "K" in this font. Finally, "SUB STANCE" is predicted rather than "SUBSTANCE", due to an end-of-line word break.
6.5 A major OCR error. In part a) of the figure, note the location of the panel in the page. b) gives us the panel as predicted by the RCNN, but a critical portion of the text is missing. As a consequence, the textbox extraction is also faulty, rendering the OCR completely meaningless.
6.6 Five example panel sequences from COMICS, one for each type of interpanel transition. Individual panel borders are color-coded to match their intrapanel categories (legend in bottom-left). Moment-to-moment transitions unfold like frames in a movie, while scene-to-scene transitions are loosely strung together by narrative boxes. Percentages are the relative prevalence of the transition or panel type in an annotated subset of COMICS.
6.7 In the character coherence task (top), a model must order the dialogues in the final panel, while visual cloze (bottom) requires choosing the image of the panel that follows the given context. For visualization purposes, we show the original context panels; during model training and evaluation, textboxes are blacked out in every panel.
6.8 The image-text architecture applied to an instance of the text cloze task. Pretrained image features are combined with learned text features in a hierarchical LSTM architecture to form a context representation, which is then used to score text candidates.
6.9 Three text cloze examples from the development set, shown with a single panel of context (boxed candidates are predictions by the text-image model). The airplane artwork in the top row helps the image-text model choose the correct answer, while the text-only model fails because the dialogue lacks contextual information. Conversely, the bottom two rows show the image-text model ignoring the context in favor of choosing a candidate that mentions something visually present in the last panel.

Chapter 1
Introduction

1.1 Overview

Designing computational models that can understand language at a human level is a foundational goal in the field of natural language processing (NLP). Given a sentence, machines are now capable of translating it into many different languages, generating a corresponding syntactic parse tree, and marking words within the sentence that refer to people or places. These tasks are solved by statistical machine learning (ML) algorithms, which leverage patterns in large datasets to build predictive models. Many recent advances in NLP are due to deep learning models, which distinguish themselves from other ML models by building representations of language directly from the text instead of relying on user-specified features. My thesis explores deep learning for language-based tasks that go beyond the sentence level, requiring understanding of both immediate and high-level context to solve.

Deep learning refers to a family of models called deep neural networks that pass their input through multiple layers, each of which performs a nonlinear transformation. In the common training paradigm of supervised learning, networks learn from large datasets of input/output pairs; for example, consider the task of machine translation, where neural networks reach state-of-the-art performance when trained on parallel corpora containing millions of sentence-to-sentence translations (e.g., English-to-French). The deeper and larger the network, the more labeled examples it needs to generalize. Many low-level tasks have large annotated datasets that were created decades ago (e.g., Europarl, OntoNotes, Penn Treebank), making them popular choices for deep learning researchers.

Despite these advances at the word and sentence level, however, computers still struggle to understand high-level discourse structure in language, or the way in which authors combine and order different units of text (e.g., sentences, paragraphs, chapters) to express a coherent message or narrative.
Part of the reason is data-related, as there are no existing datasets for many contextual language-based problems, and some tasks are too complex to be framed as supervised learning problems. For the latter type, we must either resort to unsupervised learning or devise training objectives that simulate the supervised setting.

1.2 Roadmap

In this thesis, I design deep learning architectures for three different NLP tasks that require integrating information across high-level linguistic contexts: question answering, fictional relationship understanding, and comic book narrative modeling. On the surface these tasks are very different from one another, and if we were to use a traditional feature-engineering based approach to solve them, each task's feature set would be very different as well. However, within the deep learning framework, the neural network architectures designed for these tasks all use the same building blocks; I will go over these basic modules in Chapter 2. The following three chapters, detailed more fully below, correspond to each of the three tasks. Finally, Chapter 7 concludes the thesis by discussing the pros and cons of deep learning methods for large-scale language understanding and also presents future directions for this kind of research.

1.2.1 Question Answering from Paragraphs and Conversations

Computerized question answering (QA) is a diverse subfield of NLP. There are many types of questions that we may want a computer to be able to answer, of which we focus on (1) factoid questions, whose answers are well-known entities (e.g., trivia questions), and (2) logical questions, which require parsing natural language into an intermediate logical representation that is then executed on a database to retrieve the answer. While fundamentally different from one another, both of these types contain instances where discourse understanding is required to answer the question.

Chapter 3 focuses specifically on quiz bowl, a factoid QA trivia game whose questions are similar to those in the TV show "Jeopardy!". In the quiz bowl setting, questions are four to five sentences long, each of which independently identifies the answer. We formulate quiz bowl as a supervised classification problem using a large dataset of question-answer pairs; this setting is most similar to the low-level NLP tasks discussed earlier. Here most of our modeling is at the sentence level, and we look at simple techniques for aggregating information across the entire question.

In Chapter 4, I look at sequential semantic parsing, which requires models to learn how to convert natural language questions to an SQL-like parse language. The answers to these sequential questions are located within corresponding HTML tables from Wikipedia, which act as question-specific databases. Each sequence contains multiple questions that often require reasoning about information from previously-observed context (e.g., "What are all of the countries that participated in the 2012 Olympics? Of those, which won at least two gold medals?"). Unlike the quiz bowl case, we do not have full supervision for this task; thus, we treat it as a structured prediction problem solved using a reward-guided search process reminiscent of reinforcement learning methods.

1.2.2 Understanding Dynamic Fictional Relationships in Novels

Both of the QA tasks involve relatively short contexts, but in many language domains human readers must make sense of large, complex text.
Consider novels, which contain character-centric narratives; relationships between characters develop chapter by chapter, and events in the story often have huge impact on these relationships. Developing computer models for understanding stories is a far cry from sentential NLP tasks as it requires integrating information across huge contexts. Furthermore, the task requires processing dependencies that span thousands of sentences or more, not just a paragraph as in the QA tasks.

In Chapter 5, I design a new deep neural network architecture for temporally modeling fictional relationships across entire books. The network is trained in an unsupervised fashion, as there is no existing annotated dataset for this task and creating one is not possible given the subjective nature of the task. Since evaluation is a challenge without ground-truth annotations, I propose a novel interpretable dictionary layer whose output can be directly analyzed by human evaluators.

1.2.3 Making Connective Inferences in Comic Books

In the real world, humans process many different sources of information, not just text. For example, comic books have panels that contain both visual information (in the form of artwork) and language by way of dialogue. Stories are told in sequences of panels; to understand the action and dialogue in a particular panel, readers must have understood everything that happened in the preceding panels. One interesting aspect of comics is that their artists cannot possibly draw every action of every scene due to the physical constraints of the page. In practice, this limitation manifests itself in adjacent panels that often differ wildly in terms of space and time. Thus, integrating information across comic book panels is more difficult than in video frames, as readers must make many connective inferences to form a coherent story. In Chapter 6, I focus on computationally modeling these inferences by proposing several tasks that test a neural network's ability to predict aspects of future panels given some context.

Chapter 2
Background: Deep Learning for Discourse-Level NLP

In traditional machine learning models for NLP, humans inject substantial prior knowledge of the task into the algorithmic design in the form of features. In text classification problems, a classifier is then trained over these features to predict a given label. As a running example for this chapter, I consider the task of detecting the political ideology of an author (e.g., liberal or conservative, in United States politics) based on a sentence they wrote.[1] For this task, our features could be counts of words or phrases that we deem indicative of political ideology; for example, the phrase small business leans conservative, while climate change leans liberal.

Manually designing features is a fine strategy if we assume that the input sentences are simple. However, this assumption is often not valid; consider the phrase so-called "climate change". It becomes difficult to manually design features that accurately represent the content of more complex constructions such as the scare quotes in this one, which flip the polarity of the phrase from liberal to conservative. Do we want to have an individual feature for so-called "climate change", or perhaps make a separate feature for words in scare quotes? If the latter, how do we detect which quotation marks are used as scare quotes and which are not?

[1] I built a recursive neural network for this task in Iyyer et al. (2014b).
Deep learning removes the human burden of feature engineering by bypassing predefined features (or feature templates) in favor of learned representations from the raw text input. In NLP, this process involves learning a composition function, which tells the model how to combine individual words to obtain useful representations of sentences or larger units of text. In this chapter, I introduce different neural network architectures for performing composition; the models discussed here will be expanded upon in future chapters. Then, I describe how neural models (as well as traditional NLP methods) have previously been used for discourse-level language comprehension, which is the topic of this thesis.

2.1 Representing Words with Vectors

Setting aside deep learning for the moment, let us consider a standard bag-of-unigrams classifier for our ideology problem. Here, we have a separate feature for each word in our vocabulary, which means that each word has a "one-hot" vector representation. The corresponding vector for a given word is of the size of the vocabulary (usually thousands of words) where each dimension k represents the identity of a distinct word w_k. For any w_k, the value at dimension k is 1, while all other dimensions are zeros. A major problem with one-hot vectors is that they cannot by themselves give an idea of the similarity between two words, as all words are equally different from each other since the inner product between the vectors for any two words is zero. Low-dimensional embeddings address this issue by restricting the number of dimensions to a hyperparameter d (usually between 50 and 500) and allowing each dimension to take on a real value, as opposed to a binary zero or one. As shown in Figure 2.1, synonymous and similar words form clusters in the learned vector space when trained to capture word co-occurrence statistics of large datasets. These representations form the basis for the neural network architectures that I discuss in the rest of this chapter.

Figure 2.1: Two dimensional visualization of word embeddings. On the left side, words associated with numbers or counting are clustered together, while on the right side we see an occupation cluster. (Credit: Christopher Olah)
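To make the contrast between one-hot vectors and low-dimensional embeddings concrete, here is a small illustrative Python sketch (not code from the thesis; the dense vectors are invented rather than learned): any two distinct one-hot vectors have zero inner product, while dense embeddings can place related words close together.

```python
import numpy as np

# Toy vocabulary: one-hot vectors vs. low-dimensional embeddings.
# The dense vectors are hand-picked for illustration only; in practice
# they would be learned from co-occurrence statistics (e.g., word2vec).
vocab = ["three", "four", "teacher", "engineer"]

one_hot = np.eye(len(vocab))                      # |V| x |V|
dense = np.array([[0.9, 0.1, 0.0],                # "three"
                  [0.8, 0.2, 0.1],                # "four"
                  [0.1, 0.9, 0.3],                # "teacher"
                  [0.2, 0.8, 0.4]])               # "engineer"

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for w1, w2 in [("three", "four"), ("three", "teacher")]:
    i, j = vocab.index(w1), vocab.index(w2)
    print(w1, w2,
          "one-hot: %.2f" % cosine(one_hot[i], one_hot[j]),   # always 0.00
          "dense: %.2f" % cosine(dense[i], dense[j]))          # higher for similar words
```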
2.2 Simple Composition Functions: Neural Bag-of-Words

A composition function takes as input a sequence of word embeddings X and outputs a single vector z. To be more concrete, we define the word embedding matrix for a given vocabulary V as L, which is of size d × |V|. Each column v_w of L is the embedding for the word w corresponding to that column. The input to a composition function g is then the sequence of word embeddings v_w for w ∈ X, and its output z is not necessarily of dimensionality d.

Figure 2.2: Three different neural network architectures used as composition functions for the phrase "so-called climate change". Top: neural bag-of-words; Middle: recurrent neural network; Bottom: tree-structured neural network.

The simplest composition function is the neural bag-of-words model (NBOW, top of Figure 2.2), which is an element-wise vector arithmetic operation that has no additional parameters beyond L. "Bag-of-words" refers to the fact that the function does not consider the order of the words in X when computing z; this is not the case with the more complex models discussed later in this chapter. Popular choices for NBOW functions include the vector sum or vector average; the latter can be written as

z = g(w \in X) = \frac{1}{|X|} \sum_{w \in X} v_w. \qquad (2.1)

2.2.1 Training Models for Text Classification

How do we train our composition function to learn representations useful for, say, our ideology classification task? We can formalize the task as a binary classification problem, where we map an input sequence of tokens X to one of two labels. Since this formalism implies a supervised learning objective, we must have a dataset of sentences that have been annotated with their authors' ideologies.[2] Our goal is then to learn a composition function that does a good job of predicting these labels.

Using the vector average as our composition function, we first apply Equation 2.1 to X to get a vector z, which will serve as input to a classification layer. Specifically, we apply a logistic regression that takes in z and produces a prediction ŷ. This is a softmax layer, which induces estimated probabilities for each output label:

\hat{y}_p = \mathrm{softmax}(Wz), \qquad (2.2)

where the softmax function is

\mathrm{softmax}(q) = \frac{\exp q}{\sum_{j=1}^{k} \exp q_j} \qquad (2.3)

and W is a 2 × d matrix for our two-label problem.[3] We want the predictions of the softmax layer to match our annotated data; the discrepancy between categorical predictions and annotations is measured through the cross-entropy loss. We optimize the model parameters to minimize the cross-entropy loss over all sentences in the corpus. The cross-entropy loss of a single example X with ground-truth label y is

\ell(\hat{y}) = -\sum_{p=1}^{2} y_p \log(\hat{y}_p). \qquad (2.4)

This induces a supervised objective function over all sentences in the dataset: a regularized sum over all losses normalized by the number of training examples N,

C = \frac{1}{N} \sum_{i}^{N} \ell(\hat{y}_i) + \frac{\lambda}{2} \|\theta\|^2. \qquad (2.5)

The model parameters θ (here the word embedding matrix L and the softmax matrix W) can be optimized using a variety of online or batch methods. Commonly-used algorithms include Adam (Kingma and Ba, 2014) and the diagonal variant of AdaGrad (Duchi et al., 2011). The gradient of the objective, shown in Eq. (2.6), is computed using backpropagation (Rumelhart et al., 1986), which essentially involves repeated application of the chain rule:

\frac{\partial C}{\partial \theta} = \frac{1}{N} \sum_{i}^{N} \frac{\partial \ell(\hat{y}_i)}{\partial \theta} + \lambda\theta. \qquad (2.6)

[2] In Iyyer et al. (2014b) I introduce the Ideological Books Corpus, a dataset for political ideology classification.
[3] Bias terms should be included, but I omit them here for simplicity.
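Before moving on to deeper models, the following minimal NumPy sketch makes Equations 2.1–2.5 concrete for a single training example: average the word embeddings, apply a softmax layer, and compute the cross-entropy loss. All names and toy values are illustrative (this is not the thesis codebase), and the actual parameter update (AdaGrad, Adam) is omitted.

```python
import numpy as np

# Minimal sketch of the NBOW classifier (Eqs. 2.1-2.5): average word
# embeddings, apply a softmax layer, and measure cross-entropy loss.
rng = np.random.RandomState(0)
vocab = {"so-called": 0, "climate": 1, "change": 2, "small": 3, "business": 4}
d, num_labels = 50, 2

L = rng.randn(d, len(vocab)) * 0.01          # word embedding matrix (d x |V|)
W = rng.randn(num_labels, d) * 0.01          # softmax weights (2 x d)

def softmax(q):
    q = q - q.max()                          # numerical stability
    e = np.exp(q)
    return e / e.sum()

def nbow_average(token_ids):
    """Eq. 2.1: element-wise average of the word vectors in the input."""
    return L[:, token_ids].mean(axis=1)

def forward_loss(token_ids, label):
    """Eqs. 2.2-2.4: softmax prediction and cross-entropy loss."""
    z = nbow_average(token_ids)
    y_hat = softmax(W @ z)                   # Eq. 2.2
    loss = -np.log(y_hat[label])             # Eq. 2.4 with a one-hot label
    return loss, y_hat

# one toy "conservative" example: so-called "climate change"
tokens = [vocab["so-called"], vocab["climate"], vocab["change"]]
loss, y_hat = forward_loss(tokens, label=1)
print(f"loss={loss:.3f}, predicted distribution={y_hat}")
```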
2.3 Deep NBOW Models

The NBOW is a linear model; it does not have any nonlinear transformations between its input and the classification layer, which limits its expressivity. The intuition behind deep feed-forward neural networks is that each layer learns a more abstract representation of the input than the previous one (Bengio et al., 2013). We can apply this concept to the NBOW model discussed above with the expectation that each layer will increasingly magnify small but meaningful differences in the word embedding average. To be more concrete, take s1 as the sentence "I am supportive of legislation that aims to reduce global warming" and generate s2 and s3 by replacing "supportive" with "indifferent" and then again by "opposed". The vector averages of these three sentences are almost identical, but the averages associated with the synonymous sentences s1 and s2 are slightly more similar to each other than they are to s3's average. Could adding depth to NBOW make such small distinctions as this one more apparent?

In Equation 2.1, we compute z, the vector representation for input text X, by averaging the word vectors v_w for w ∈ X. Instead of directly passing this representation to an output layer, we can further transform z by adding more layers before applying the softmax. Suppose we have n layers, z_{1...n}. We compute each layer

z_i = f(W_i \cdot z_{i-1}) \qquad (2.7)

and feed the final layer's representation, z_n, to a softmax layer for prediction. f here is an element-wise nonlinearity such as tanh. This model, which I call the deep averaging network (DAN), is still a bag-of-words model, but its depth allows it to capture subtle variations in the input better than the standard NBOW model.[4] Furthermore, computing each layer requires just a single matrix multiplication, so the complexity scales with the number of layers instead of the number of words in the sentence as in more complex models. In practice, we find no significant difference between the training time of a DAN and that of the shallow NBOW model.

[4] This statement is corroborated by experimental findings in Iyyer et al. (2015).

2.4 Recurrent Neural Networks

Now that we know how to train simple composition functions, we move to models that consider the word order and syntactic structure of their input. We will start with the recurrent neural network, or RNN. Just like before, feeding a sequence of text X into an RNN yields a vector that represents the sequence; we will call this vector a hidden state, or h_n. A softmax layer over h_n predicts the output label. RNNs read input sentences from left to right, which means that they compute a hidden state at every word in the input. The computation of a hidden state h_t depends on both the current input embedding v_{w_t} and the previous hidden state h_{t-1}. Formally, RNNs are defined as

h_t = f\left(W \begin{bmatrix} v_{w_t} \\ h_{t-1} \end{bmatrix}\right), \qquad (2.8)

where, like before, f is an element-wise nonlinearity such as tanh. The model is then trained just like before, except here we have an additional model parameter (the transition matrix W), and the gradients are computed through a variant of backpropagation called backpropagation through time (Werbos, 1990).

RNNs can suffer from the vanishing gradient problem, which refers to the phenomenon of gradients from earlier time steps having much lower magnitudes than those at more recent ones. This issue arises from repeated applications of the weight matrix W in Equation 2.8, which is proven to result in vanishing gradients if the largest eigenvalue of W is less than 1 (Pascanu et al., 2013). Multiple variants have been proposed to combat this problem, the most popular of which are the long short-term memory (Hochreiter and Schmidhuber, 1997, LSTM) and the gated recurrent unit (Cho et al., 2014, GRU). I will return to LSTM units in Chapter 6.
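Equation 2.7 is just the repeated application of z_i = f(W_i · z_{i-1}) to the averaged representation; the recurrence in Equation 2.8 is only slightly more involved because the transition matrix multiplies the concatenation of the current word vector and the previous hidden state. The sketch below spells that out with illustrative dimensions and random parameters; it is not the thesis implementation.

```python
import numpy as np

# Minimal sketch of the vanilla RNN recurrence in Eq. 2.8: the transition
# matrix W multiplies the concatenation of the current word vector and the
# previous hidden state. Dimensions and initialization are toy choices.
rng = np.random.RandomState(0)
d, hidden = 50, 64

W = rng.randn(hidden, d + hidden) * 0.01   # transition matrix

def rnn_forward(word_vectors):
    """Run the recurrence left to right and return the final hidden state h_n."""
    h = np.zeros(hidden)
    for v_wt in word_vectors:
        h = np.tanh(W @ np.concatenate([v_wt, h]))   # Eq. 2.8
    return h

# three random "word embeddings" standing in for a three-word sentence
sentence = [rng.randn(d) for _ in range(3)]
h_n = rnn_forward(sentence)
print(h_n.shape)   # (64,) -- this vector would feed a softmax layer
```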
2.4.1 Tree-Structured Variants

Equation 2.8 implies a fixed left-branching tree structure. The equation can, however, be extended to operate on predefined tree structures, as in the case of tree neural networks (TreeNN). These models were first introduced by Pollack (1990) and recently repopularized by Richard Socher and colleagues (Socher et al., 2011b,a). Here, the composition function g depends on a syntactic parse tree of the input sequence. The representation for any internal node in a binary parse tree is computed as a nonlinear function of the representations of its children. A more powerful TreeNN variant is the recursive neural tensor network (RecNTN), which modifies g to include a costly tensor product (Socher et al., 2013).

Given a binary constituency parse tree of a sentence, the leaves of the tree will be the words of the sentence w_{1...n}, where each word is associated with a vector v_w ∈ L. The parse tree defines how these words form phrases (bottom of Figure 2.2). Each of these phrases p also has an associated vector x_p ∈ R^d of a user-specified dimensionality. These phrase vectors should represent the meaning of the phrases composed of individual words. As phrases themselves merge into complete sentences, the underlying vector representation is trained to retain the sentence's whole meaning. The challenge is to describe how vectors combine to form complete representations. If two words a and b merge to form phrase p, we posit that the phrase-level vector is

x_p = f\left(W \begin{bmatrix} v_a \\ v_b \end{bmatrix}\right), \qquad (2.9)

where W is a d × 2d composition matrix shared across all nodes in the tree.

The "recursive" aspect of the TreeNN lies in its use of weight sharing. The composition matrix is repeatedly applied to nodes in the parse tree in a bottom-up fashion. Thus, once x_p is computed for a non-root node p, it will be fed into Eq. (2.9) again as part of the computation for its parent's phrase vector x_{par}:

x_{par} = f\left(W \begin{bmatrix} v_c \\ x_p \end{bmatrix}\right). \qquad (2.10)

Figure 2.3: Dependency parse tree of the opening sentence of Virginia Woolf's Mrs. Dalloway ("Mrs. Dalloway said she would buy the flowers herself").

2.4.1.1 Dependency Tree RecNNs

TreeNNs are not limited to only binary tree structures; in fact, dependency tree neural networks (DTreeNN) have shown improved performance on image-to-text mapping and question answering tasks (Socher et al., 2014; Iyyer et al., 2014a) compared to constituency tree TreeNNs. Because every node of a dependency parse tree is associated with a word, it has fewer nodes in total than a corresponding constituency tree, which reduces vanishing gradient issues. However, internal nodes of dependency parse trees are unbounded in the number of children they can have, which makes weight sharing more complicated than in the constituency case. The solution is to "untie" the composition matrices by dependency relations, which allows us to inject more syntactic information into the model.

Untying the model in this way requires more than one composition matrix, unlike the binary tree case. In particular, we associate a separate d × d matrix W_r with each dependency relation r in our dataset and learn these matrices during training. Syntactically untying these matrices allows the model to take advantage of relation identity as well as tree structure. We include an additional d × d matrix, W_v, to incorporate the word vector v_w at a node n into the hidden vector h_n.

Given a dependency parse tree, we first compute hidden representations for the leaf nodes. For example, the hidden representation h_{mrs.} for the parse tree given in Figure 2.3 is

h_{mrs.} = f(W_v \cdot v_{mrs.}). \qquad (2.11)

After finishing with the leaves, we move to interior nodes whose children have already been processed. Continuing from mrs. to its parent, dalloway, we compute

h_{dalloway} = f(W_{NN} \cdot h_{mrs.} + W_v \cdot v_{dalloway}). \qquad (2.12)

We repeat this process up to the root, which is

h_{said} = f(W_{NSUBJ} \cdot h_{dalloway} + W_{CCOMP} \cdot h_{buy} + W_v \cdot v_{said}). \qquad (2.13)

The composition equation for any node n with children K(n) and word vector v_w is

h_n = f\left(W_v \cdot v_w + b_1 + \sum_{k \in K(n)} W_{R(n,k)} \cdot h_k\right), \qquad (2.14)

where R(n, k) is the dependency relation between node n and child node k. The model is trained using the same procedure as that for constituency-tree TreeNNs.
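A minimal sketch of the bottom-up DTreeNN composition in Equation 2.14, using (a truncated version of) the parse from Figure 2.3. The relation matrices, word embeddings, and tree encoding here are toy stand-ins chosen for illustration, not the trained model.

```python
import numpy as np

# Sketch of the DTreeNN composition in Eq. 2.14 for part of Figure 2.3.
rng = np.random.RandomState(0)
d = 50

relations = ["NSUBJ", "CCOMP", "NN", "AUX", "DET", "XCOMP"]
W_r = {r: rng.randn(d, d) * 0.01 for r in relations}   # one matrix per relation
W_v = rng.randn(d, d) * 0.01                            # word-vector matrix
b = np.zeros(d)

L_embed = {}                     # lazily created toy embeddings, one per word
def embed(w):
    return L_embed.setdefault(w, rng.randn(d))

# (word, [(relation, child), ...]) for "Mrs. Dalloway said she would buy ..."
tree = ("said", [("NSUBJ", ("dalloway", [("NN", ("mrs.", []))])),
                 ("CCOMP", ("buy", [("NSUBJ", ("she", [])),
                                    ("AUX", ("would", []))]))])

def compose(node):
    """h_n = f(W_v v_w + b + sum_k W_{R(n,k)} h_k), computed bottom-up."""
    word, children = node
    total = W_v @ embed(word) + b
    for rel, child in children:
        total += W_r[rel] @ compose(child)
    return np.tanh(total)

h_root = compose(tree)
print(h_root.shape)   # (50,) -- the representation of the whole sentence
```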
2.5 Discourse-Level Representation Learning

The neural network architectures described above have mainly been applied to sentence-level NLP tasks. The objective of this thesis is to go beyond the sentence to discourse-level language comprehension. In this section, I provide an overview of existing discourse-level language understanding methods, and I also go over more recent deep learning-based developments in this area. The latter discussion will be at a relatively high level, as specific instances of these models will be described more fully in later chapters.

2.5.1 Discourse-Level Understanding

The term discourse refers to any unit of text that consists of more than one sentence. Discourse coherence refers to the logical order of sentences and how each fits with the others to produce a larger understandable meaning. As an example, a paragraph constructed by randomly selecting four sentences from this chapter would almost certainly have low coherence. There are many ways to formalize coherence, such as Hobbs' set of coherence relations (Hobbs, 1979) that categorizes inter-sentence relationships as "explanations", "elaborations", or "results", among others. A popular formalism in the NLP community is rhetorical structure theory (RST), which categorizes units of text (not just sentences) into different roles (e.g., "background", "evidence") that define how they can be combined together in a hierarchical fashion (Mann and Thompson, 1988). The task of discourse parsing (Soricut and Marcu, 2003; Feng and Hirst, 2012; Ji and Eisenstein, 2014) automatically discovers these sorts of relations in text, trained using annotated datasets such as the RST Discourse Treebank (Carlson et al., 2003) or the Penn Discourse Treebank (Miltsakaki et al., 2004). Other research in this vein looks at cross-sentential coreference resolution (van Hoek, 1997).

Aside from coherence, sentence cohesion is also an important property of discourse. Take the task of discourse segmentation (Passonneau and Litman, 1997), which involves detecting boundaries in a given document where the topic of discussion shifts. Instances of lexical cohesion are useful for this task; for example, a repetition of certain words or their synonyms likely indicates that the topic has not changed (Halliday and Hasan, 1976). These types of features are taken into account by methods such as TextTiling (Hearst, 1997), which automatically segment documents into their subtopics. In my thesis, I focus on coherence relations between sentences and larger units of text rather than cohesive ones.

In contrast to traditional discourse research on coherence, the work in my thesis does not focus on discovering specific connections between nearby sentences (e.g., discourse relations). Instead, I focus on learning representations from discourse-level language that are useful for downstream tasks; this is an atheoretical approach that does not depend on a specified discourse formalism. Examples of simple and general discourse-level representations are bag-of-words (or bag-of-ngrams) vectors, or the slightly more involved TF-IDF formulation (Salton and McGill, 1986) popular in information retrieval. However, these representations are constant across all tasks, unlike machine learning models whose parameters are updated using a task-specific error signal. More complex models that perform some sort of matrix factorization, such as supervised latent Dirichlet allocation (SLDA) (Mcauliffe and Blei, 2008), are able to modify their representations from downstream supervision. Finally, research on scripts, or sequenced actions for common situations, is also relevant to the portions of my dissertation research that focus on creative language; Section 5.6 discusses this line of work in more detail.
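For reference, the fixed, task-agnostic representations mentioned above (bag-of-ngrams counts and TF-IDF reweighting) take only a few lines with scikit-learn; the sketch below assumes scikit-learn is installed and uses invented toy documents.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three toy "documents" standing in for discourse-level units of text.
docs = [
    "Lucy and Arthur fall in love early in the novel.",
    "Arthur must later confront the vampiric Lucy.",
    "The Holy Roman Empire was dissolved in 1806.",
]

bow = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)     # bag of n-grams
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)   # TF-IDF reweighting

# Both are sparse matrices with one fixed-length row per document; unlike the
# neural models in this thesis, the representation never adapts to the task.
print(bow.shape, tfidf.shape)
```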
2.5.2 Deep Discourse-Level Representation Learning

Unsupervised deep learning models have shown increased effectiveness over the above non-neural methods in terms of learning general purpose discourse-level representations. Two popular examples are Paragraph Vector (Le and Mikolov, 2014), which applies algorithms from word embedding learning to paragraphs, and recurrent autoencoder-based methods that reconstruct entire documents through bottleneck representations learned by neural networks (Li et al., 2015).

Task-specific discourse-level representations have also seen increased research interest. For example, recurrent / convolutional network hybrids have been used for document-level sentiment analysis (Tang et al., 2015), and some reading comprehension-style question-answering tasks also rely on similar methods for representing passages (Dhingra et al., 2017); these tasks and models are most related to the QA tasks I discuss in Chapters 3 and 4. However, building deep discourse-level models for underspecified tasks, such as the fictional relationship analysis of Chapter 5, steps farther from the current trends in this research area, and interpretability becomes a critical factor for these tasks.

Chapter 3
Deep Learning for Multi-Sentence Factoid Question Answering

Armed with both deep learning fundamentals and descriptions of both existing (RNN) and newly-proposed (DAN) models from the last chapter, we are now prepared to jump into applications of these models.[1] Here we focus on the task of question answering (QA) where input questions are multiple sentences long. How do we apply models like TreeNNs across multiple sentences? Are such complex models even necessary? In this chapter, I demonstrate that a very simple element-wise vector average is effective for a factoid QA task called quiz bowl, which is a trivia game whose questions describe famous entities (e.g., authors, battles, novels). The following chapter describes a more challenging QA task for which averaging is insufficient to capture all of the necessary contextual information, motivating the use of structured prediction methods.

[1] This chapter synthesizes work previously published in Iyyer et al. (2014a, EMNLP) and Iyyer et al. (2015, ACL).

3.1 Quiz Bowl: Factoid QA

Consider factoid question answering: given a description of an entity, identify the person, place, or thing discussed. We describe a task with high-quality mappings from natural language text to entities in Section 3.1.1. This task—quiz bowl—is a challenging natural language problem with large amounts of diverse and compositional data. To answer quiz bowl questions, we develop a dependency tree neural network in Section 3.1.2 and extend it to combine predictions across sentences to produce a question answering neural network with trans-sentential averaging (QANTA). We evaluate our model against strong computer and human baselines in Section 3.1.3 and conclude by examining the latent space and model mistakes.
Later in its existence, this polity's leader was chosen by a group that included three bishops and six laymen, up from the seven who traditionally made the decision. Free imperial cities in this polity included Basel and Speyer. Dissolved in 1806, its key events included the Investiture Controversy and the Golden Bull of 1356. Led by Charles V, Frederick Barbarossa, and Otto I, for 10 points, name this polity, which ruled most of what is now Germany through the Middle Ages and rarely ruled its titular city.

Figure 3.1: An example quiz bowl question about the Holy Roman Empire. The first sentence contains no words or named entities that by themselves are indicative of the answer, while subsequent sentences contain more and more obvious clues.

3.1.1 Matching Text to Entities: Quiz Bowl

Every weekend, hundreds of high school and college students play a game where they map raw text to well-known entities. This is a trivia competition called quiz bowl. Quiz bowl questions consist of four to six sentences and are associated with factoid answers (e.g., history questions ask players to identify specific battles, presidents, or events). Every sentence in a quiz bowl question is guaranteed to contain clues that uniquely identify its answer, even without the context of previous sentences. Players answer at any time—ideally more quickly than the opponent—and are rewarded for correct answers.

Automatic approaches to quiz bowl based on existing NLP techniques are doomed to failure. Quiz bowl questions have a property called pyramidality, which means that sentences early in a question contain harder, more obscure clues, while later sentences are "giveaways". This design rewards players with deep knowledge of a particular subject and thwarts bag of words methods. Sometimes the first sentence contains no named entities—answering the question correctly requires an actual understanding of the sentence (Figure 3.1). Later sentences, however, progressively reveal more well-known and uniquely identifying terms. Previous work answers quiz bowl questions using a bag of words (naïve Bayes) approach (Boyd-Graber et al., 2012). These models fail on sentences like the first one in Figure 3.1, a typical hard, initial clue. Tree-structured neural networks (TreeNNs), in contrast to simpler models, can capture the compositional aspect of such sentences (Hermann et al., 2013).

TreeNNs require many redundant training examples to learn meaningful representations, which in the quiz bowl setting means we need multiple questions about the same answer. Fortunately, hundreds of questions are produced during the school year for quiz bowl competitions, yielding many different examples of questions asking about any entity of note (see Section 3.1.3.1 for more details). Thus, we have built-in redundancy (the number of "askable" entities is limited), but also built-in diversity, as difficult clues cannot appear in every question without becoming well-known.

3.1.2 Negative Sampling Instead of Softmax

In this chapter we use the dependency-tree TreeNN (DTreeNN) that we defined formally in Chapter 2, with one important difference: we no longer use a softmax output layer. Our goal is to map questions to their corresponding answer entities. Because there are a limited number of possible answers, we can view this as a multi-class classification task.
While a softmax layer over every node in the tree could predict answers (Socher et al., 2011b; Iyyer et al., 2014b), this method overlooks that most answers are themselves words (features) in other questions (e.g., a question on World War II might mention the Battle of the Bulge and vice versa). Thus, word vectors associated with such answers can be trained in the same vector space as question text,[2] enabling us to model relationships between answers instead of assuming incorrectly that all answers are independent. To take advantage of this observation, we train both the answers and questions jointly in a single model, rather than training each separately and holding embeddings fixed during DTreeNN training.

Intuitively, we want to encourage the vectors of question sentences to be near their correct answers and far away from incorrect answers. We accomplish this goal by using a contrastive max-margin objective function described below. While we are not interested in obtaining a ranked list of answers,[3] we observe better performance by adding the weighted approximate-rank pairwise (WARP) loss proposed in Weston et al. (2011) to our objective function.

Given a sentence paired with its correct answer c, we randomly select j incorrect answers from the set of all incorrect answers and denote this subset as Z. Since c is part of the vocabulary, it has a vector x_c ∈ L. An incorrect answer z ∈ Z is also associated with a vector x_z ∈ L. We define S to be the set of all nodes in the sentence's dependency tree, where an individual node s ∈ S is associated with the hidden vector h_s. The error for the sentence is

C(S, \theta) = \sum_{s \in S} \sum_{z \in Z} R(\mathrm{rank}(c, s, Z)) \max(0,\, 1 - x_c \cdot h_s + x_z \cdot h_s), \qquad (3.1)

where the function rank(c, s, Z) provides the rank of correct answer c with respect to the incorrect answers Z. We transform this rank into a loss function[4] shown by Usunier et al. (2009) to optimize the top of the ranked list,

R(r) = \sum_{i=1}^{r} 1/i.

Since rank(c, s, Z) is expensive to compute, we approximate it by randomly sampling K incorrect answers until a violation is observed (x_c · h_s < 1 + x_z · h_s) and set rank(c, s, Z) = (|Z| − 1)/K, as in previous work (Weston et al., 2011; Hermann et al., 2014). The model minimizes the sum of the error over all sentences T normalized by the number of nodes N in the training set,

J(\theta) = \frac{1}{N} \sum_{t \in T} C(t, \theta). \qquad (3.2)

The parameters θ = (W_{r∈R}, W_v, W_e, b), where R represents all dependency relations in the data, are optimized using AdaGrad as before (Duchi et al., 2011).[5] In Section 3.1.3 we compare performance to an identical model (FIXED-QANTA) that excludes answer vectors from L and show that training them as part of θ produces significantly better results. The gradient of the objective function,

\frac{\partial J}{\partial \theta} = \frac{1}{N} \sum_{t \in T} \frac{\partial C(t, \theta)}{\partial \theta}, \qquad (3.3)

is again computed using backpropagation through structure (Goller and Kuchler, 1996).

[2] Of course, questions never contain their own answer as part of the text.
[3] In quiz bowl, all wrong guesses are equally detrimental to a team's score, no matter how "close" a guess is to the correct answer.
[4] Adding this loss term to the objective function not only increases performance but also speeds up convergence.
[5] We set the initial learning rate η = 0.05 and reset the squared gradient sum to zero every five epochs.
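The sketch below spells out the per-sentence error of Equation 3.1 together with the sampling-based rank approximation, using random toy vectors in place of the learned answer embeddings and dependency-node hidden vectors; it computes the loss for one sentence only and omits the gradient step.

```python
import numpy as np

# Illustrative sketch of Eq. 3.1 with the sampling-based rank approximation.
# answer_vecs and the node vectors are random stand-ins for the learned
# answer embeddings (columns of L) and the DTreeNN hidden vectors h_s.
rng = np.random.RandomState(0)
d, num_answers = 100, 451
answer_vecs = rng.randn(num_answers, d) * 0.1

def warp_weight(rank):
    """R(r) = sum_{i=1}^{r} 1/i, which emphasizes the top of the ranked list."""
    return sum(1.0 / i for i in range(1, rank + 1))

def approx_rank(h_s, correct_id, wrong_ids, max_samples=50):
    """Sample wrong answers until one violates the margin; rank ~ (|Z|-1)/K."""
    x_c = answer_vecs[correct_id]
    for k in range(1, max_samples + 1):
        x_z = answer_vecs[rng.choice(wrong_ids)]
        if x_c @ h_s < 1 + x_z @ h_s:            # violation observed
            return max(1, (len(wrong_ids) - 1) // k)
    return 1                                      # never violated: treat as rank 1

def sentence_error(node_vecs, correct_id, wrong_ids):
    """C(S, theta): weighted hinge terms over tree nodes and wrong answers."""
    x_c = answer_vecs[correct_id]
    err = 0.0
    for h_s in node_vecs:
        weight = warp_weight(approx_rank(h_s, correct_id, wrong_ids))
        for z in wrong_ids:
            err += weight * max(0.0, 1 - x_c @ h_s + answer_vecs[z] @ h_s)
    return err

nodes = [rng.randn(d) * 0.1 for _ in range(6)]    # hidden vectors of one sentence
wrong = list(range(1, 101))                        # j = 100 sampled wrong answers
print(sentence_error(nodes, correct_id=0, wrong_ids=wrong))
```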
QANTA requires that an input sentence describes an entity without mentioning that entity, a constraint that is not followed by Wikipedia sentences. (We tried transforming Wikipedia sentences into quiz bowl sentences by replacing answer mentions with appropriate descriptors, e.g., “Joseph Heller” with “this author”, but the resulting sentences suffered from a variety of grammatical issues and did not help the final result.) While IR methods can operate over Wikipedia text with no issues, we show that the representations learned by QANTA over just a dataset of question-answer pairs can significantly improve the performance of IR systems.

3.1.3.1 Datasets

We evaluate our algorithms on a corpus of over 100,000 question/answer pairs from two different sources. First, we expand the dataset used in Boyd-Graber et al. (2012) with publicly available questions from quiz bowl tournaments held after that work was published. This gives us 46,842 questions in fourteen different categories. To this dataset we add 65,212 questions from NAQT, an organization that runs quiz bowl tournaments and generously shared with us all of their questions from 1998–2013. Because some categories contain substantially fewer questions than others (e.g., astronomy has only 331 questions), we consider only literature and history questions, as these two categories account for more than 40% of the corpus. This leaves us with 21,041 history questions and 22,956 literature questions.

Data Preparation: To make this problem feasible, we only consider a limited set of the most popular quiz bowl answers. Before we filter out uncommon answers, we first need to map all raw answer strings to a canonical set to get around formatting and redundancy issues. Most quiz bowl answers are written to provide as much information about the entity as possible. For example, the following is the raw answer text of a question on the Chinese leader Sun Yat-sen: Sun Yat-sen; or Sun Yixian; or Sun Wen; or Sun Deming; or Nakayama Sho; or Nagao Takano. Quiz bowl writers vary in how many alternate acceptable answers they provide, which makes it tricky to strip superfluous information from the answers using rule-based approaches. Instead, we use Whoosh (https://pypi.python.org/pypi/Whoosh/), an information retrieval library, to generate features in an active learning classifier that matches existing answer strings to Wikipedia titles. If we are unable to find a match with a high enough confidence score, we throw the question out of our dataset. After this standardization process and manual vetting of the resulting output, we can use the Wikipedia page titles as training labels for the DTreeNN and baseline models (code and non-NAQT data are available at http://cs.umd.edu/~miyyer/qblearn).
The quiz bowl answer distribution has a long tail: 65.6% of answers only occur once or twice in the corpus. We filter out all answers that do not occur at least six times, which leaves us with 451 history answers and 595 literature answers that occur on average twelve times in the corpus. These pruning steps result in 4,460 usable history questions and 5,685 literature questions. While ideally we would have used all answers, our model benefits from many training examples per answer to learn meaningful representations; this issue can possibly be addressed with techniques from zero-shot learning (Palatucci et al., 2009; Pasupat and Liang, 2014), which we leave to future work.
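The frequency cutoff above amounts to a simple filter; as a sketch (over a hypothetical list of (question, answer) pairs produced after answer canonicalization):

from collections import Counter

def prune_rare_answers(qa_pairs, min_count=6):
    """Keep only questions whose canonicalized answer occurs at least min_count times."""
    counts = Counter(answer for _, answer in qa_pairs)
    return [(q, a) for q, a in qa_pairs if counts[a] >= min_count]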
We apply basic named entity recognition (NER) by replacing all occurrences of answers in the question text with single entities (e.g., the two tokens Ernest Hemingway become the single entity Ernest_Hemingway). While we experimented with more advanced NER systems to detect non-answer entities, they could not handle multi-word named entities like the book Love in the Time of Cholera (title case) or battle names (e.g., Battle of Midway). A simple search/replace on all answers in our corpus works better for multi-word entities.
The preprocessed data are split into folds by tournament. We choose the past two national tournaments (selected because NAQT does not reuse any questions or clues within these tournaments) as our test set, as well as questions previously answered by players in Boyd-Graber et al. (2012), and assign all other questions to train and dev sets. History results are reported on a training set of 3,761 questions with 14,217 sentences and a test set of 699 questions with 2,768 sentences. Literature results are reported on a training set of 4,777 questions with 17,972 sentences and a test set of 908 questions with 3,577 sentences.
Finally, we initialize the word embedding matrix W_e with word2vec (Mikolov et al., 2013) trained on the preprocessed question text in our training set; out-of-vocabulary words from the test set are initialized randomly. We use the hierarchical skip-gram model setting with a window size of five words.

3.1.4 Baselines

We pit QANTA against two types of baselines: unordered bag of words models, which enable comparison to a standard NLP baseline, and information retrieval models, which allow us to compare against traditional question answering techniques.

BOW: The BOW baseline is a logistic regression classifier trained on binary unigram indicators. (Raw word counts, frequencies, and TF-IDF weighted features did not increase performance, nor did adding bigrams to the feature set, possibly because multi-word named entities are already collapsed into single words.) This simple discriminative model is an improvement over the generative quiz bowl answering model of Boyd-Graber et al. (2012).

BOW-DT: The BOW-DT baseline is identical to BOW except we augment the feature set with dependency relation indicators. We include this baseline to isolate the effects of the dependency tree structure from our compositional model.

IR-QB: The IR-QB baseline maps questions to answers using the state-of-the-art Whoosh IR engine. The knowledge base for IR-QB consists of “pages” associated with each answer, where each page is the union of training question text for that answer. Given a partial question, the text is first preprocessed using a query language similar to that of Apache Lucene. This processed query is then matched to pages using BM-25 term weighting, and the top-ranked page is considered to be the model's guess. We also incorporate fuzzy queries to catch misspellings and plurals and use Whoosh's built-in query expansion functionality to add related keywords to our queries.

IR-WIKI: The IR-WIKI model is identical to the IR-QB model except that each “page” in its knowledge base also includes all text from the associated answer's Wikipedia article. Since all other baselines and DTreeNN models operate only on the question text, this is not a valid comparison, but we offer it to show that we can improve even this strong model using QANTA.

3.1.5 DTreeNN Configurations

For all DTreeNN models, the vector dimension d and the number of wrong answers per node j are set to 100.
All model parameters other than L are randomly initialized. The nonlinearity f is again the normalized tanh function. (The standard tanh function produced heavy saturation at higher levels of the trees, and corrective weighting as in Socher et al. (2014) hurt our model because named entities that occur as leaves are often more important than non-terminal phrases.)
QANTA is our DTreeNN model with feature averaging across previously seen sentences in a question. To obtain the final answer prediction given a partial question, we first generate a feature representation for each sentence within that partial question. This representation is computed by concatenating together the word embeddings and hidden representations averaged over all nodes in the tree as well as the root node's hidden vector. Finally, we send the average of all of the individual sentence features (initial experiments with L2 regularization hurt performance on a validation set) as input to a logistic regression classifier for answer prediction.
FIXED-QANTA uses the same DTreeNN configuration as QANTA except the answer vectors are kept constant as in the text-to-image model.

Model            History                Literature
                 Pos 1  Pos 2  Full     Pos 1  Pos 2  Full
BOW               27.5   51.3  53.1      19.3   43.4  46.7
BOW-DT            35.4   57.7  60.2      24.4   51.8  55.7
IR-QB             37.5   65.9  71.4      27.4   54.0  61.9
FIXED-QANTA       38.3   64.4  66.2      28.9   57.7  62.3
QANTA             47.1   72.1  73.7      36.4   68.2  69.1
IR-WIKI           53.7   76.6  77.5      41.8   74.0  73.3
QANTA+IR-WIKI     59.8   81.8  82.3      44.7   78.7  76.6
Table 3.1: Accuracy (%) for history and literature at the first two sentence positions of each question and the full question. The top half of the table compares models trained on questions only, while the IR models in the bottom half have access to Wikipedia. QANTA outperforms all baselines that are restricted to just the question data, and it substantially improves an IR model with access to Wikipedia despite being trained on much less data.

3.1.6 Human Comparison

Previous work provides human answers (Boyd-Graber et al., 2012) for quiz bowl questions. We use human records for 1,201 history guesses and 1,715 literature guesses from twenty-two of the quiz bowl players who answered the most questions. (Participants were skilled quiz bowl players and are not representative of the general population.) The standard scoring system for quiz bowl is 10 points for a correct guess and -5 points for an incorrect guess. We use this metric to compute a total score for each human. To obtain the corresponding score for our model, we force it to imitate each human's guessing policy. For example, Figure 3.2 shows a human answering in the middle of the second sentence. Since our model only considers sentence-level increments, we compare the model's prediction after the first sentence to the human prediction, which means our model is privy to less information than humans.

A minor character in this play can be summoned by a bell that does not always work; that character also doesn't have eyelids. Near the end, a woman who drowned her illegitimate child attempts to stab another woman in the Second Empire-style 3 room in which the entire play takes place. For 10 points, Estelle and Ines are characters in which existentialist play in which Garcin claims “Hell is other people”, written by Jean-Paul Sartre?
Figure 3.2: A question on the play “No Exit” with the human buzz position marked as 3. Since the buzz occurs in the middle of the second sentence, our model is only allowed to see the first sentence.
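The comparison protocol above can be made concrete with a short sketch (hypothetical data structures, not the original evaluation code): each guess is scored with the standard +10/-5 rule, and the model is scored on its prediction at the last full sentence before the human's buzz.

def model_score_vs_human(human_records, model_predict):
    """human_records: list of (question_id, buzz_sentence_index, human_guess, answer).
    model_predict(question_id, n_sentences) returns the model's guess after reading
    the first n_sentences of the question."""
    human_total, model_total = 0, 0
    for qid, buzz_idx, human_guess, answer in human_records:
        human_total += 10 if human_guess == answer else -5
        # If the human buzzed in the middle of sentence k, the model only sees the
        # first k - 1 complete sentences (assumed here to be at least one).
        model_guess = model_predict(qid, max(buzz_idx - 1, 1))
        model_total += 10 if model_guess == answer else -5
    return model_total, human_total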
Figure 3.3: Comparisons of QANTA+IR-WIKI to human quiz bowl players on history (left) and literature (right) questions. Each bar represents an individual human, and the bar height corresponds to the difference between the model score and the human score. Bars are ordered by human skill. Red bars indicate that the human is winning, while blue bars indicate that the model is winning. QANTA+IR-WIKI outperforms most humans on history questions but fails to defeat the “average” human on literature questions.

The resulting distributions are shown in Figure 3.3—our model does better than the average player on history questions, tying or defeating sixteen of the twenty-two players, but it does worse on literature questions, where it only ties or defeats eight players. The figure indicates that literature questions are harder than history questions for our model, which is corroborated by the experimental results discussed in the next section.

3.1.7 Discussion

In this section, we examine why QANTA improves over our baselines by giving examples of questions that are incorrectly classified by all baselines but correctly classified by QANTA. We also take a close look at some sentences that all models fail to answer correctly. Finally, we visualize the answer space learned by QANTA.

3.1.7.1 Experimental Results

Table 3.1 shows that when bag of words and information retrieval methods are restricted to question data, they perform significantly worse than QANTA on early sentence positions. The performance of BOW-DT indicates that while the dependency tree structure helps by itself, the compositional distributed representations learned by QANTA are more useful. The significant improvement when we train answers as part of our vocabulary (see Section 3.1.2) indicates that our model uses answer occurrences within question text to learn a more informative vector space.
The disparity between IR-QB and IR-WIKI indicates that the information retrieval models need lots of external data to work well at all sentence positions. IR-WIKI performs better than other models because Wikipedia contains many more sentences that partially match specific words or phrases found in early clues than the question training set does. In particular, it is impossible for all other models to answer clues in the test set that have no semantically similar or equivalent analogues in the training question data. With that said, IR methods can also operate over data that does not follow the special constraints of quiz bowl questions (e.g., every sentence uniquely identifies the answer, answers don't appear in their corresponding questions), which QANTA cannot handle. By combining QANTA and IR-WIKI, we are able to leverage access to huge knowledge bases along with deep compositional representations, giving us the best of both worlds.

3.1.7.2 Where the Attribute Space Helps Answer Questions

We look closely at the first sentence from a literature question about the author Thomas Mann: “He left unfinished a novel whose title character forges his father's signature to get out of school and avoids the draft by feigning desire to join”. All baselines, including IR-WIKI, are unable to predict the correct answer given only this sentence. However, QANTA makes the correct prediction.
The sentence contains no named entities, which makes it almost impossible for bag of words or string matching algorithms to predict correctly. Figure 3.5 shows that the plot description associated with the “novel” node is strongly indicative of the answer. The five highest-scored answers are all male authors (three of whom also have well-known unfinished novels), which shows that our model is able to learn the answer type without any hand-crafted rules.
Our next example, the first sentence in Table 3.2, is from the first position of a question on John Quincy Adams, which is correctly answered by only QANTA. The bag of words model guesses Henry Clay, who was also a Secretary of State in the nineteenth century and helped John Quincy Adams get elected to the presidency in a “corrupt bargain”. However, our model can reason that while Henry Clay was active at the same time and involved in the same political problems of the era, he did not represent the Amistad slaves, nor did he negotiate the Treaty of Ghent.

3.1.7.3 Where all Models Struggle

Quiz bowl questions are intentionally written to make players work to get the answer, especially at early sentence positions. Our model fails to answer correctly more than half the time after hearing only the first sentence. We examine some examples to see if there are any patterns to what makes a question “hard” for machine learning models.
Consider this question about the Italian explorer John Cabot: “As a young man, this native of Genoa disguised himself as a Muslim to make a pilgrimage to Mecca”. While it is obvious to human readers that the man described in this sentence is not actually a Muslim, QANTA has to accurately model the verb disguised to make that inference. We show the score plot of this sentence in Figure 3.6. The model, after presumably seeing many instances of muslim and mecca associated with Mughal emperors, is unable to prevent this information from propagating up to the root node. On the bright side, our model is able to learn that the question is expecting a human answer rather than non-human entities like the Umayyad Caliphate. More examples of impressive answers by QANTA as well as incorrect guesses by all systems are shown in Table 3.2.

3.1.7.4 Examining the Attribute Space

Figure 3.4 shows a t-SNE visualization (Van der Maaten and Hinton, 2008) of the 451 answers in our history dataset. The vector space is divided into six general clusters, and we focus in particular on the US presidents. Zooming in on this section reveals temporal clustering: presidents who were in office during the same timeframe occur closer together.
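A projection like the one in Figure 3.4 can be produced in a few lines with scikit-learn, assuming a matrix of the learned answer embeddings (the library call and file name below are illustrative, not the code used for the original figure).

import numpy as np
from sklearn.manifold import TSNE

# answer_vectors: a (451, 100) matrix of learned history-answer embeddings (hypothetical file).
answer_vectors = np.load("history_answer_vectors.npy")
projected = TSNE(n_components=2, random_state=0).fit_transform(answer_vectors)
# projected[:, 0] and projected[:, 1] are the 2-D coordinates plotted in Figure 3.4.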
Figure 3.4: t-SNE 2-D projections of 451 answer vectors divided into six major clusters (wars, rebellions, and battles; U.S. presidents; prime ministers; explorers & emperors; policies; and other). The blue cluster is predominantly populated by U.S. presidents. The zoomed plot reveals temporal clustering among the presidents based on the years they spent in office. This observation shows that QANTA is capable of learning attributes of entities during training.

Figure 3.5: A question on the German novelist Thomas Mann that contains no named entities, along with the five top answers as scored by QANTA (Thomas Mann, Joseph Conrad, Henrik Ibsen, Franz Kafka, and Henry James). Each cell in the heatmap corresponds to the score (inner product) between a node in the parse tree and the given answer, and the dependency parse of the sentence is shown on the left. All of our baselines, including IR-WIKI, are wrong, while QANTA uses the plot description to make a correct guess.

Figure 3.6: An extremely misleading question about John Cabot, at least to computer models. The words muslim and mecca lead to three Mughal emperors in the top five guesses from QANTA (shown: Akbar, Shah Jahan, Muhammad, Babur, and Ghana); other models are similarly led awry.

Q: he also successfully represented the amistad slaves and negotiated the treaty of ghent and the annexation of florida from spain during his stint as secretary of state under james monroe
A: john quincy adams, henry clay, andrew jackson
Q: this work refers to people who fell on their knees in hopeless cathedrals and who jumped off the brooklyn bridge
A: howl, the tempest, paradise lost
Q: despite the fact that twenty six martyrs were crucified here in the late sixteenth century it remained the center of christianity in its country
A: nagasaki, guadalcanal, ethiopia
Q: this novel parodies freudianism in a chapter about the protagonist's dream of holding a live fish in his hands
A: billy budd, the ambassadors, all my sons
Q: a contemporary of elizabeth i he came to power two years before her and died two years later
A: grover cleveland, benjamin harrison, henry cabot lodge
Table 3.2: Five example sentences occurring at the first sentence position along with their top three answers as scored by QANTA; correct answers are marked with blue and wrong answers are marked with red. QANTA gets the first three correct, unlike all other baselines. The last two questions are too difficult for all of our models, requiring external knowledge (e.g., Freudianism) and temporal reasoning.

3.2 Simpler Quiz Bowl Models

QANTA is a relatively complex model, containing many different composition matrices and relying on sentential parse trees for the composition order. After finishing the QANTA project, I wanted to explore different neural network architectures for quiz bowl. During experimentation, a simple word vector average yielded highly competitive results, despite the fact that it throws out all word order information. The deep averaging network, described in Chapter 2, was born out of these experiments. To conclude this chapter, I revisit the quiz bowl task with a DAN instead of a DTreeNN and discover that averaging is an effective sentence composition method when paired with a novel dropout variant; this result influences design decisions made for the RMN model described in the next chapter.
Model          Pos 1   Pos 2   Full   Time(s)
BoW-DT          35.4    57.7   60.2      —
IR              37.5    65.9   71.4     N/A
QANTA           47.1    72.1   73.7     314
DAN             46.4    70.8   71.8      18
IR-WIKI         53.7    76.6   77.5     N/A
QANTA-WIKI      46.5    72.8   73.9   1,648
DAN-WIKI        54.8    75.5   77.1     119
Table 3.3: The DAN achieves slightly lower accuracies than the more complex QANTA in much less training time, even at early sentence positions where compositionality plays a bigger role. When Wikipedia is added to the training set (bottom half of table), the DAN outperforms QANTA and achieves comparable accuracy to a state-of-the-art information retrieval baseline, which highlights a benefit of ignoring word order for this task.

Figure 3.7: History quiz bowl accuracy as a function of the word dropout probability. Randomly dropping out 30% of words from the vector average is optimal for the quiz bowl task, yielding a gain in absolute accuracy of almost 3% on the quiz bowl question dataset compared to the same model trained with no word dropout.

3.2.1 DANs for Quiz Bowl

On the same quiz bowl dataset, the DAN outperforms other bag-of-words models and is competitive with QANTA, but requires much less training time. More interestingly, we find that unlike the TreeNN, the DAN significantly benefits from out-of-domain Wikipedia training data. As a reminder, QANTA's dependency-tree TreeNN shows substantial improvements over other QA methods, leading to the hypothesis that correctly modeling compositionality is crucial for answering hard questions. Before describing the experiments, I introduce word dropout, a regularization method that can benefit any neural composition function.

3.2.1.1 Word Dropout Improves Robustness

Dropout regularizes neural networks by randomly setting hidden and/or input units to zero with some probability p (Hinton et al., 2012; Srivastava et al., 2014). Given a neural network with n units, dropout prevents overfitting by creating an ensemble of 2^n different networks that share parameters, where each network consists of some combination of dropped and undropped units.
Instead of dropping units, a natural extension for the DAN model is to randomly drop word tokens' entire word embeddings from the vector average. Another way to view word dropout is as an application of dropout to the one-hot encoding (Section 2.1) of a given sentence. Using this method, which we call word dropout, our network theoretically sees 2^{|X|} different token sequences for each input X. We posit a vector r with |X| independent Bernoulli trials, each of which equals 1 with probability p. The embedding v_w for token w in X is dropped from the average if r_w is 0, which exponentially increases the number of unique examples the network sees during training. This allows us to modify Equation 2.1:

r_w \sim \mathrm{Bernoulli}(p)   (3.4)
\hat{X} = \{w \mid w \in X \text{ and } r_w > 0\}   (3.5)
z = g(w \in X) = \sum_{w \in \hat{X}} \frac{v_w}{|\hat{X}|}.   (3.6)

Depending on the choice of p, many of the “dropped” versions of an original training instance will be very similar to each other, but for shorter inputs this is less likely. We might drop a very important token, such as “horrible” in “the crab rangoon was especially horrible”; however, since the number of word types that are predictive of the output labels is low compared to non-predictive ones (e.g., neutral words in sentiment analysis), we always see improvements using this technique. Theoretically, word dropout can also be applied to other neural network-based approaches.
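As a concrete illustration of Equations 3.4–3.6, the following is a minimal NumPy sketch of word dropout applied to a vector average (illustrative, not the thesis implementation).

import numpy as np

def word_dropout_average(token_embeddings, dropout_prob=0.3, train=True):
    """Average word embeddings, randomly dropping entire tokens (Eqs. 3.4-3.6).

    token_embeddings: (|X|, d) matrix of embeddings v_w for the tokens in X.
    dropout_prob:     probability that a token is dropped (tokens are kept
                      with probability 1 - dropout_prob).
    """
    if not train:
        return token_embeddings.mean(axis=0)
    r = np.random.binomial(1, 1.0 - dropout_prob, size=token_embeddings.shape[0])
    if r.sum() == 0:  # guard against dropping every token in a short input
        return token_embeddings.mean(axis=0)
    return token_embeddings[r.astype(bool)].mean(axis=0)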
In preliminary experiments, however, we observe no significant performance differences when applying word dropout to leaf nodes in TreeNNs for sentiment analysis (dropped leaf representations are set to zero vectors), and it slightly hurts performance on the question answering task.

Dataset and Experimental Setup: We train a DAN over the history questions from Iyyer et al. (2014a). (The training set contains 14,219 sentences over 3,761 questions; for more detail about the data and baseline systems, see Iyyer et al. (2014a).) This dataset is augmented with 49,581 sentence/page-title pairs from the Wikipedia articles associated with the answers in the dataset. For fair comparison with QANTA, we use a normalized tanh activation function at the last layer instead of ReLU, and we also change the output layer from a softmax to the margin ranking loss of Weston et al. (2011) used in QANTA. We initialize the DAN with the same pretrained 100-d word embeddings that were used to initialize QANTA.
We also evaluate the effectiveness of word dropout on this task in Figure 3.7. Cross-validation indicates that p = 0.3 works best for question answering, although the improvement in accuracy is negligible for sentiment analysis. Finally, continuing the trend observed in the sentiment experiments, the DAN converges much faster than QANTA.

DANs Improve with Noisy Data: Table 3.3 shows that while the DAN is slightly worse than QANTA when trained only on question-answer pairs, it improves when trained on additional out-of-domain Wikipedia data (DAN-WIKI), reaching performance comparable to that of a state-of-the-art information retrieval system (IR-WIKI). QANTA, in contrast, barely improves when Wikipedia data is added (QANTA-WIKI), possibly due to the syntactic differences between Wikipedia text and quiz bowl question text.
The most common syntactic structures in quiz bowl sentences are imperative constructions such as “Identify this British author who wrote Wuthering Heights”, which are almost never seen in Wikipedia. Furthermore, the subject of most quiz bowl sentences is a pronoun or pronomial mention referring to the answer, a property that is not true of Wikipedia sentences (e.g., “Little of Emily's work from this period survives, except for poems spoken by characters.”). Finally, many Wikipedia sentences do not uniquely identify the title of the page they come from, such as the following sentence from Emily Brontë's page: “She does not seem to have made any friends outside her family.” While noisy data affect both the DAN and QANTA, the latter is further hampered by the syntactic divergence between quiz bowl questions and Wikipedia, which may explain the lack of improvement in accuracy.

3.3 Conclusion

In this chapter, I first introduce QANTA, a dependency-tree neural network for factoid question answering that outperforms bag of words and information retrieval baselines. The model improves upon a contrastive max-margin objective function from previous work to dynamically update answer vectors during training with a single model. Additionally, I show that sentence-level representations can be easily and effectively combined to generate paragraph-level representations with more predictive power than those of the individual sentences. In the final section of the chapter, I show that the complex machinery of the DTreeNN is actually unnecessary for quiz bowl, and that deep bag-of-words models can achieve similar performance with much less training time.
Chapter 4
Sequential Semantic Parsing: Integrating Context with Structured Prediction

In the previous chapter, we formalized quiz bowl as a classification problem and used very simple methods to extract information from discourse-level context. Semantic parsing, which is the task I explore in this chapter, cannot be tackled in the same way. A semantic parser maps natural language text to meaning representations in formal logic (Liang, 2016); instead of producing an answer prediction as in classification, the goal here is to produce a formal query (or semantic parse) of the question. Once a natural language question has been mapped to a formal query, its answer can be retrieved by executing the query on a back-end structured database. Semantic parsing is well-studied for single-sentence questions, but how can it be extended to cover conversational QA with dependencies between turns? This chapter includes content and figures previously published in Iyyer et al. (2017b), the work for which was done during an internship at Microsoft Research in 2016.

4.1 Motivating Contextual Understanding for Semantic Parsing

One of the main focuses of semantic parsing research is how to address compositionality in language, and complicated questions have been specifically targeted in the design of a recently-released QA dataset (Pasupat and Liang, 2015). Take for example the following question: “of those actresses who won a Tony after 1960, which one took the most amount of years after winning the Tony to win an Oscar?” The corresponding logical form is highly compositional; in order to answer it, many sub-questions must be implicitly answered in the process (e.g., “who won a Tony after 1960?”).
While semantic parsers should be able to answer very complicated questions, in reality these questions are rarely issued by users (for instance, only 3.75% of questions in WikiAnswers have more than 15 words; Fader et al., 2014). Because users can interact with a QA system repeatedly, there is no need to assume a single-turn QA setting where the exact question intent has to be captured with just one complex question. The same intent can be more naturally expressed through a sequence of simpler questions, as shown below:
1. What actresses won a Tony after 1960?
2. Of those, who later won an Oscar?
3. Who had the biggest gap between their two award wins?
Decomposing complicated intents into multiple related but simpler questions is arguably a more effective strategy to explore a topic of interest, and it reduces the cognitive burden on both the person who asks the question and the one who answers it (studies have shown that increased sentence complexity is linked to longer reading times: Hale, 2006; Levy, 2008; Frank, 2013).
In this work, we study semantic parsing for answering sequences of simple related questions. We collect a dataset of question sequences called SequentialQA (SQA; Section 4.2; available at http://aka.ms/sqa) by asking crowdsourced workers to decompose complicated questions sampled from the WikiTableQuestions dataset (Pasupat and Liang, 2015) into multiple easier ones. SQA, which contains 6,066 question sequences with 17,553 total question-answer
pairs, is to the best of our knowledge the first semantic parsing dataset for sequential question answering.

Figure 4.1: An example question sequence created from a compositional question intent. Workers must write questions whose answers are subsets of cells in the table. The original intent is “What super hero from Earth appeared most recently?”, decomposed into: 1. Who are all of the super heroes? 2. Which of them come from Earth? 3. Of those, who appeared most recently? The questions are answered against a Wikipedia table for the Legion of Super Heroes (Post-Infinite Crisis) with columns Character, Home World, Powers, and First Appeared and rows for the characters Night Girl, XS, Gates, Dragonwing, and Harmonia.

Section 4.3 describes our novel dynamic neural semantic parsing framework (DynSP), a weakly supervised structured-output learning approach based on reward-guided search that is designed for solving sequential QA. We demonstrate in Section 4.4 that DynSP achieves higher accuracies than existing systems on SQA, and we offer a qualitative analysis of question types that our method answers effectively, as well as those on which it struggles.

4.2 A Dataset of Question Sequences

We collect the SequentialQA (SQA) dataset via crowdsourcing by leveraging WikiTableQuestions (Pasupat and Liang, 2015, henceforth WTQ), which contains highly compositional questions associated with HTML tables from Wikipedia. Each crowdsourcing task contains a long, complex question originally from WTQ as the question intent. The workers are asked to compose a sequence of simpler questions that lead to the final intent; an example of this process is shown in Figure 4.1.
To simplify the task for workers, we only use questions from WTQ whose answers are cells in the table, which excludes those involving arithmetic and counting. We likewise also restrict the questions our workers can write to those answerable by only table cells. These restrictions speed the annotation process because workers can just click on the table to answer their question. They also allow us to collect answer coordinates (row and column in the table) as opposed to answer text, which removes many normalization issues for answer string matching in evaluation. Finally, we only use long questions that contain nine or more words as intents; shorter questions tend to be simpler and are thus less amenable to decomposition.

4.2.1 Properties of SQA

In total, we use 2,022 question intents from the train and test folds of the WTQ for decomposition. Three workers decompose each intent, resulting in 6,066 unique question sequences containing 17,553 total question-answer pairs (for an average of 2.9 questions per sequence). We divide the dataset into train and test using the original WTQ folds, resulting in an 83/17 train/test split. Importantly, just like in WTQ, none of the tables in the test set are in the training set.
We identify three frequently-occurring question classes: column selection, subset selection, and row selection. (For example, in the sequence “what are all of the tournaments? in which one did he score the least points? on what date was that?”, the first question is a column selection, the second is a subset selection, and the last one is a row selection.) In column selection questions, the answer is an entire column of the table; these questions account for 23% of all questions in SQA. Subset and row selection are more complicated than column selection, as they usually contain coreferences to the previous question's answer. In subset selections, the answer is a subset of the previous question's answer; similarly, the answers to row selections occur in the same row(s) as the previous answer but in a different column. Subset selections make up 27% of SQA, while row selections are an additional 19%. The remaining 31% contains more complex combinations of these three types.
We also observe dramatic differences in the types of questions that are asked at each position of the sequence. For example, 51% of the first questions in the sequences are column selections (e.g., “what are all of the teams?”).
This number dwindles to just 18% when we look at the second question of each sequence, which indicates that the collected sequences start with general questions and progress to more specific ones.

4.3 Dynamic Neural Semantic Parsing

The unique setting of SQA provides both opportunities and challenges. On the one hand, it contains short questions with less compositionality, which in theory should reduce the difficulty of the semantic parsing problem; on the other hand, the additional contextual dependencies of the preceding questions and their answers increase modeling complexity. These observations lead us to propose a dynamic neural semantic parsing framework (DynSP) trained using a reward-guided search procedure for solving SQA.
Given a question (optionally along with previous questions and answers) and a table, DynSP formulates the semantic parsing problem as a state–action search problem. Each state represents a complete or partial parse, while each action corresponds to an operation to extend a parse. The goal during inference is to find an end state with the highest score as the predicted parse.
The quality of the induced semantic parse obviously depends on the scoring function. In our design, the score of a state is determined by the scores of actions taken from the initial state to the target state, which are predicted by different neural network modules based on action type. By leveraging a margin-based objective function, the model learning procedure resembles several structured-output learning algorithms such as structured SVMs (Tsochantaridis et al., 2005), but can take either strong or weak supervision seamlessly.
DynSP is inspired by STAGG, a search-based semantic parser (Yih et al., 2015), as well as the dynamic neural module network (DNMN) of Andreas et al. (2016). Much like STAGG, DynSP chains together different modules as search progresses; however, these modules are implemented as neural networks, which enables end-to-end training as in DNMN. The key difference between DynSP and DNMN is that in DynSP the network structure of an example is not predetermined. Instead, different network structures are constructed dynamically as our learning procedure explores the state space. It is straightforward to answer sequential questions using our framework: we allow the model to take the previous question and its answers as input, with a slightly modified action space to reflect a dependent semantic parse. The same search and learning procedure is then able to effortlessly adapt to the new setting. In this section, we first describe the formal language underlying DynSP, followed by the model formulation and learning algorithm.

4.3.1 Semantic parse language

Because tables are used as the data source to answer questions in SQA, we decide to form our semantic parses in an SQL-like language. (Our framework is not restricted to the formal language we use in this work; the structured query could also be represented in other formal languages, such as the lambda DCS logic used in Pasupat and Liang (2015).) Our parses consist of two parts: a select statement and conjunctions of zero or more conditions. A select statement is associated with a column name, which is referred to as the answer column. Conditions enforce additional constraints on which cells in the answer column can be chosen; a select statement without any conditions indicates that an entire column of the table is the answer to the question.
In particular, each condition contains a column name as the condition column and an operator with zero or more arguments. The operators in this work include: =, ≠, >, ≥, <, ≤, argmin, argmax. A cell in the answer column is only a legitimate answer if the cell of the corresponding row in the condition column satisfies the constraint defined by the operator and its arguments. As a concrete example, suppose the data source is the table in Figure 4.1. The semantic parse of the question “Which super heroes came from Earth and first appeared after 2009?” is “Select Character Where {Home World = Earth} ∧ {First Appeared > 2009}” and the answers are {Dragonwing, Harmonia}.
To handle the sequential aspect of SQA, we extend the semantic parse language by adding a preamble statement subsequent. A subsequent statement contains only conditions, as it essentially adds constraints to the semantic parse of the previous question. For instance, if the follow-up question is “Which of them breathes fire?”, then the corresponding semantic parse is “Subsequent Where {Powers = Fire breath}”. The answer to this question is {Dragonwing}, a subset of the previous answer.

4.3.2 Model formulation

We start introducing our model design by first defining the state and action space. Let S be the set of states and A the set of all actions. A state s \in S is simply a variable-length sequence of actions \{a_1, a_2, a_3, \cdots, a_t\}, where a_i \in A. An empty sequence, s_0 = \phi, is a special state used as the starting point of search. As mentioned earlier, a state represents a (partial) semantic parse of one question. Each action is thus a legitimate operation that can be added to grow the semantic parse. Our action space design is tied closely to the statements defined by our parse language; in particular, an action instance is either a complete or partial statement, and action instances are grouped by type. For example, select and subsequent operations are two action types. A condition statement is formed by two different action types: (1) selection of the condition column, and (2) the comparison operator. The instances of each action type differ in their arguments (e.g., column names, or specific cells in a column). Because conditions in a subsequent parse rely on previous questions and answers, they belong to different action types from regular conditions. Table 4.1 summarizes the action space defined in this work. Any state that represents a complete and legitimate parse is an end state.

Id    Type                 # Action instances
A1    Select-column        # columns
A2    Cond-column          # columns
A3    Op-Equal (=)         # rows
A4    Op-NotEqual (≠)      # rows
A5    Op-GT (>)            # numbers / datetimes
A6    Op-GE (≥)            # numbers / datetimes
A7    Op-LT (<)            # numbers / datetimes
A8    Op-LE (≤)            # numbers / datetimes
A9    Op-ArgMin            # numbers / datetimes
A10   Op-ArgMax            # numbers / datetimes
A11   Subsequent           1
A12   S-Cond-column        # columns
A13   S-Op-Equal (=)       # rows
A14   S-Op-NotEqual (≠)    # rows
A15   S-Op-GT (>)          # numbers / datetimes
A16   S-Op-GE (≥)          # numbers / datetimes
A17   S-Op-LT (<)          # numbers / datetimes
A18   S-Op-LE (≤)          # numbers / datetimes
A19   S-Op-ArgMin          # numbers / datetimes
A20   S-Op-ArgMax          # numbers / datetimes
Table 4.1: Types of actions and the number of action instances in each type. Numbers / datetimes are the mentions discovered in the question (plus the previous question if it is a subsequent condition).
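To make the semantics of these parses concrete, here is a minimal sketch of how a select statement with conditions could be executed against a table stored as a list of rows. It is purely illustrative (not the thesis implementation), and the cell values are filled in approximately from Figure 4.1.

# Rows based on the table in Figure 4.1 (values approximate).
TABLE = [
    {"Character": "Night Girl", "Home World": "Kathoon", "Powers": "Super strength", "First Appeared": 2007},
    {"Character": "Dragonwing", "Home World": "Earth",   "Powers": "Fire breath",    "First Appeared": 2010},
    {"Character": "Harmonia",   "Home World": "Earth",   "Powers": "Elemental",      "First Appeared": 2011},
    {"Character": "Gates",      "Home World": "Vyrga",   "Powers": "Teleporting",    "First Appeared": 2009},
    {"Character": "XS",         "Home World": "Aarok",   "Powers": "Super speed",    "First Appeared": 2009},
]

def execute_select(table, answer_column, conditions):
    """conditions: list of (condition_column, operator, argument) triples."""
    ops = {"=": lambda a, b: a == b, "!=": lambda a, b: a != b,
           ">": lambda a, b: a > b, ">=": lambda a, b: a >= b,
           "<": lambda a, b: a < b, "<=": lambda a, b: a <= b}
    rows = [r for r in table if all(ops[op](r[col], arg) for col, op, arg in conditions)]
    return {r[answer_column] for r in rows}

# "Select Character Where {Home World = Earth} AND {First Appeared > 2009}"
print(execute_select(TABLE, "Character",
                     [("Home World", "=", "Earth"), ("First Appeared", ">", 2009)]))
# -> {'Dragonwing', 'Harmonia'}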
Search does not necessarily need to stop at an end state, because adding more actions (e.g., condition statements) can lead to another end state. Take the same example question from before: “Which super heroes came from Earth and first appeared after 2009?”. One action sequence that represents the parse is {(A1) select-column Character, (A2) cond-column Home World, (A3) op-equal Earth, (A2) cond-column First Appeared, (A5) op-gt 2009}.
Many states represent semantically equivalent parses (e.g., those with the same actions ordered differently, or states with repeated conditions). To prune the search space, we introduce the function Act(s) \subset A, which defines the actions that can be taken when given a state s. Borrowing the idea of staged state generation in Yih et al. (2015), we choose a default ordering of actions based on their types, dictating that a select action must be picked first and that a condition-column needs to be determined before the operator is chosen. The full transition diagram is presented in Figure 4.2. To implement this transition order, we only need to check the last action in the state. In addition, we also disallow adding duplicates of actions that already exist in the state.

Figure 4.2: Possible action transitions based on their types (see Table 4.1). Shaded circles are end states.

We use beam search to find an end state with the highest score for inference. Let s_t be a state consisting of a sequence of actions a_1, a_2, \cdots, a_t. The state value function V is defined recursively as

V(s_t) = V(s_{t-1}) + \pi(s_{t-1}, a_t), \quad V(s_0) = 0,

where the policy function \pi(s, a) scores an action a \in Act(s) given the current state.

4.3.3 Policy function

The intuition behind the policy function can be summarized as follows. Halfway through the construction of a semantic parse, the policy function measures the quality of an immediate action that can be taken next given the current state (i.e., the question and actions that have previously been chosen). To enable integrated, end-to-end learning, the policy function in our framework is parameterized using neural networks. Because each action type has very different semantics, we design different network structures (i.e., modules) accordingly. Most of our network structures encourage learning semantic matching functions between the words in the question and table (either the column names or cells).
Here we illustrate the design using the select-column action type (A1). Conceptually, the corresponding module is a combination of various matching scores. Let W_Q be the embeddings of words in the question and W_C be the embeddings of words in the target column name. The component matching functions are:

f_{\max} = \frac{1}{|W_C|} \sum_{w_c \in W_C} \max_{w_q \in W_Q} w_q^T w_c

f_{\mathrm{avg}} = \left( \frac{1}{|W_C|} \sum_{w_c \in W_C} w_c \right)^T \left( \frac{1}{|W_Q|} \sum_{w_q \in W_Q} w_q \right)

Essentially, for each word in the column name, f_max finds the highest matching question word and outputs the average score. Conversely, f_avg simply uses the average word vectors of the question and column name and returns their inner product. In another variant of f_avg, we replace the question representation with the output of a bi-directional LSTM model. These matching component functions are combined by a 2-layer feed-forward neural network, which outputs a scalar value as the action score.
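The select-column module just described can be sketched as follows: a simplified NumPy version that computes f_max and f_avg and combines them with a small feed-forward network (randomly initialized here for illustration; the bi-directional LSTM variant and all training code are omitted).

import numpy as np

def select_column_score(question_emb, column_emb, W1, b1, w2, b2):
    """Score a select-column action from word-level matching features (Section 4.3.3).

    question_emb: (n_q, d) embeddings of question words (W_Q)
    column_emb:   (n_c, d) embeddings of words in the column name (W_C)
    W1, b1, w2, b2: parameters of a 2-layer feed-forward combiner
    """
    sims = column_emb @ question_emb.T                            # (n_c, n_q) inner products
    f_max = sims.max(axis=1).mean()                               # best question-word match per column word, averaged
    f_avg = column_emb.mean(axis=0) @ question_emb.mean(axis=0)   # inner product of averaged vectors
    features = np.array([f_max, f_avg])
    hidden = np.tanh(W1 @ features + b1)                          # 2-layer feed-forward network
    return float(w2 @ hidden + b2)                                # scalar action score

# Example with random embeddings and parameters (d = 100, hidden size 8):
rng = np.random.default_rng(0)
question_emb = rng.normal(size=(6, 100))   # 6 question words
column_emb = rng.normal(size=(2, 100))     # 2 words in the column name
W1, b1 = rng.normal(size=(8, 2)), np.zeros(8)
w2, b2 = rng.normal(size=8), 0.0
print(select_column_score(question_emb, column_emb, W1, b1, w2, b2))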
4.3.4 Model learning

Because the state value function V is defined recursively as the sum of scores of actions in the sequence, the goal of model optimization is to learn the parameters in the neural networks behind the policy function. Let \theta be the collection of all the model parameters. Then the state value function can be written as V_\theta(s_t) = \sum_{i=1}^{t} \pi_\theta(s_{i-1}, a_i).
In a fully supervised setting where the correct semantic parse of each question is available, learning the policy function can be reduced to a sequence prediction problem. However, while having full supervision leads to a better semantic parser, collecting the correct parses requires a much more sophisticated UI design (Yih et al., 2016). In many scenarios, such as the one in the SQA dataset, it is often the case that only the answers to the questions are available. Adapting a learning algorithm to this weakly supervised setting is thus critical.
Generally speaking, weakly supervised semantic parsers operate on one assumption—a candidate semantic parse is treated as a correct one if it results in answers that are identical to the gold answers. Therefore, a straightforward modification of existing structured learning algorithms in our setting is to use any semantic parse found to evaluate to the correct answers during beam search as a reference parse, and then update the model parameters accordingly. In practice, however, this approach is often problematic: the search space can grow enormously, and when coupled with poor model performance early during training, this leads to beams that contain no parses evaluating to the correct answer. As a result, learning becomes inefficient and takes a long time to converge.
In this work, we propose a conceptually simple learning algorithm for weakly supervised training that sidesteps the inefficient learning problem. Our key insight is to conduct inference using a beam search procedure guided by an approximate reward function. The search procedure is executed twice for each training example, once for finding the best possible reference semantic parse and once for finding the predicted semantic parse used to update the model. Our framework is suitable for learning from either implicit or explicit supervision. Below we describe how we adapt it to the semantic parsing problem in this work.

Approximate reward. Let A(s) be the answers retrieved by executing the semantic parse represented by state s, and let A^* be the set of gold answers of a given question. We define the reward R(s; A^*) = 1[A(s) = A^*], or the accuracy of the retrieved answers. We use R(s) as the abbreviation for R(s; A^*). A state s with R(s) = 1 is called a goal state.
Directly using this reward function in search of goal states can be difficult, as the rewards of most states are 0. However, even when the answers from a semantic parse are not completely correct, some overlap with the gold answers can still hint that the state is close to a goal state, thus providing useful information to guide search. To formalize this idea, we define an approximate reward \tilde{R}(s) in this work using the Jaccard coefficient: \tilde{R}(s) = |A(s) \cap A^*| / |A(s) \cup A^*|. If s is a goal state, then obviously \tilde{R}(s) = R(s) = 1. Also, because our actions effectively add additional constraints to exclude some table cells, any state that succeeds a state s' with \tilde{R}(s') = 0 will also have 0 approximate reward and can be pruned from search immediately. We use the approximate reward \tilde{R} to guide our beam search to find the reference parses (i.e., goal states).
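The approximate reward over answer sets amounts to a few lines of code; a minimal illustrative sketch (the model-score tie-breaking term is introduced next, in Eq. (4.1)):

def approximate_reward(predicted_answers, gold_answers):
    """Jaccard overlap between the retrieved and gold answer sets (R-tilde in the text)."""
    predicted, gold = set(predicted_answers), set(gold_answers)
    if not predicted and not gold:
        return 1.0
    return len(predicted & gold) / len(predicted | gold)

Any parse whose retrieved answers have zero overlap with the gold answers receives reward 0, and, as noted above, all of its extensions can be pruned from the beam.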
Some variations of the approximate reward can make learning more efficient. For instance, we use the model score for tie-breaking, effectively making the approximate reward function depend on the model parameters:

\tilde{R}_\theta(s) = |A(s) \cap A^*| / |A(s) \cup A^*| + \epsilon V_\theta(s),   (4.1)

where \epsilon is a small constant. When a goal state is not found, the state with the highest approximate reward can still be used as a surrogate reference.

Updating parameters. The model parameters are updated by first finding the most violated state \hat{s} and then comparing \hat{s} with a reference state s^* to compute a loss. The idea of finding the most violated state comes from Taskar et al. (2004), with the intuition that the learning algorithm should make the state value function behave similarly to the reward. Formally, for every state s, we would like the value function to satisfy the following constraint:

V_\theta(s^*) - V_\theta(s) \ge R(s^*) - R(s)   (4.2)

R(s^*) - R(s) is thus the margin. As discussed above, we use the approximate reward function \tilde{R}_\theta instead of the true reward. We want to update the model parameters \theta to make sure that the constraint is satisfied. When the constraint is violated, the degree of violation can be written as:

L(s) = V_\theta(s) - V_\theta(s^*) - \tilde{R}_\theta(s) + \tilde{R}_\theta(s^*)   (4.3)

In the algorithm, we want to find the state whose corresponding constraint is most violated. Finding the most violated state is then equivalent to finding the state with the highest value of V_\theta(s) - \tilde{R}_\theta(s), as the other two terms are constant.

Algorithm 1 Model parameter updates
1: for each labeled example (x, A^*) do
2:   s^* \leftarrow \mathrm{argmax}_{s \in E(x)} \tilde{R}(s; A^*)
3:   \hat{s} \leftarrow \mathrm{argmax}_{s \in E(x)} V_\theta(s) - \tilde{R}(s; A^*)
4:   update \theta by minimizing \max(L(\hat{s}), 0)
5: end for

Algorithm 1 sketches the key steps of our method in each iteration. It first picks a training instance (x, A^*), where x represents the table and the question, and A^* is the gold answer set. The approximate reward function \tilde{R} is defined by A^*, while E(x) is the set of end states for this instance. Line 2 finds the best reference and Line 3 finds the most violated state, both relying on beam search for approximate inference. Line 4 computes the gradient of the loss in Eq. (4.3), which is then used in backpropagation to update the model parameters.

4.4 Experiments

Since the questions in SQA are decomposed from those in WTQ, we compare our method, DynSP, to two existing semantic parsers designed for WTQ: (1) the floating parser (FP) of Pasupat and Liang (2015), and (2) the neural programmer (NP) of Neelakantan et al. (2017). We describe below each system's configurations in more detail and qualitatively compare and contrast their performance on SQA.

Floating parser: The floating parser (Pasupat and Liang, 2015) maps questions to logical forms and then executes them on the table to retrieve the answers. It was designed specifically for the WTQ task (achieving 37.0% accuracy on the WTQ test set) and differs from other semantic parsers by not anchoring predicates to tokens in the question, relying instead on typing constraints to reduce the search space. Using FP as-is results in poor performance on SQA because the system is configured for questions with single answers, while SQA contains many questions with multiple-cell answers. We address this issue by removing a pruning hyperparameter (tooManyValues) and features that add bias on the denotation size.

Neural programmer: The neural programmer (NP) proposed by Neelakantan et al. (2017) has shown promising results on WTQ, achieving accuracies on par with those of FP.
4.4 Experiments

Since the questions in SQA are decomposed from those in WTQ, we compare our method, DynSP, to two existing semantic parsers designed for WTQ: (1) the floating parser (FP) of Pasupat and Liang (2015), and (2) the neural programmer (NP) of Neelakantan et al. (2017). We describe below each system's configuration in more detail and qualitatively compare and contrast their performance on SQA.

Floating parser: The floating parser (Pasupat and Liang, 2015) maps questions to logical forms and then executes them on the table to retrieve the answers. It was designed specifically for the WTQ task (achieving 37.0% accuracy on the WTQ test set) and differs from other semantic parsers by not anchoring predicates to tokens in the question, relying instead on typing constraints to reduce the search space. Using FP as-is results in poor performance on SQA because the system is configured for questions with single answers, while SQA contains many questions with multiple-cell answers. We address this issue by removing a pruning hyperparameter (tooManyValues) and features that add bias on the denotation size.

Neural programmer: The neural programmer (NP) proposed by Neelakantan et al. (2017) has shown promising results on WTQ, achieving accuracies on par with those of FP. Similar to our method, NP contains specialized neural modules that perform discrete operations such as argmax and argmin, and it is able to chain together multiple modules to answer a single question. However, module selection in NP is computed via soft attention (Cho et al., 2014), and information is propagated from one module to the next using a recurrent neural network. Since module selection is not tied to a pre-defined parse language like DynSP, NP simply runs for a fixed number of recurrent timesteps per question rather than growing a parse until it is complete.

Comparing the baseline systems: FP and NP exemplify two very different paradigms for designing a semantic parsing system to answer questions using structured data. FP is a feature-rich system that aims to output the correct semantic parse (in a logical parse language) for a given question. On the other hand, the end-to-end neural network of NP relies on its modular architectures to output a probability distribution over cells in a table given a question. While NP can learn more powerful neural matching functions between questions and tables than FP's simpler feature-based matching, NP cannot produce a complete, discrete semantic parse, which means that its actions can only be interpreted coarsely by looking at the order of the modules selected at each timestep. (Since NP uses a fixed number of timesteps for each question, the module order is not guaranteed to correspond to a complete parse.) Furthermore, FP's design theoretically allows it to operate on partial tables indirectly through an API, which is necessary if tables are large and stored in a backend database, while NP requires upfront access to the full tables to facilitate end-to-end model differentiability. (In fact, NP is restricted during training to only questions whose associated tables have fewer than a certain threshold of rows and columns, due to computational constraints.)

Even though FP and NP are powerful systems designed for the more difficult, compositional questions in WTQ, our method outperforms both systems on SQA when we consider all questions within a sequence independently of each other (a fair comparison), demonstrating the power of our search-based semantic parsing framework. More interestingly, when we leverage the sequential information by including the subsequent action, our method improves almost 3% in absolute accuracy. DynSP combines the best parts of both FP and NP. Given a question, we try to generate its correct semantic parse in a formal language that can be predefined by the choice of structured data source (e.g., SQL). However, we push the burden of feature engineering to neural networks as in NP. Our framework is easier to extend to the sequential setting of SQA than either baseline system, requiring just the additional subsequent action. FP's reliance on a hand-designed grammar necessitates extra rules that operate over partial tables from the previous question, which if added would blow up the search space. Meanwhile, modifying NP to handle sequential QA is non-trivial due to soft module and answer selection; it is not immediately clear how to constrain predictions for one question based on the probability distribution over table cells from the previous question in the sequence.

To more fairly compare DynSP to the baseline systems, we also experiment with a "concatenated questions" setting, which allows the baselines to access sequential context. Here, we treat concatenated question prefixes of a sequence as additional training examples, where a question prefix includes all questions prior to the current question in the sequence. For example, suppose the question sequence is:

1. what are all of the teams?
2. of those, which won championships?

For the second question, in addition to the original question–answer pair, we add the concatenated question sequence "what are all of the teams? of those, which won championships?" paired with the second question's answer. We refer to these concatenated question baselines as FP+ and NP+.
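A minimal sketch of how such concatenated-prefix training examples could be generated from one question sequence is shown below; the data layout (a list of question–answer pairs per sequence) is an assumption made for illustration.

```python
def build_concatenated_examples(sequence):
    """Create FP+/NP+-style training examples from one question sequence.

    `sequence` is assumed to be a chronologically ordered list of
    (question, answer) pairs.  For every question after the first, we add
    an extra example whose input is the concatenation of all preceding
    questions plus the current one, paired with the current answer.
    """
    examples = []
    for i, (question, answer) in enumerate(sequence):
        examples.append((question, answer))               # original example
        if i > 0:
            prefix = " ".join(q for q, _ in sequence[: i + 1])
            examples.append((prefix, answer))             # concatenated prefix
    return examples


# Example from the text (answers are placeholders):
seq = [("what are all of the teams?", ["Team A", "Team B"]),
       ("of those, which won championships?", ["Team A"])]
# The second question yields the additional pair
# ("what are all of the teams? of those, which won championships?", ["Team A"]).
```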
4.4.1 DynSP implementation details

Unlike previous dynamic neural network frameworks (Andreas et al., 2016; Looks et al., 2017), where each example can have a different but predetermined structure, DynSP needs to dynamically explore and construct different neural network structures for each question. Therefore, we choose DyNet (Neubig et al., 2017) as our implementation platform for its flexibility in composing computation graphs. We optimize our model parameters using standard stochastic gradient descent. The word embeddings are initialized with 100-d pretrained GloVe vectors (Pennington et al., 2014) and fine-tuned during training with dropout rate 0.5. For follow-up questions, we choose uniformly at random to use either the gold answers to the previous question or the model's previous predictions (only predicted answers are used at test time). We constrain the maximum length of actions to 3 for computational efficiency and set the beam size to 15 in our reported models, as accuracy gains are negligible with larger beam sizes. We train our model for 30 epochs, although the best model on the validation set is usually found within the first 20 epochs. Only CPU is used in model training, and each epoch in the beam size 15 setting takes about 30 minutes to complete.

4.4.2 Results & Analysis

Table 4.2 shows the results of the baseline systems as well as our method on SQA's test set. For each system, we show the overall accuracy, the sequence accuracy (the percentage of sequences for which every question was answered correctly), and the accuracy at each position in the sequence.

Model    All   Seq   Pos 1  Pos 2  Pos 3
FP       34.1  7.2   52.6   25.6   25.9
NP       39.4  10.8  58.9   35.9   24.6
DynSP    42.0  10.2  70.9   35.8   20.1
FP+      33.2  7.7   51.4   22.2   22.3
NP+      40.2  11.8  60.0   35.9   25.5
DynSP*   44.7  12.8  70.4   41.1   23.6

Table 4.2: Accuracies of all systems on SQA; the models in the first half of the table treat questions independently, while those in the second half consider sequential context. Our method outperforms existing ones both in terms of overall accuracy as well as sequence accuracy.

Our method without any sequential information (DynSP) outperforms the standard baselines, and when the subsequent action is added (DynSP*), we improve both overall and sequence accuracy over the concatenated-question baselines. With that said, all of the systems struggle to answer all questions within a sequence correctly, despite the fact that each individual question is simpler on average than those in WTQ.
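For reference, the three kinds of accuracy reported in Table 4.2 can be computed from per-question correctness flags as in the following sketch; the input format is assumed for illustration only.

```python
def sqa_metrics(results):
    """Compute the accuracies reported in Table 4.2.

    `results` is assumed to map each sequence id to an ordered list of
    booleans, one per question, indicating whether that question was
    answered correctly.
    """
    all_flags = [flag for seq in results.values() for flag in seq]
    overall = sum(all_flags) / len(all_flags)                            # "All"
    sequence = sum(all(seq) for seq in results.values()) / len(results)  # "Seq"

    # Accuracy at each position in the sequence ("Pos 1", "Pos 2", ...).
    by_position = {}
    for seq in results.values():
        for pos, flag in enumerate(seq, start=1):
            by_position.setdefault(pos, []).append(flag)
    position = {pos: sum(flags) / len(flags)
                for pos, flags in by_position.items()}
    return overall, sequence, position
```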
Most of the errors made by our system are due to either semantic matching challenges or limitations of the underlying parse language. In the middle example of Figure 4.3, the first question asks for a list of super heroes; from the model's point of view, Real name is a more relevant column than Character, although the latter is correct. The second question also contains a challenging matching problem, where the unlisted home worlds referred to in the question are marked as Unknown in the table. Many of these matching issues are resolved by humans using common sense, which for computers requires far more data than is available in SQA to learn.

Even when there are no tricky discrepancies between question and table text, questions are often complex enough that their semantic parses cannot be expressed in our parse language. Although trivial on the surface, the final question in the bottom sequence of Figure 4.3 is one such example; the correct semantic parse requires access to the answers of both the first and second question, actions that we have not currently implemented in our language due to concerns with the search space size. Increasing the number of complex actions requires designing smarter optimization procedures, which we leave to future work.

Figure 4.3: Parses computed by DynSP for three test sequences (actions in blue boxes, values from table in white boxes). Top: all three questions are parsed correctly (e.g., the sequence "Which nations competed in the FINA women's water polo cup?", "Of these nations, which ones took home at least one gold medal?", "Of those, which ranked in the top 2 positions?" is parsed as SELECT Nation, SUBSEQUENT WHERE Gold != 0, SUBSEQUENT WHERE Rank <= 2). Middle: semantic matching errors cause the model to select incorrect columns and conditions. Bottom: the final question is unanswerable due to limitations of our parse language.

4.5 Related Work

Previous work on conversational QA has focused on small, single-domain datasets. Perhaps most related to our task is the context-dependent sentence analysis of Zettlemoyer and Collins (2009), where conversations between customers and travel agents are mapped to logical forms after resolving referential expressions. Artzi and Zettlemoyer (2011) use another dataset of travel booking conversations to learn a semantic parser for complicated queries given user clarifications. More recently, Long et al. (2016) collect three contextual semantic parsing datasets (from synthetic domains) that contain coreferences to entities and actions. We differentiate ourselves from these prior works in two significant ways: first, our dataset is not restricted to a particular domain, and second, a major goal of our work is to analyze the different types of sequence progressions people create when they are trying to express a complicated intent.

Complex, interactive QA tasks have also been proposed in the information retrieval community, where the data source is a corpus of newswire text (Kelly and Lin, 2007). We also build on aspects of some existing interactive question-answering systems. For example, Harabagiu et al. (2005) include a module that predicts what a user will ask next given their current question. Other than FP and NP, the work on Neural Symbolic Machines (NSM) (Liang et al., 2017) is perhaps the closest to ours.
NSM aims to generate formal semantic parses of questions that can be executed on Freebase to retrieve answers, and it is trained using the REINFORCE algorithm (Williams, 1992) augmented with approximate gold parses found in a separate curriculum learning stage. In comparison, finding reference parses is an integral part of our algorithm. Our non-probabilistic, margin-based objective function also helps avoid the need for empirical tricks to handle normalization and proper sampling, which are crucial when applying REINFORCE in practice.

4.6 Conclusion & Future Work

In this work we move towards a conversational, multi-turn QA scenario in which systems must rely on prior context to answer the user's current question. To this end, we introduce SQA, a dataset that consists of 6,066 unique sequences of inter-related questions about Wikipedia tables, with 17,553 question–answer pairs in total. To the best of our knowledge, SQA is the first semantic parsing dataset that addresses sequential question answering. We propose DynSP, a dynamic neural semantic parsing framework, for solving SQA. By formulating semantic parsing as a state–action search problem, our method learns modular neural network models through reward-guided search. DynSP outperforms existing state-of-the-art systems designed for answering complex questions when applied to SQA, and the gain increases further once subsequent actions are incorporated.

In the future, we plan to investigate several interesting research questions triggered by this work. For instance, although our current formal language design covers most question types in SQA, it is nevertheless important to extend it further to make the semantic parser more robust (e.g., by including UNION or allowing comparison of multiple previous answers). Practically, allowing a more complicated semantic parse structure—either by increasing the number of primitive statements or the length of the parse—poses serious computational challenges in both model learning and inference. Because of the dynamic nature of our framework, it is not trivial to leverage the computational capabilities of GPUs using minibatched training; we plan to investigate ways to take full advantage of modern computing machinery in the near future. Finally, better resolution of semantic matching errors is a top priority, and unsupervised learning from large external corpora is one way to make progress in this direction.

Chapter 5
Dynamically Modeling Fictional Relationships

The quiz bowl questions in Chapter 3 contain paragraph-length contexts, while the conversational histories in the sequential QA problem of Chapter 4 average about three sentences in length. Many language domains contain contexts that are substantially longer and more complicated. Consider novels, which contain character-centric narratives; relationships between characters develop chapter by chapter, and events in the story often have a huge impact on these relationships. In this chapter, we consider the following question: how do we build neural networks that can produce valuable insights from just the raw texts of novels, without any annotated data? (This chapter covers models and datasets previously proposed in Iyyer et al. (2016, NAACL).)

5.1 Motivation

When two characters in a book break bread, is their meal just a result of biological needs or does it mean more? Cognard-Black et al. (2014) argue that this simple interaction reflects the diversity and background of the characters, while Foster (2009) suggests that the tone of a meal can portend either good or ill for the rest of the book.
To support such theories, scholars use their literary expertise to draw connections between disparate books: Gabriel Conroy's dissonance from his family at a sumptuous feast in Joyce's The Dead, the frustration of Tyler's mother in Dinner at the Homesick Restaurant, and the grudging respect for a blind man eating meatloaf in Carver's Cathedral. However, these insights do not come cheap. It takes years of careful reading and internalization to make connections across books, which means that relationship symmetries and archetypes are likely to remain hidden in the millions of books published every year unless literary scholars are actively searching for them.

Figure 5.1: An example trajectory depicting the dynamic relationship between Lucy and Arthur in Bram Stoker's Dracula, which starts with love and ends with Arthur killing the vampiric Lucy. Each column describes the relationship state at a particular time by weights over a set of descriptors (larger weights shown as bigger boxes). Our goal is to learn—without supervision—both the descriptors and the trajectories from raw fictional texts.

Natural language processing techniques have been increasingly used to assist in these literary investigations by discovering patterns in texts (Jockers, 2013). In Section 5.6 we review existing techniques that classify or cluster relationships between characters in books using a fixed set of labels (e.g., friend or enemy). However, such approaches ignore interactions between characters that lie outside of the established lexicon and cannot account for the dynamic nature of relationships that evolve through the course of a book, such as the vampiric downfall of Lucy and Arthur's engagement in Dracula (Figure 5.1) or Winston Smith's rat-induced betrayal of Julia in 1984.

To address these issues, we propose the task of unsupervised relationship modeling, in which a model jointly learns a set of relationship descriptors as well as relationship trajectories for pairs of literary characters. Instead of assigning a single descriptor to a particular relationship, the trajectories learned by the model are sequences of descriptors as in Figure 5.1. The Bayesian hidden topic Markov model (HTMM) of Gruber et al. (2007) emerges as a natural choice for our task because it is capable of computing relationship descriptors (in the form of topics) and has an additional temporal component. However, our experiments show that the descriptors learned by the HTMM are not coherent and focus more on events or environments (e.g., meals, outdoors) than interpersonal states like happiness and sadness. Motivated by recent advances in deep learning, we propose the relationship modeling network (RMN), which is a novel variant of a deep recurrent autoencoder that incorporates dictionary learning to learn relationship descriptors. We show that the RMN achieves better descriptor coherence and trajectory accuracy than the HTMM and other topic model baselines in two crowdsourced evaluations described in Section 5.4.
In Section 5.5 we show qualitative results and make connections to existing literary scholarship.

5.2 A Dataset of Character Interactions

Our dataset consists of 1,383 fictional works pulled from Project Gutenberg and other Internet sources. Project Gutenberg has a limited selection (outside of science fiction) of mostly classic literature, so we add more contemporary novels from various genres such as mystery, romance, and fantasy to our dataset.

To identify character mentions, we run the Book-NLP pipeline of Bamman et al. (2014), which includes character name clustering, quoted speaker identification, and coreference resolution. (While this pipeline works reasonably well, it is unreliable for first-person narratives; we leave the necessary improvements to character name clustering, which are further expanded upon in Vala et al. (2015), for future work.) For every detected character mention, we define a span as beginning 100 tokens before the mention and ending 100 tokens after the mention. We do not use sentence or paragraph boundaries because they vary considerably depending on the author (e.g., William Faulkner routinely wrote single sentences longer than many of Hemingway's paragraphs). All spans in our dataset contain mentions to exactly two characters. This is a rather strict requirement that forces a reduction in data size, but spans in which more than two characters are mentioned are generally noisier.

Once we have identified usable spans in the dataset, we apply a second filtering step that removes relationships containing fewer than five spans. Without this filter, our dataset is dominated by fleeting interactions between minor characters; this is undesirable since our focus is on longer, mutable relationships. Finally, we filter our vocabulary by removing the 500 most frequently occurring words, as well as all words that occur in fewer than 100 books. The latter step helps correct for variation in time period and genre (e.g., "thou" and "thy" found in older works like the Canterbury Tales). Our final dataset contains 20,013 relationships and 380,408 spans, while our vocabulary contains 16,223 words. (Code and span data available at http://github.com/miyyer/rmn.)

5.3 Relationship Modeling Networks

This section mathematically describes how we apply the RMN to relationship modeling on our dataset. Our model is similar in spirit to topic models: for an input dataset, the output of the RMN is a set of relationship descriptors (topics) and—for each relationship in the dataset—a trajectory, or a sequence of probability distributions over these descriptors (document-topic assignments). However, the RMN uses recent advances in deep learning to achieve better control over descriptor coherence and trajectory smoothness (Section 5.4).

5.3.1 Formalizing the Problem

Assume we have two characters c1 and c2 in book b. We define S_{c1,c2} as a sequence of token spans, where each span s_t ∈ S_{c1,c2} is itself a set of tokens {w_1, w_2, ..., w_l} of fixed size l that contains mentions (either directly or by coreference) to both c1 and c2. In other words, S_{c1,c2} includes the text of every scene, chronologically ordered, in which c1 and c2 are present together.

5.3.2 Model Description

As in other neural network models for natural language processing, we begin by associating each word type w in our vocabulary with a real-valued embedding v_w ∈ R^d.
These embeddings are rows of a V × d matrix L, where V is the vocabulary size. Similarly, characters and books have their own embeddings in rows of matrices C and B. We want B to capture global context information (e.g., "Moby Dick" takes place at sea) and C to capture immutable aspects of characters not related to their relationships (e.g., Javert is a police officer). Finally, the RMN learns embeddings for relationship descriptors, which requires a second matrix R of size K × d, where K is the number of descriptors, analogous to the number of topics in topic models.

Each input to the RMN is a tuple that contains identifiers for a book and two characters, as well as the spans corresponding to their relationship: (b, c1, c2, S_{c1,c2}). Given one such input, our objective is to reconstruct S_{c1,c2} using a linear combination of relationship descriptors from R, as shown in Figure 5.2; we now describe this process formally.

Figure 5.2: An example of the RMN's computations at a single time step. The model approximates the vector average of an input span (v_{s_t}) as a linear combination of descriptors from R. The descriptor weights d_t define the relationship state at each time step and—when viewed as a sequence—form a relationship trajectory.

5.3.2.1 Modeling Spans with Vector Averages

We use the DAN architecture detailed in Chapter 2 for span representation; below are the specific details. We compute a vector representation for each span s_t in S_{c1,c2} by averaging the embeddings of the words in that span,

v_{s_t} = \frac{1}{l} \sum_{w \in s_t} v_w. \qquad (5.1)

Then, we concatenate v_{s_t} with the character embeddings v_{c_1} and v_{c_2} as well as the book embedding v_b and feed the resulting vector into a standard feed-forward layer to obtain a hidden state h_t,

h_t = f(W_h \cdot [v_{s_t}; v_{c_1}; v_{c_2}; v_b]). \qquad (5.2)

In all experiments, the transformation matrix W_h is d × 4d, and we set f to the ReLU function, ReLU(x) = max(0, x).

5.3.2.2 Approximating Spans with Relationship Descriptors

Now that we can obtain representations of spans, we move on to learning descriptors using a variant of dictionary learning (Olshausen and Field, 1997; Elad and Aharon, 2006), where our descriptor matrix R is the dictionary and we are trying to approximate input spans as linear combinations of items from this dictionary.

Suppose we compute a hidden state for every span s_t in S_{c1,c2} (Equation 5.2). Now, given an h_t, we compute a weight vector d_t over the K relationship descriptors with some composition function g, which is fully specified in the next section. Conceptually, each d_t is a relationship state, and a relationship trajectory is a sequence of chronologically-ordered relationship states as shown in Figure 5.1. After computing d_t, we use it to compute a reconstruction vector r_t by taking a weighted average over relationship descriptors,

r_t = R^T d_t. \qquad (5.3)

Our goal is to make r_t similar to v_{s_t}. We use a contrastive max-margin objective function similar to previous work (Weston et al., 2011; Socher et al., 2014). We randomly sample spans from our dataset and compute the vector average v_{s_n} for each sampled span as in Equation 5.1. This subset of span vectors is N.
The unregularized objective J is a hinge loss that minimizes the inner product between r_t and the negative samples while simultaneously maximizing the inner product between r_t and v_{s_t},

J(\theta) = \sum_{t=0}^{|S_{c_1,c_2}|} \sum_{n \in N} \max(0, 1 - r_t \cdot v_{s_t} + r_t \cdot v_{s_n}), \qquad (5.4)

where θ represents the model parameters.

5.3.2.3 Computing Weights over Descriptors

What function should we choose for our composition function g to represent a relationship state at a given time step? On the face of it, this seems trivial; we can project h_t to K dimensions and then apply a softmax or some other nonlinearity that yields non-negative weights. (We experiment with a variety of nonlinearities but find that the softmax yields the most interpretable results due to its predisposition to select a single descriptor.) However, this method ignores the relationship states at previous time steps. To model the temporal aspect of relationships, we can add a recurrent connection,

d_t = \mathrm{softmax}(W_d \cdot [h_t; d_{t-1}]), \qquad (5.5)

where W_d is of size K × (d + K) and \mathrm{softmax}(q)_j = \exp(q_j) / \sum_{k=1}^{K} \exp(q_k).

Our hope is that this recurrent connection will carry some of the previous relationship state over to the current time step. It should be unlikely for two characters in love at time t to fall out of love at time t + 1 even if s_{t+1} does not include any love-related words. However, because the objective function in Equation 5.4 maximizes similarity with the current time step's input, the model is not forced to learn a smooth interpolation between the previous state and the current one. A natural remedy is to have the model predict the next time step's input instead, but this proves hard to optimize. We instead force the model to use the previous relationship state by modifying Equation 5.5 to include a linear interpolation between d_t and d_{t-1},

d_t = \alpha \cdot \mathrm{softmax}(W_d \cdot [h_t; d_{t-1}]) + (1 - \alpha) \cdot d_{t-1}. \qquad (5.6)

Here, α is a scalar between 0 and 1. We experiment with setting α to a fixed value of 0.5 as well as allowing the model to learn α as in

\alpha = \sigma(v_\alpha^T \cdot [h_t; d_{t-1}; v_{s_t}]), \qquad (5.7)

where σ is the sigmoid function and v_α is a vector of dimensionality 2d + K. Fixing α = 0.5 initially and then tuning it after the other parameters have converged improves training stability; for the specific hyperparameters we use, see Section 5.4. (This strategy is reminiscent of alternating minimization strategies for dictionary learning (Agarwal et al., 2014), where the dictionary and weights are learned separately by keeping the other fixed.)

5.3.2.4 Interpreting Descriptors and Enforcing Uniqueness

Recall that each descriptor is a d-dimensional row of R. Because our objective function J forces these descriptors to be in the same vector space as that of the word embeddings L, we can interpret them by looking at nearest neighbors in L using cosine distance as the similarity metric. To discourage learning descriptors that are too similar to each other, we add another penalty term X to our objective function,

X(\theta) = \| R R^T - I \|, \qquad (5.8)

where I is the identity matrix. This term comes from the component orthogonality constraint in independent component analysis (Hyvärinen and Oja, 2000). We add J and X together to obtain our final training objective L,

L(\theta) = J(\theta) + \lambda X(\theta), \qquad (5.9)

where λ is a hyperparameter that controls the magnitude of the uniqueness penalty.
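The sketch below ties Equations 5.1–5.9 together in a single forward pass and training objective. It is an illustrative PyTorch re-implementation rather than the code used for the experiments in this chapter; details such as initialization, word dropout, and the learned-α variant (Equation 5.7) are omitted or simplified, and all names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMN(nn.Module):
    """Minimal re-implementation of the RMN (Eqs. 5.1-5.9); sizes and
    training details are simplified relative to the thesis."""

    def __init__(self, vocab_size, num_books, num_chars, d=300, K=30, alpha=0.5):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d)   # L (kept frozen in the thesis)
        self.book_emb = nn.Embedding(num_books, d)    # B
        self.char_emb = nn.Embedding(num_chars, d)    # C
        self.R = nn.Parameter(torch.randn(K, d))      # descriptor matrix R
        self.W_h = nn.Linear(4 * d, d, bias=False)    # Eq. 5.2
        self.W_d = nn.Linear(d + K, K, bias=False)    # Eqs. 5.5-5.6
        self.alpha = alpha                            # fixed-alpha variant of Eq. 5.6
        self.K = K

    def forward(self, spans, book_id, char1_id, char2_id):
        """`spans` is a chronologically ordered list of 1-D LongTensors of
        word ids; ids are 0-dim LongTensors.  Returns the reconstructions
        r_t, the span averages v_st, and the trajectory of weights d_t."""
        v_b = self.book_emb(book_id)
        v_c1, v_c2 = self.char_emb(char1_id), self.char_emb(char2_id)
        d_prev = torch.zeros(self.K)   # initial relationship state (a choice, not from the thesis)
        recons, span_vecs, trajectory = [], [], []
        for span in spans:
            v_st = self.word_emb(span).mean(dim=0)                          # Eq. 5.1
            h_t = F.relu(self.W_h(torch.cat([v_st, v_c1, v_c2, v_b])))      # Eq. 5.2
            d_t = self.alpha * F.softmax(self.W_d(torch.cat([h_t, d_prev])), dim=0) \
                  + (1 - self.alpha) * d_prev                               # Eq. 5.6
            r_t = self.R.t() @ d_t                                          # Eq. 5.3
            recons.append(r_t); span_vecs.append(v_st); trajectory.append(d_t)
            d_prev = d_t
        return recons, span_vecs, trajectory

    def loss(self, recons, span_vecs, negative_vecs, lam=1e-4):
        # Eq. 5.4: contrastive hinge loss over negative samples, plus the
        # Eq. 5.8 orthogonality penalty, combined as in Eq. 5.9.
        J = sum(F.relu(1 - r @ v + r @ v_n)
                for r, v in zip(recons, span_vecs) for v_n in negative_vecs)
        X = torch.norm(self.R @ self.R.t() - torch.eye(self.K))
        return J + lam * X
```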
5.4 Evaluating Descriptors and Trajectories

Because no previous work explores the interpretability of unsupervised relationship modeling over time, evaluating the RMN is tricky. Further compounding the problem is the subjective nature of the task; for example, is a trajectory that ignores a key event better than one that hallucinates episodes absent from the source text? With these issues in mind, we conduct three evaluations to show that our output is reasonable. First, we conduct a crowdsourced interpretability experiment that shows RMNs produce significantly more coherent descriptors than three topic model baselines. A second crowdsourced task indicates that our model produces trajectories that match plot summaries more accurately than topic models. Finally, we qualitatively compare the RMN's output to existing static annotations of literary relationships and find both expected and surprising results.

5.4.1 Topic Model Baselines

Before moving on to the evaluations, we briefly describe three baseline models, all of which are Bayesian generative models. Latent Dirichlet allocation (Blei et al., 2003, LDA) learns a single document-topic distribution per document; we can apply LDA to our dataset by concatenating all spans from a relationship into a single document. Similarly, NUBBI (Chang et al., 2009a) learns separate sets of topics for relationships and individual characters. (NUBBI requires additional spans that mention only a single character to differentiate character topics from relationship topics. None of the other models receives these extra data.)

LDA and NUBBI are incapable of taking into account the chronological ordering of the spans because they view all relationship tokens as exchangeable. While we can compare the descriptors learned by these models to those of the RMN, we cannot evaluate their trajectories. We turn instead to the hidden topic Markov model (Gruber et al., 2007, HTMM), which foregoes the bag-of-words assumption of LDA and NUBBI in favor of modeling topic segments within a document as a Markov chain. This model outputs a smooth sequence of topic assignments over a document, so we can compare the trajectories it learns on our dataset to those of the RMN.

5.4.2 Experimental Settings

In our descriptor interpretability experiments, we vary the number of descriptors (topics) for all models (K = 10, 30, 50). We train LDA and NUBBI for 100 iterations with a collapsed Gibbs sampler, and the HTMM uses the default setting of 100 EM iterations.

For the RMN, we initialize the word embedding matrix L with 300-dimensional GloVe embeddings trained on the Common Crawl (Pennington et al., 2014). The character and book embeddings (C and B) are initialized randomly. We fix α to 0.5 for the first 15 epochs of training; after the descriptor matrix R has converged, we fix R and tune α using Equation 5.6 for 15 more epochs. (Preliminary experiments show that learning α and R simultaneously results in less interpretable descriptors.) Since the topic model baselines do not have access to character and book metadata, for a fair comparison we also train a "generic" version of the RMN (GRMN) where the metadata embeddings are removed from Equation 5.2. The uniqueness penalty λ is set to 10^-4. All of the RMN model parameters except L are optimized using Adam (Kingma and Ba, 2014) with a learning rate of 0.001 for 30 epochs; the word embeddings are not fine-tuned during training. (Tuning L ruins descriptor interpretability; pretrained embeddings are likely already a good solution for our problem.) We also apply word dropout (see Section 3.2.1.1) to the input spans, removing words from the vector average computation in Equation 5.1 with probability 0.5.

5.4.3 Do Descriptors Make Sense?

The goal of our first experiment is to compare the descriptors R learned by the RMN to the topics learned by the topic model baselines.
We conduct a word intrusion experiment (Chang et al., 2009b): workers identify an "intruder" word from a set of words that—other than the intruder—come from the same topic. For the topic models, the five most probable words are joined by a highly-probable word from a different topic as the intruder. We use the same procedure for the RMN and GRMN, except that cosine similarity to descriptor embeddings replaces topic-word probability. To control for randomness in the training process, we train three of each model, so the final experiment consists of 1,350 tasks (K = 10, 30, 50 descriptors per trial, three trials per model). We collect judgments from ten different workers for each task using the Crowdflower platform (http://www.crowdflower.com).

Our evaluation metric, model precision (MP), is the fraction of workers that select the correct intruder word for a descriptor k. Low model precision signals descriptors that lack cohesive themes. On average, the RMN's descriptors are much more interpretable than those of the baselines, as it achieves a mean model precision of 0.73 (Figure 5.3) across all values of K. There is little difference between the model precision of the three topic model baselines, which hover around 0.5. There is also little difference between the GRMN and RMN; however, visualizing the learned character and book embeddings as in Figure 5.6 may be insightful for literary scholars. We show example high and low precision descriptors for the RMN and HTMM in Table 5.1; a full list is included in the supplementary material.

RMN
Label       MP   Nearest Neighbors
sadness     1.0  regretful rueful pity pained despondent
love        1.0  love delightful happiness enjoyed
murder      1.0  autopsy arrested homicide murdered
worship     0.1  toil pray devote yourselves gather
moodiness   0.3  glumly snickered quizzically guiltily
informal    0.4  kinda damn heck guess shitty

HTMM
Label       MP   Most Probable Words
violence    1.0  sword shot blood shouted swung
boats       1.0  ship boat captain deck crew
food        1.0  kitchen mouth glass food bread
sci-fi      0.0  suppose earth robots computer certain
fantasy     0.0  agreed magician dragon castle talent
military    0.1  ship captain lucky hour general

Table 5.1: Three high-precision (top) and three low-precision (bottom) descriptors for the RMN and HTMM, along with labels from an external evaluator and model precision (MP) computed via word intrusion experiments. The RMN is able to learn a variety of interpersonal states (e.g., love, sadness), while the HTMM's most coherent topics are about concrete objects or events.

Figure 5.3: Model precision results from our word intrusion task (model precision on the y-axis, roughly 0.4–0.8, for K = 10, 30, 50; systems compared: LDA, NUBBI, HTMM, GRMN, RMN). The RMN learns more interpretable descriptors than three topic model baselines.
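The construction of an intrusion instance and its model precision could be sketched roughly as follows; the variable names (R for descriptors, L for word embeddings) follow the notation above, but the exact selection details are an approximation of the procedure described in the text, not the actual evaluation code.

```python
import random
import numpy as np

def build_intrusion_task(R, L, vocab, k, other_k, top_n=5):
    """Construct one word-intrusion instance for descriptor k.

    R: K x d descriptor matrix; L: V x d word embedding matrix;
    vocab: list of word types aligned with the rows of L.
    """
    def nearest_words(vec, n):
        sims = (L @ vec) / (np.linalg.norm(L, axis=1) * np.linalg.norm(vec) + 1e-8)
        return [vocab[i] for i in np.argsort(-sims)[:n]]

    topic_words = nearest_words(R[k], top_n)
    # The intruder comes from the top words of a *different* descriptor.
    intruder = next(w for w in nearest_words(R[other_k], top_n)
                    if w not in topic_words)
    words = topic_words + [intruder]
    random.shuffle(words)
    return words, intruder

def model_precision(worker_choices, intruder):
    """Fraction of workers who select the correct intruder word."""
    return sum(c == intruder for c in worker_choices) / len(worker_choices)
```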
5.4.4 Do Trajectories Make Sense?

While the previous evaluation focused only on descriptor quality, our next experiment compares the trajectories learned by the best RMN model from the intrusion experiment (measured by highest mean model precision) to those learned by the best HTMM model, which is the only baseline capable of learning relationship trajectories. Workers read a plot summary and choose which model's trajectory best represents the relationship in question. We use the K = 30 setting because it provides the best balance between descriptor variety and trajectory interpretability.

For this evaluation, we crawl Wikipedia, Goodreads, and SparkNotes for plot summaries associated with our 1,383 books. We then remove all relationships where each involved character is not mentioned at least five times in the summary, which results in a final evaluation set of 125 relationships. (Without this filtering step, workers do not have enough information to compare the two models, since most of the characters in our dataset are not mentioned in summaries.) We present workers with two characters, a plot summary, and a visualization of the trajectories learned by the RMN and the HTMM (Figure 5.4). The workers then select the trajectory that best matches the relationship described by the summary.

To generate the visualizations, we first have an external annotator label each descriptor from both models with a single word as in Table 5.1. For fairness, the annotator is unaware of the underlying models. For the RMN, we visualize trajectories by displaying the label of the argmax over descriptor weights d_t at each time step t. Similarly, for the HTMM, we display the most probable topic at each time step. (To reduce visual clutter, we ignore descriptors that persist for only a single time step.)

The results of this task with seven workers per comparison favor the RMN: for 87 out of the 125 evaluated relationships (69.6%), the workers choose the RMN's trajectory over the HTMM's. We compute the Fleiss κ value (Fleiss, 1971) to correct our inter-annotator agreement for chance and find that κ = 0.32, indicating fair agreement among the workers. Furthermore, thirty-four relationships had unanimous agreement among the seven workers; of these, twenty-six were unanimous in favor of the RMN compared to only eight for the HTMM.

Figure 5.4: An example from the Crowdflower summary matching task; workers are asked to choose the trajectory (here, "A" is generated by the RMN and "B" by the HTMM) that best matches a provided summary that describes the relationship between Siddhartha and Govinda (from Siddhartha by Hesse).

5.4.5 What Makes a Relationship Positive?

While the previous two experiments show that the RMN is more interpretable and accurate than baseline models, we have not yet shown that its insights can aid in drawing connections across various books and genres. As a first step in this direction, we investigate what makes a relationship positive or negative by comparing trajectories from the RMN and HTMM to static affinity annotations from a recently-released dataset of fictional relationships (Massey et al., 2015). Expected correlations (e.g., murder and sadness are strongly negative descriptors) emerge alongside surprising ones (politics is negative, religion is positive).

The affinity labeling task of Massey et al. (2015) requires workers to describe a given relationship as positive, negative, or neutral.
We consider only non-neutral relationships for which two annotators agree on the affinity label, and we remove all books not present in our own dataset. This filtering step results in 120 relationships, 78% of which are positive and the remaining 22% negative.

Since the annotations are static, we first aggregate our trajectories across all time steps. We compute K-dimensional "average positive" and "average negative" weight vectors a_p and a_n by averaging the relationship states d_t for the RMN and the document-topic distributions for the HTMM across all time steps for relationships labeled with a particular affinity. Then, we compute the vector difference a_p − a_n and sort it to produce a ranked list of descriptors, where descriptors with positive differences occur more frequently in positive relationships. Table 5.2 shows the most positive and most negative descriptors; of particular interest is the large negative weight associated with political relationships from both models.

Model   Positive                              Negative
RMN     education, love, religion, sex        politics, murder, sadness, royalty
HTMM    love, parental, business, outdoors    love, politics, violence, crime

Table 5.2: Descriptors most characteristic of positive and negative relationships, computed using existing annotations. Compared to the RMN, the HTMM struggles to coherently characterize negative relationships. Interestingly, both models show negative predispositions for political relationships, perhaps due to genre bias or class differences.
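The ranking described above reduces to a few lines of array arithmetic. The following sketch assumes trajectories and affinity labels are stored in dictionaries keyed by relationship id; this data layout is an illustrative assumption rather than the actual structures used.

```python
import numpy as np

def rank_descriptors_by_affinity(trajectories, affinities, descriptor_labels):
    """Rank descriptors by how much more they occur in positive relationships.

    `trajectories` maps a relationship id to an array of shape (T, K) of
    per-time-step descriptor weights d_t; `affinities` maps the same ids
    to "positive" or "negative" labels from Massey et al. (2015).
    """
    # Aggregate each trajectory over time, then average within each class
    # to obtain the "average positive" and "average negative" vectors.
    def class_mean(label):
        vecs = [traj.mean(axis=0) for rid, traj in trajectories.items()
                if affinities.get(rid) == label]
        return np.mean(vecs, axis=0)

    a_pos, a_neg = class_mean("positive"), class_mean("negative")
    diff = a_pos - a_neg   # positive entries favor positive relationships
    order = np.argsort(-diff)
    return [(descriptor_labels[k], float(diff[k])) for k in order]
```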
Figure 5.5: Trajectories for three relationships (Dracula: Arthur and Lucy; Storm Island: David and Lucy; A Tale of Two Cities: Darnay and Lucie), comparing the RMN and HTMM over time. Left: the RMN is able to model Arthur and Lucy's trajectory reasonably well compared to our manually-created version in Figure 5.1. Middle: both models agree on event-based descriptors such as food and sex. Right: a failure case for the RMN in which it is unable to learn that Lucie Manette and Charles Darnay are in love.

5.5 Qualitative Analysis

Our experiments show the superiority of the RMN over various topic model baselines in both descriptor interpretability and trajectory accuracy, but what causes the improved performance? In this section, we analyze similarities between the RMN and HTMM and look at qualitative examples where the RMN succeeds and fails. We also connect the findings of our affinity experiment to existing literary scholarship.

Figure 5.6: Clusters from PCA visualizations of the RMN's learned book (left) and character (right) embeddings. We see a cluster of books about war and violence (many of which are authored by Tom Clancy) as well as a cluster of lead female characters from primarily romance novels. These visualizations show that the RMN can recover useful static representations of characters and books in addition to the dynamic relationship trajectories.

Both models are equally proficient at learning and assigning event-based descriptors (e.g., crime, violence, food). More specifically, the RMN and HTMM agree on environmental descriptions (e.g., boats, outdoors) and graphic sexual scenes (Figure 5.5, middle). However, the RMN is more sophisticated with interpersonal relationships. None of the topic model baselines learns negative emotional descriptors such as sadness or suffering, which explains the inaccurate HTMM trajectory of Arthur and Lucy in the left-most panel of Figure 5.5.

All of the topic model baselines learn duplicate topics; in Table 5.2, one love descriptor is highly positive while a duplicate is strongly negative. (This "duplicate love" phenomenon persists even when we reduce the number of topics.) The RMN circumvents this problem with its uniqueness penalty (Equation 5.8).

While the increased descriptor variety is a positive, sometimes it leads the RMN astray. The model largely ignores the love between Charles Darnay and Lucie Manette in Dickens' A Tale of Two Cities due to the book's sad tone; meanwhile, the HTMM's trajectory, while vastly simplified, does pick up on the romance (Figure 5.5, right). While the RMN's learnable book and character embeddings should help, the signal in a span cannot always lead to the "proper" descriptor.

Both the RMN and HTMM learn that politics is strongly negative (Table 5.2). Existing scholarship supports this finding: Victorian-era authors, for example, are "obsessed with otherness . . . of antiquated social and legal institutions, and of autocratic and/or dictatorial abusive government" (Zarifopol-Johnston, 1995), while in science fiction, "dystopia—precisely because it is so much more common (than utopia)—bears the aspect of lived experience" (Gordin et al., 2010). Our affinity data comes primarily from Victorian novels (e.g., by Dickens and George Eliot), leading us to believe that the models are behaving reasonably. Finally, returning to the "extra" meaning of meals discussed at the beginning of the chapter, food occurs slightly more frequently in positive relationships.

5.6 Related Work

There are two major areas upon which our work builds: computational literary analysis and deep neural networks for natural language processing.

Most previous work in computational literary analysis has focused either on characters or events. In the former category, graphical models and classifiers have been proposed for learning character personas from novels (Bamman et al., 2014; Flekova and Gurevych, 2015) and film summaries (Bamman et al., 2013). The NUBBI model of Chang et al. (2009a) learns topics that statically describe characters and their relationships. Because these models lack temporal components (the focus of our task), we compare instead against the HTMM of Gruber et al. (2007).

Closest to our own work is the supervised structured prediction problem of Chaturvedi et al. (2016), in which features are designed to predict dynamic sequences of positive and negative interactions between two characters in plot summaries. Other research in this area includes social network construction from novels (Elson et al., 2010; Srivastava et al., 2016) and film (Krishnan and Eisenstein, 2015), as well as attempts to summarize and generate stories (Elsner, 2012).

While some of the relationship descriptors learned by our model are character-centric, others are more event-based, depicting actions rather than feelings; such descriptors have been the focus of much previous work (Schank and Abelson, 1977; Chambers and Jurafsky, 2008, 2009; Orr et al., 2014). Our model is more closely related to the plot units framework (Lehnert, 1981; Goyal et al., 2013), which annotates events with emotional states.

The RMN builds on deep recurrent autoencoders such as the hierarchical LSTM autoencoder of Li et al. (2015); however, it is more efficient because of the span-level vector averaging.
It is also similar to recent neural topic model architectures (Cao et al., 2015; Das et al., 2015), although these models are limited to static document representations. We hope to apply the RMN to nonfictional datasets as well; in this vein, Iyyer et al. (2014b) apply a neural network to sentences from nonfiction political books for ideology prediction. More generally, topic models and related generative models are a central tool for understanding large corpora from science (Talley et al., 2011) to politics (Nguyen et al., 2014). We show representation learning models like RMN can be just as interpretable as 90 LDA-based models. Other applications for which researchers have prioritized interpretable vector representations include text-to-vision mappings (Lazaridou et al., 2014) and word embeddings (Fyshe et al., 2015; Faruqui et al., 2015). 5.7 Conclusion We formalize the task of unsupervised relationship modeling, which involves learning a set of relationship descriptors as well as a trajectory over these descriptors for each relationship in an input dataset. We present the RMN, a novel neural network architecture for this task that generates more interpretable descriptors and trajectories than topic model baselines. Finally, we show that the output of our model can lead to interesting insights when combined with annotations in an existing dataset. 91 Chapter 6 Understanding Panel-to-Panel Inferences in Comic Books So far, we have looked at problems that involve understanding and extracting information from language-based contexts. To close out this thesis, I look at comic books, a multimodal domain that incorporates images and language into a single medium. As we will see, the network architectures are similar to those used in the previous chapters, and the challenges of leveraging information from previously-observed context still remain.1 Figure 6.1: Where did the snake in the last panel come from? Why is it biting the man? Is the man in the second panel the same as the man in the first panel? To answer these questions, readers form a larger meaning out of the narration boxes, speech bubbles, and artwork by applying closure across panels. 1This chapter includes content and figures from Iyyer et al. (2017a, CVPR). 92 6.1 Motivation Comics are fragmented scenes forged into full-fledged stories by the imagination of their readers. A comics creator can condense anything from a centuries-long intergalactic war to an ordinary family dinner into a single panel. But it is what the creator hides from their pages that makes comics truly interesting: the unspoken conversations and unseen actions that lurk in the spaces (or gutters) between adjacent panels. For example, the dialogue in Figure 6 suggests that between the second and third panels, Gilda commands her snakes to chase after a frightened Michael in some sort of strange cult initiation. Through a process called closure (McCloud, 1994), which involves (1) understanding individual panels and (2) making connective inferences across panels, readers form coherent storylines from seemingly disparate panels such as these. In this chapter, we study whether computers can do the same by collecting a dataset of comic books (COMICS) and designing several tasks that require closure to solve. Section 6.2 describes how we create COMICS,2 which contains ∼1.2 million panels drawn from almost 4,000 publicly-available comic books published during the “Golden Age” of American comics (1938–1954). 
COMICS is challenging in both style and content compared to natural images (e.g., photographs), which are the focus of most existing datasets and methods (Krizhevsky et al., 2012; Xu et al., 2015; Xiong et al., 2016). Much like painters, comic artists can render a single object or concept in multiple artistic styles to evoke different emotional responses from the reader. For example, the lions in Figure 6.2 are drawn with varying degrees of realism: the more cartoonish lions, from humorous comics, take on human expressions (e.g., surprise, nastiness), while those from adventure comics are more photorealistic. (Data, code, and annotations for COMICS are available at http://github.com/miyyer/comics.)

Figure 6.2: Different artistic renderings of lions taken from the COMICS dataset. The left-facing lions are more cartoonish (and humorous) than the ones facing right, which come from action and adventure comics that rely on realism to provide thrills.

Comics are not just visual: creators push their stories forward through text—speech balloons, thought clouds, and narrative boxes—which we identify and transcribe using optical character recognition (OCR). Together, text and image are often intricately woven together to tell a story that neither could tell on its own (Section 6.5). To understand a story, readers must connect dialogue and narration to characters and environments; furthermore, the text must be read in the proper order, as panels often depict long scenes rather than individual moments (Cohn, 2010). Text plays a much larger role in COMICS than it does for existing datasets of visual stories (Huang et al., 2016b).

To test machines' ability to perform closure, we present three novel cloze-style tasks in Section 6.6 that require a deep understanding of narrative and character to solve. In Section 6.7, we design four neural architectures to examine the impact of multimodality and contextual understanding via closure. All of these models perform significantly worse than humans on our tasks; we conclude with an error analysis (Section 6.8) that suggests future avenues for improvement.

6.2 Creating a dataset of comic books

Comics, defined by cartoonist Will Eisner as sequential art (Eisner, 1990), tell their stories in sequences of panels, or single frames that can contain both images and text. Existing comics datasets (Guérin et al., 2013; Matsui et al., 2015) are too small to train data-hungry machine learning models for narrative understanding; additionally, they lack diversity in visual style and genres. Thus, we build our own dataset, COMICS, by (1) downloading comics in the public domain, (2) segmenting each page into panels, (3) extracting textbox locations from panels, and (4) running OCR on textboxes and post-processing the output. Table 6.1 summarizes the contents of COMICS. The rest of this section describes each step of our data creation pipeline.

# Books                       3,948
# Pages                     198,657
# Panels                  1,229,664
# Textboxes               2,498,657
Text cloze instances         89,412
Visual cloze instances      587,797
Char. coherence instances    72,313

Table 6.1: Statistics describing dataset size (top) and the number of total instances for each of our three tasks (bottom).

6.2.1 Where do our comics come from?

The "Golden Age of Comics" began during America's Great Depression and lasted through World War II, ending in the mid-1950s with the passage of strict censorship regulations.
In contrast to the long, world-building story arcs popular in later eras, Golden Age comics tend to be small and self-contained; a single book usually contains multiple different stories sharing a common theme (e.g., crime or mystery). While the best-selling Golden Age comics tell of American superheroes triumphing over German and Japanese villains, a variety of other genres (such as romance, humor, and horror) also enjoyed popularity (Goulart, 2004). The Digital Comics Museum (DCM)3 hosts user-uploaded scans of many comics by lesser-known Golden Age publishers that are now in the public domain due to copyright expiration. To avoid off-square images and missing pages, as the scans vary in resolution and quality, we download the 4,000 highest-rated comic books from DCM.4 3http://digitalcomicmuseum.com/ 4Some of the panels in COMICS contain offensive caricatures and opinions reflective of that period in American history. 96 6.2.2 Breaking comics into their basic elements The DCM comics are distributed as compressed archives of JPEG page scans. To analyze closure, which occurs from panel-to-panel, we first extract panels from the page images. Next, we extract textboxes from the panels, as both location and content of textboxes are important for character and narrative understanding. Panel segmentation: Previous work on panel segmentation uses heuristics (Li et al., 2014) or algorithms such as density gradients and recursive cuts (Tanaka et al., 2007; Pang et al., 2014b; Rigaud et al., 2015) that rely on pages with uniformly white backgrounds and clean gutters. Unfortunately, scanned images of eighty-year old comics do not particularly adhere to these standards; furthermore, many DCM comics have non-standard panel layouts and/or textboxes that extend across gutters to multiple panels. After our attempts to use existing panel segmentation software failed, we turned to deep learning. We annotate 500 randomly-selected pages from our dataset with rectangular bounding boxes for panels. Each bounding box encloses both the panel artwork and the textboxes within the panel; in cases where a textbox spans multiple panels, we necessarily also include portions of the neighboring panel. After annotation, we train a region-based convolutional neural network to automatically detect panels. In particular, we use Faster R-CNN (Ren et al., 2015) initialized with a pretrained VGG CNN M 1024 model (Chatfield et al., 2014) and alternatingly optimize the region proposal network and the detection network. In Western comics, panels are usually read left-to-right, top-to-bottom, so we also have to properly order all of the panels within a page after extraction. We compute 97 the midpoint of each panel and sort them using Morton order (Morton, 1966), which gives incorrect orderings only for rare and complicated panel layouts. Textbox segmentation: Since we are particularly interested in modeling the interplay be- tween text and artwork, we need to also convert the text in each panel to a machine-readable format.5 As with panel segmentation, existing comic textbox detection algorithms (Ho et al., 2012; Rigaud et al., 2013) could not accurately localize textboxes for our data. Thus, we resort again to Faster R-CNN: we annotate 1,500 panels for textboxes,6 train a Faster-R-CNN, and sort the extracted textboxes within each panel using Morton order. 6.2.3 OCR The final step of our data creation pipeline is applying OCR to the extracted textbox images. 
We unsuccessfully experimented with two trainable open-source OCR systems, Tesseract (Smith, 2007) and Ocular (Berg-Kirkpatrick et al., 2013), as well as Abbyy’s consumer-grade FineReader.7 The ineffectiveness of these systems is likely due to the considerable variation in comic fonts as well as domain mismatches with pretrained language models (comics text is always capitalized, and dialogue phenomena such as dialects may not be adequately represented in training data). Google’s Cloud Vision OCR8 performs much better on comics than any other system we tried. While it sometimes 5Alternatively, modules for text spotting and recognition (Jaderberg et al., 2016) could be built into architectures for our downstream tasks, but since comic dialogues can be quite lengthy, these modules would likely perform poorly. 6We make a distinction between narration and dialogue; the former usually occurs in strictly rectangular boxes at the top of each panel and contains text describing or introducing a new scene, while the latter is usually found in speech balloons or thought clouds. 7http://www.abbyy.com 8http://cloud.google.com/vision 98 struggles to detect short words or punctuation marks, the quality of the transcriptions is good considering the image domain and quality. We use the Cloud Vision API to run OCR on all 2.5 million textboxes for a cost of $3,000. We post-process the transcriptions by removing systematic spelling errors (e.g., failing to recognize the first letter of a word). Finally, each book in our dataset contains three or four full-page product advertisements; since they are irrelevant for our purposes, we train a classifier on the transcriptions to remove them. 6.3 OCR Post-Processing and Advertisement Removal OCR makes systematic mistakes on our textboxes. We target two types of these mistakes using PyEnchant:9 1) where the OCR system fails to recognize the first letter of a particular word (e.g., eleportation instead of teleportation), and 2) where the OCR system transcribes part of a word as a single alphabetical character. To eliminate errors of the first type, we start by tokenizing the OCR output using NLTK’s Punkt Tokenizer.10 We then sort the vocabulary of the tokenized OCR output in decreasing order of frequency and pick words ranked from 10,001 to 100,000, because most misspelled words are also rare. For each of these words that is length three or longer, we look up the most likely suggestion offered by PyEnchant. If the only difference between the most likely suggestion and the original word is an additional letter in the first position of the suggestion, then we replace the word with the suggestion everywhere in our corpus. To correct the second type of errors, we simply delete all single character alphabetical tokens that are not one of a, d, i, m, s, t—characters 9http://pythonhosted.org/pyenchant/faq.html 10http://www.nltk.org/ 99 that can plausibly occur by themselves quite frequently (some occur after an apostrophe). In addition to spelling errors, the books in COMICS contain many advertisements that we need to remove before generating data for our tasks. While most dialogue and narration boxes contain less than 30 words, longer textboxes frequently come from full-page product advertisements (e.g., Figure 6.3). However, detecting ads from page images is not easy. Some ads are deceptively similar to comic pages, containing images and even containing faux mini-comics. 
Aside from ads, there are other undesirable pages as well; many books contain text-only short stories in addition to comics. We remove these kinds of pages using features computed from the OCR transcriptions. We annotate each page of 100 random books with a binary label indicating whether the page is invalid (our training set) and each page of twenty random books for our test set. Out of 6,117 annotated pages, 697 (11.4%) are either advertisements or text-only stories. We train a binary classifier using Vowpal Wabbit,11 which takes the OCR text from all panels of a page as lexical features (unigrams and bigrams). We improve the model by adding features such as the total word count of the page and the number of non-alphanumeric characters. The resulting model has a total misclassification error of 8% and a false negative error of 17.3%, which means it misses roughly one invalid page out of every six; its false positive error is a negligible 0.2%. Using this model to filter the entire dataset of 198,657 pages yields 13,200 invalid pages.
11https://github.com/JohnLangford/vowpal_wabbit/wiki

Figure 6.3: An advertisement from the dataset. The juxtaposition of text and image causes it to slightly resemble a comics page.

6.4 Examples from Dataset Creation
OCR transcription is the final stage of our data creation pipeline (panel extraction → textbox extraction → OCR). Therefore, faulty outputs in any of the preceding steps can lead to faulty OCR outputs. In Figure 6.4, there are only minor errors in the OCR output due to understandable misinterpretations of the text in the dialogue boxes. For example, the OCR interprets the letters "IC" as "K", which leads to incorrectly predicting the word "QUICKLY" as "QUKKLY". However, in Figure 6.5, we observe a more critical error due to missing pixels in the panel extraction process. Because of the layout of the textbox in the panel, crucial portions of the text are trimmed from view; while the OCR does a valiant job of predicting the contents of the textbox, its output is gibberish.

Figure 6.4: A minor OCR error. Mistakes such as predicting "BG" for "BIG" are understandable, since the 'I' in "BIG" is barely visible. Similarly, the "IC" in "QUICKLY" looks a lot like "K" in this font. Finally, "SUB STANCE" is predicted rather than "SUBSTANCE", due to an end-of-line word break.

Figure 6.5: A major OCR error. In part a) of the figure, note the location of the panel in the page. b) gives us the panel as predicted by the R-CNN, but a critical portion of the text is missing. As a consequence, the textbox extraction is also faulty, rendering the OCR completely meaningless.

6.5 Data Analysis
In this section, we explore what makes understanding narratives in COMICS difficult, focusing specifically on intrapanel behavior (how images and text interact within a panel) and interpanel transitions (how the narrative advances from one panel to the next). We characterize panels and transitions using a modified version of the annotation scheme in Scott McCloud's "Understanding Comics" (McCloud, 1994). Over 90% of panels rely on both text and image to convey information, as opposed to just using a single modality. Closure is also important: to understand most transitions between panels, readers must make complex inferences that often require common sense (e.g., connecting jumps in space and/or time, recognizing when new characters have been introduced to an existing scene).
We conclude that any model trained to understand narrative flow in COMICS will have to effectively tie together multimodal inputs through closure.

To perform our analysis, we manually annotate 250 randomly-selected pairs of consecutive panels from COMICS. Each panel of a pair is annotated for intrapanel behavior, while an interpanel annotation is assigned to the transition between the panels. Two annotators independently categorize each pair, and a third annotator makes the final decision when they disagree.

Figure 6.6: Five example panel sequences from COMICS, one for each type of interpanel transition. Individual panel borders are color-coded to match their intrapanel categories (legend in bottom-left). Moment-to-moment transitions unfold like frames in a movie, while scene-to-scene transitions are loosely strung together by narrative boxes. Percentages are the relative prevalence of the transition or panel type in an annotated subset of COMICS.

We use four intrapanel categories (definitions from McCloud, percentages from our annotations):
1. Word-specific, 4.4%: The pictures illustrate, but do not significantly add to, a largely complete text.
2. Picture-specific, 2.8%: The words do little more than add a soundtrack to a visually-told sequence.
3. Parallel, 0.6%: Words and pictures seem to follow very different courses without intersecting.
4. Interdependent, 92.1%: Words and pictures go hand-in-hand to convey an idea that neither could convey alone.

We group interpanel transitions into five categories:
1. Moment-to-moment, 0.4%: Almost no time passes between panels, much like adjacent frames in a video.
2. Action-to-action, 34.6%: The same subjects progress through an action within the same scene.
3. Subject-to-subject, 32.7%: New subjects are introduced while staying within the same scene or idea.
4. Scene-to-scene, 13.8%: Significant changes in time or space occur between the two panels.
5. Continued conversation, 17.7%: Subjects continue a conversation across panels without any other changes.

The two annotators agree on 96% of the intrapanel annotations (Cohen's κ = 0.657), which is unsurprising because almost every panel is interdependent. The interpanel task is significantly harder: agreement is only 68% (Cohen's κ = 0.605). Panel transitions are more diverse, as all types except moment-to-moment are relatively common (Figure 6.6); interestingly, moment-to-moment transitions require the least amount of closure, as there is almost no change in time or space between the panels. Multiple transition types may apply to the same transition, such as simultaneous changes in subjects and actions, which also contributes to the lower interpanel agreement.

6.6 Tasks that test closure
To explore closure in COMICS, we design three novel tasks (text cloze, visual cloze, and character coherence) that test a model's ability to understand narratives and characters given a few panels of context.
Figure 6.7: In the character coherence task (top), a model must order the dialogues in the final panel, while visual cloze (bottom) requires choosing the image of the panel that follows the given context. For visualization purposes, we show the original context panels; during model training and evaluation, textboxes are blacked out in every panel.

As shown in the previous section's analysis, a high percentage of panel transitions require non-trivial inferences from the reader; to successfully solve our proposed tasks, a model must be able to make the same kinds of connections. While their objectives are different, all three tasks follow the same format: given preceding panels $p_{i-1}, p_{i-2}, \ldots, p_{i-n}$ as context, a model is asked to predict some aspect of panel $p_i$. While previous work on visual storytelling focuses on generating text given some context (Huang et al., 2016a), the dialogue-heavy text in COMICS makes evaluation difficult (e.g., dialects, grammatical variations, many rare words). We want our evaluations to focus specifically on closure, not generated text quality, so we instead use a cloze-style framework (Taylor, 1953): given $c$ candidates, only one of which is correct, models must use the context panels to rank the correct candidate higher than the others. The rest of this section describes each of the three tasks in detail; Table 6.2.1 provides the total instances of each task with the number of context panels n = 3.

Text Cloze: In the text cloze task, we ask the model to predict what text out of a set of candidates belongs in a particular textbox, given both context panels (text and image) as well as the current panel image. While initially we did not put any constraints on the task design, we quickly noticed two major issues. First, since the panel images include textboxes, any model trained on this task could in principle learn to crudely imitate OCR by matching text candidates to the actual image of the text. To solve this problem, we "black out" the rectangle given by the bounding boxes for each textbox in a panel (see Figure 6.6).12 Second, panels often have multiple textboxes (e.g., conversations between characters); to focus on interpanel transitions rather than intrapanel complexity, we restrict $p_i$ to panels that contain only a single textbox. Thus, nothing from the current panel matters other than the artwork; the majority of the predictive information comes from previous panels.
12To reduce the chance of models trivially correlating candidate length to textbox size, we remove very short and very long candidates.

Visual Cloze: We know from Section 6.5 that in most cases, text and image work interdependently to tell a story. In the visual cloze task, we follow the same set-up as in text cloze, but our candidates are images instead of text. A key difference is that models are not given text from the final panel; in text cloze, models are allowed to look at the final panel's artwork. This design is motivated by eyetracking studies of single-panel cartoons, which show that readers look at artwork before reading the text (Carroll et al., 1992), although atypical font style and text length can invert this order (Foulsham et al., 2016).

Character Coherence: While the previous two tasks focus mainly on narrative structure, our third task attempts to isolate character understanding through a re-ordering task.
Given a jumbled set of text from the textboxes in panel $p_i$, a model must learn to match each candidate to its corresponding textbox. We restrict this task to panels that contain exactly two dialogue boxes (narration boxes are excluded to focus the task on characters). While it is often easy to order the text based on the language alone (e.g., "how's it going" always comes before "fine, how about you?"), many cases require inferring which character is likely to utter a particular bit of dialogue based on both their previous utterances and their appearance (e.g., Figure 6.7, top).

6.6.1 Task Difficulty
For text cloze and visual cloze, we have two difficulty settings that vary in how cloze candidates are chosen. In the easy setting, we sample textboxes (or panel images) from the entire COMICS dataset at random. Most incorrect candidates in the easy setting have no relation to the provided context, as they come from completely different books and genres. This setting is thus easier for models to "cheat" on by relying on stylistic indicators instead of contextual information. With that said, the task is still non-trivial; for example, many bits of short dialogue can be applicable in a variety of scenarios. In the hard setting, the candidates come from nearby pages, so models must rely on the context to perform well. For text cloze, all candidates are likely to mention the same character names and entities, while color schemes and textures become much less distinguishing for visual cloze.

Figure 6.8: The image-text architecture applied to an instance of the text cloze task. Pretrained image features are combined with learned text features in a hierarchical LSTM architecture to form a context representation, which is then used to score text candidates.

6.7 Models & Experiments
To measure the difficulty of these tasks for deep learning models, we adapt strong baselines for multimodal language and vision understanding tasks to the comics domain. We evaluate four different neural models, variants of which were also used to benchmark the Visual Question Answering dataset (Antol et al., 2015) and encode context for visual storytelling (Huang et al., 2016b): text-only, image-only, and two image-text models. Our best-performing model encodes panels with a hierarchical LSTM architecture (see Figure 6.8). On text cloze, accuracy increases when models are given images (in the form of pretrained VGG-16 features) in addition to text; on the other tasks, incorporating both modalities is less important. Additionally, for the text cloze and visual cloze tasks, models perform far worse in the hard setting than in the easy setting, confirming our intuition that these tasks are non-trivial when we control for stylistic dissimilarities between candidates. Finally, none of the architectures outperform human baselines, which demonstrates the difficulty of understanding COMICS: image features obtained from models trained on natural images cannot capture the vast variation in artistic styles, and textual models struggle with the richness and ambiguity of colloquial dialogue that is highly dependent on visual context. In the rest of this section, we first introduce a shared notation and then use it to specify all of our models.
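As a concrete illustration of the inputs these models consume, the following sketch shows how a single text cloze instance might be assembled under the easy and hard candidate-sampling schemes of Section 6.6.1. The book/page/panel data structures and field names here are hypothetical stand-ins, not the released data format.

```python
import random
from dataclasses import dataclass

@dataclass
class TextClozeInstance:
    context_text: list        # OCR text for the n context panels
    context_images: list      # image features (e.g., fc7) for the context panels
    final_panel_image: object # artwork of the panel being predicted
    candidates: list          # one correct textbox plus sampled negatives
    label: int                # index of the correct candidate

def make_instance(book, page_idx, panel_idx, all_textboxes, n_context=3,
                  n_candidates=3, setting="hard", window=2):
    """Assemble one text cloze instance; `book` is assumed to be a nested
    list of pages -> panels, each panel with .text and .image attributes.
    Assumes panel_idx >= n_context and a sufficiently large candidate pool."""
    panels = book[page_idx]
    context = panels[panel_idx - n_context:panel_idx]
    correct = panels[panel_idx].text
    if setting == "easy":
        pool = all_textboxes  # negatives drawn at random from the whole corpus
    else:
        # hard negatives come from nearby pages of the same book
        nearby = range(max(0, page_idx - window), min(len(book), page_idx + window + 1))
        pool = [p.text for j in nearby for p in book[j] if p.text != correct]
    candidates = random.sample(pool, n_candidates - 1) + [correct]
    random.shuffle(candidates)
    return TextClozeInstance(
        context_text=[p.text for p in context],
        context_images=[p.image for p in context],
        final_panel_image=panels[panel_idx].image,
        candidates=candidates,
        label=candidates.index(correct),
    )
```

The only difference between the two settings is the candidate pool: the entire corpus for easy, a small window of neighboring pages for hard.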
6.7.1 Model definitions
In all of our tasks, we are asked to make a prediction about a particular panel given the preceding n panels as context.13 Each panel consists of three distinct elements: image, text (OCR output), and textbox bounding box coordinates. For any panel $p_i$, the corresponding image is $z_i$. Since there can be multiple textboxes per panel, we refer to individual textbox contents and bounding boxes as $t_{i_x}$ and $b_{i_x}$, respectively. Each of our tasks has a different set of answer candidates $A$: text cloze has three text candidates $t_{a_{1\ldots3}}$, visual cloze has three image candidates $z_{a_{1\ldots3}}$, and character coherence has two combinations of text / bounding box pairs, $\{t_{a_1}/b_{a_1}, t_{a_2}/b_{a_2}\}$ and $\{t_{a_1}/b_{a_2}, t_{a_2}/b_{a_1}\}$. Our architectures differ mainly in the encoding function $g$ that converts a sequence of context panels $p_{i-1}, p_{i-2}, \ldots, p_{i-n}$ into a fixed-length vector $c$. We score the answer candidates by taking their inner product with $c$ and normalizing with the softmax function,

$s = \mathrm{softmax}(A^{\top} c)$,   (6.1)

and we minimize the cross-entropy loss against the ground-truth labels.14
13Test and validation instances for all tasks come from comic books that are unseen during training.
14Performance falters slightly on a development set with contrastive max-margin loss functions (Socher et al., 2014) in place of our softmax alternative.

Text-only: The text-only baseline only has access to the text $t_{i_x}$ within each panel. Our encoding function $g$ processes this text on multiple levels: we first compute a representation for each $t_{i_x}$ with a word embedding sum15 and then combine multiple textboxes within the same panel using an intrapanel LSTM (Hochreiter and Schmidhuber, 1997). Finally, we feed the panel-level representations to an interpanel LSTM and take its final hidden state as the context representation (Figure 6.8). For text cloze, the answer candidates are also encoded with a word embedding sum; for visual cloze, we project the 4096-d fc7 layer of VGG-16 down to the word embedding dimensionality with a fully-connected layer.16
15As in previous work on visual question answering (Zhou et al., 2015), we observe no noticeable improvement with more sophisticated encoding architectures.
16For training and testing, we use three panels of context and three candidates. We use a vocabulary size of 30,000 words, restrict the maximum number of textboxes per panel to three, and set the dimensionality of word embeddings and LSTM hidden states to 256. Models are optimized using Adam (Kingma and Ba, 2014) for ten epochs, after which we select the best-performing model on the dev set.

Image-only: The image-only baseline is even simpler: we feed the fc7 features of each context panel to an LSTM and use the same objective function as before to score candidates. For visual cloze, we project both the context and answer representations to 512-d with additional fully-connected layers before scoring. While the COMICS dataset is certainly large, we do not attempt to learn visual features from scratch, as our task-specific signals are far more complicated than simple image classification. We also try fine-tuning the lower-level layers of VGG-16 (Aytar et al., 2016); however, this substantially lowers task accuracy even with very small learning rates for the fine-tuned layers.
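A minimal PyTorch sketch of this hierarchical encoder may help make the notation concrete: each textbox is a word-embedding sum, an intrapanel LSTM combines textboxes within a panel, the panel text vector is fused with fc7 image features, and an interpanel LSTM produces the context vector that scores candidates as in Equation 6.1. Dimensions follow footnote 16 where stated; the module names, the ReLU after the fusion layer, and the choice of PyTorch are my assumptions, and the fusion shown here corresponds to the image-text variant described just below.

```python
import torch
import torch.nn as nn

class HierarchicalImageText(nn.Module):
    """Sketch of the hierarchical image-text context encoder."""
    def __init__(self, vocab_size=30000, dim=256, fc7_dim=4096):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)
        self.intrapanel = nn.LSTM(dim, dim, batch_first=True)
        self.fuse = nn.Linear(dim + fc7_dim, dim)
        self.interpanel = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, text, images, candidates):
        # text: (batch, n_panels, n_boxes, n_words) word ids
        # images: (batch, n_panels, fc7_dim) pretrained VGG-16 fc7 features
        # candidates: (batch, n_cands, dim) pre-encoded answer candidates
        b, n_panels, n_boxes, _ = text.shape
        box_vecs = self.embed(text).sum(dim=3)            # word-embedding sum per textbox
        box_vecs = box_vecs.view(b * n_panels, n_boxes, -1)
        _, (panel_text, _) = self.intrapanel(box_vecs)    # final hidden state per panel
        panel_text = panel_text.squeeze(0).view(b, n_panels, -1)
        panels = torch.relu(self.fuse(torch.cat([panel_text, images], dim=-1)))
        _, (context, _) = self.interpanel(panels)         # context vector c
        c = context.squeeze(0)                            # (batch, dim)
        # inner product of c with each candidate, as in Eq. 6.1
        return torch.bmm(candidates, c.unsqueeze(2)).squeeze(2)

# Training would minimize nn.CrossEntropyLoss() on these scores against gold labels.
```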
Image-text: We combine the previous two models by concatenating the output of the intrapanel LSTM with the fc7 representation of the image and passing the result through a fully-connected layer before feeding it to the interpanel LSTM (Figure 6.8). For text cloze and character coherence, we also experiment with a variant of the image-text baseline that has no access to the context panels, which we dub NC-image-text. In this model, the scoring function computes inner products between the image features of $p_i$ and the text candidates.17
17We cannot apply this model to visual cloze because we are not allowed access to the artwork in panel $p_i$.

Model            Text Cloze         Visual Cloze       Char. Coherence
                 easy     hard      easy     hard
Random           33.3     33.3      33.3     33.3      50.0
Text-only        63.4     52.9      55.9     48.4      68.2
Image-only       51.7     49.4      85.7     63.2      70.9
NC-image-text    63.1     59.6      –        –         65.2
Image-text       68.6     61.0      81.3     59.1      69.3
Human            –        84        –        88        87

Table 6.2: Combining image and text in neural architectures improves their ability to predict the next image or dialogue in COMICS narratives. The contextual information present in preceding panels is useful for all tasks: the model that only looks at a single panel (NC-image-text) always underperforms its context-aware counterpart. However, even the best-performing models lag well behind humans.

6.8 Error Analysis
Table 6.2 contains our full experimental results, which we briefly summarize here. On text cloze, the image-text model dominates those trained on a single modality. However, text is much less helpful for visual cloze than it is for text cloze, suggesting that visual similarity dominates the former task. Having the context of the preceding panels helps across the board, although the improvements are lower in the hard setting. There is more variation across the models in the easy setting; we hypothesize that the hard case requires moving away from pretrained image features, and transfer learning methods may prove effective here. Differences between models on character coherence are minor; we suspect that more complicated attentional architectures that leverage the bounding box locations $b_{i_x}$ are necessary to "follow" speech bubble tails to the characters who speak them.

We also compare all models to a human baseline, for which the authors manually solve one hundred instances of each task (in the hard setting) given the same preprocessed input that is fed to the neural architectures. Most human errors are the result of poor OCR quality (e.g., misspelled words) or low image resolution. Humans comfortably outperform all models, making it worthwhile to look at where computers fail but humans succeed.

Figure 6.9: Three text cloze examples from the development set, shown with a single panel of context (boxed candidates are predictions by the image-text model). The airplane artwork in the top row helps the image-text model choose the correct answer, while the text-only model fails because the dialogue lacks contextual information. Conversely, the bottom two rows show the image-text model ignoring the context in favor of choosing a candidate that mentions something visually present in the last panel.
The top row of Figure 6.9 demonstrates an instance from easy text cloze where the image helps the model make the correct prediction. The text-only model has no idea that an airplane (referred to here as a "ship") is present in the panel sequence, as the dialogue in the context panels makes no mention of it. In contrast, the image-text model is able to use the artwork to rule out the two incorrect candidates. The bottom two rows of Figure 6.9 show hard text cloze instances in which the image-text model is deceived by the artwork in the final panel. While the final panel of the middle row does contain what looks to be a creek, "catfish creek jail" is more suited for a narrative box than a speech bubble, while the meaning of the correct candidate is obscured by the dialect and out-of-vocabulary token. Similarly, a camera films a fight scene in the last row; the model selects a candidate that describes a fight instead of focusing on the context in which the scene occurs. These examples suggest that the contextual information is overridden by strong associations between text and image, motivating architectures that go beyond similarity by leveraging external world knowledge to determine whether an utterance is truly appropriate in a given situation.

6.9 Related Work
Our work is related to three main areas: (1) multimodal tasks that require language and vision understanding, (2) computational methods that focus on non-natural images, and (3) models that characterize language-based narratives.

Deep learning has renewed interest in jointly reasoning about vision and language. Datasets such as MS COCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2016) have enabled image captioning (Vinyals et al., 2015; Karpathy and Li, 2015; Xu et al., 2015) and visual question answering (Malinowski et al., 2015; Lu et al., 2016). Similar to our character coherence task, researchers have built models that match TV show characters with their visual attributes (Everingham et al., 2006) and speech patterns (Haurilet et al., 2016). Closest to our own comic book setting is the visual storytelling task, in which systems must generate (Huang et al., 2016a) or reorder (Agrawal et al., 2016) stories given a dataset (SIND) of photos from Flickr galleries of "storyable" events such as weddings and birthday parties. SIND's images are fundamentally different from COMICS in that they lack coherent characters and accompanying dialogue. Comics are created by skilled professionals, not crowdsourced workers, and they offer a far greater variety of character-centric stories that depend on dialogue to further the narrative; with that said, the text in COMICS is less suited for generation because of OCR errors.

We build here on previous work that attempts to understand non-natural images. Zitnick et al. (2016) discover semantic scene properties from a clip art dataset featuring characters and objects in a limited variety of settings. Applications of deep learning to paintings include tasks such as detecting objects in oil paintings (Crowley and Zisserman, 2014; Crowley et al., 2015) and answering questions about artwork (Guha et al., 2016).
Previous computational work on comics focuses primarily on extracting elements such as panels and textboxes (Rigaud, 2014); in addition to the references in Section 6.2, there is a large body of segmentation research on manga (Aramaki et al., 2014; Pang et al., 2014a; Matsui, 2015; Kovanen and Aizawa, 2015). To the best of our knowledge, we are the first to computationally model content in comic books as opposed to just extracting their elements. We follow previous work in language-based narrative understanding; very similar to our text cloze task is the "Story Cloze Test" (Mostafazadeh et al., 2016), in which models must predict the ending to a short (four sentences long) story. Just like our tasks, the Story Cloze Test proves difficult for computers and motivates future research into commonsense knowledge acquisition. Others have studied characters (Elson et al., 2010; Bamman et al., 2014; Iyyer et al., 2016) and narrative structure (Schank and Abelson, 1977; Lehnert, 1981; Chambers and Jurafsky, 2009) in novels.

6.10 Conclusion & Future Work
We present the COMICS dataset, which contains over 1.2 million panels from "Golden Age" comic books. We design three cloze-style tasks on COMICS to explore closure, or how readers connect disparate panels into coherent stories. Experiments with different neural architectures, along with a manual data analysis, confirm the importance of multimodal models that combine text and image for comics understanding. We additionally show that context is crucial for predicting narrative or character-centric aspects of panels. However, for computers to reach human performance, they will need to become better at leveraging context. Readers rely on commonsense knowledge to make sense of dramatic scene and camera changes; how can we inject such knowledge into our models? Another potentially intriguing direction, especially given recent advances in generative adversarial networks (Goodfellow et al., 2014), is generating artwork given dialogue (or vice versa). Finally, COMICS presents a golden opportunity for transfer learning; can we train models that generalize across natural and non-natural images much like humans do?

Chapter 7 Conclusion
In this thesis, I have explored a variety of deep neural network architectures for tasks that involve both small and large-scale discourse-level understanding. These tasks encompass a wide variety of contexts, from short questions to sequences of images; the unifying factor is our ability to tackle all of them using a collection of neural network modules. The work here departs from traditional NLP work on discourse by not specifying the type of relations between different units of text as we would in rhetorical structure theory (see Section 2.5.1 for details), for example. Instead, neural networks implicitly reason about these connections; in the relationships modeling work of Chapter 5, we do not specify anything other than the number of relationship types, and the model fills them by leveraging patterns it learns from the input data. Deep learning holds much promise for discourse-level representation learning; to wrap up my thesis, I will first recap each chapter before offering proposed directions for future research in this area.

7.1 Contextual Question Answering
I first presented two question-answering tasks, quiz bowl and sequential semantic parsing, which focused on small paragraph-length inputs.
For quiz bowl, we have complete supervision in the form of question-answer pairs, and simple vector averaging serves to effectively aggregate information across sentences. In contrast, our semantic parsing 119 setting is not fully supervised: we have question-answer pairs but lack the ground-truth intermediate logical form. This makes both answering questions and using information from previous questions more difficult than in quiz bowl; we settle on a modular neural network trained via structured output learning. 7.1.1 Future Directions for QA For both tasks, we look at small context sizes, which was necessary to make the problems approachable. Here I outline future directions that seek to expand context complexity and size. Quiz bowl Quiz bowl contains questions about language domains such as literature with huge, convoluted contexts (as we saw with the fictional relationships in Chapter 5). Simply training on paragraph-long questions is not enough to answer these questions at early positions, especially if the clues do not occur during training. For example, take this clue from a question on Henrik Ibsen’s “A Doll’s House”: In a scene from this play, one character practices a tarantella to prevent another character from opening his mail. We rarely expect to see references to this particular scene during training, if at all, as it is relatively obscure. The only way to answer clues like this is to allow the model access to the raw source material, which opens up another can of worms: how do we map this clue to scenes in the play? This “inverse summarization” problem is a potentially very interesting avenue of research, as it requires both small and large-scale context understanding. 120 Sequential semantic parsing The SQA dataset is a first step in the direction of conver- sational QA. With that said, it is not by any means a perfect simulation of real conversation. We assume that the user is always asking a question at each turn, and that the computer is always answering it. This paradigm is of course not always true in real life, as computers may want to ask clarifying questions given underspecified queries, and users may want to incorporate more chat-like turns rather than bombarding the computer with questions in an effort to make the conversation seem more natural. To bring the task closer to real-world conversation, we need to equip our QA models with the ability to generate language in addition to existing semantic parse functionality. This is a challenging goal because not only does the network need to generate grammatical, meaningful utterances, but it also needs to decide when to switch to “chat mode“ and when to execute a semantic parse over a knowledge base. 7.2 Comprehending Novels and Comics The most difficult problems I tackle in this thesis are applications of deep learning to creative domains. In the QA tasks, we looked at short contexts in a relatively small answer space; for quiz bowl, we have a fixed set of a few thousand answers, while for SQA each question’s answer space is defined by its corresponding table. In Chapter 5, I propose a neural network architecture to model the dynamics of fictional relationships that span entire novels, while in Chapter 6 I explore comic narrative understanding with deep learning. 121 7.2.1 Future Directions in Creative Understanding There is a long (and perhaps impossible) road ahead for machines before they can “read” a novel or comic book at the same level of understanding as human readers. 
For one, humans possess a wealth of world knowledge and personal experiences that they can access while reading, while neural networks start with a blank slate and have to pick up world knowledge and commonsense reasoning just from their training data. An important open question is exactly how to bake world knowledge into these networks prior to or during training; can purely unsupervised methods learn this knowledge from raw text, or do we need to leverage large annotated resources? Regarding the relationship modeling task, there are many avenues for further re- search. For example, the RMN model ignores asymmetric relationships, and so modeling unrequited love and other more complex relationships is not feasible within the current framework. To capture these sorts of relationships, we cannot use bag-of-words models like the DAN to compose span representations, as syntactic features are crucial to deter- mine agent-patient relationships and other features indicative of asymmetry; architectures such as TreeNNs could perhaps be of value here. Another potential future direction is considering the entire book rather than just spans of text that contain relevant character mentions, as interactions between other characters and descriptive language about the environment are useful sources of information. Finally, in addition to the future directions mentioned in Chapter 6 regarding the comic books project, there are more ambitious tasks to be solved in this domain. Generating dialogue and artwork using adversarial networks is one such application. A less lofty goal 122 might be to accomplish panel reordering: given a set of n panels from a comic book page, can a model learn to sort them into the correct order? To solve this task for large values of n, the model needs a lot of external knowledge; pretraining on frames from movies is a concrete way to inject such knowledge into the network. 123 Bibliography Alekh Agarwal, Animashree Anandkumar, Prateek Jain, Praneeth Netrapalli, and Rashish Tandon. Learning sparsely used overcomplete dictionaries. In Proceedings of Conference on Learning Theory, 2014. Harsh Agrawal, Arjun Chandrasekaran, Dhruv Batra, Devi Parikh, and Mohit Bansal. Sort story: Sorting jumbled images and captions into stories. In Proceedings of Empirical Methods in Natural Language Processing, 2016. Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Learning to compose neural networks for question answering. In Conference of the North American Chapter of the Association for Computational Linguistics, 2016. Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In International Conference on Computer Vision, 2015. Yuji Aramaki, Yusuke Matsui, Toshihiko Yamasaki, and Kiyoharu Aizawa. Interactive segmentation for manga. In Special Interest Group on Computer Graphics and Interactive Techniques Conference, 2014. Yoav Artzi and Luke Zettlemoyer. Bootstrapping semantic parsers from conversations. In Proceedings of Empirical Methods in Natural Language Processing, 2011. Yusuf Aytar, Lluis Castrejon, Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Cross-modal scene networks. arXiv, 2016. URL http://arxiv.org/abs/1610.09003. David Bamman, Brendan O’Connor, and Noah A. Smith. Learning latent personas of film characters. In Proceedings of the Association for Computational Linguistics, 2013. David Bamman, Ted Underwood, and Noah A. Smith. 
A Bayesian mixed effects model of literary character. In Proceedings of the Association for Computational Linguistics, 2014. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspec- tives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013. Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. Unsupervised transcription of historical documents. In Proceedings of the Association for Computational Linguistics, 2013. David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 2003. Jordan Boyd-Graber, Brianna Satinoff, He He, and Hal Daume´ III. Besting the quiz master: Crowdsourcing incremental classification games. In Proceedings of Empirical Methods in Natural Language Processing, 2012. Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li, and Heng Ji. A novel neural topic model and its supervised extension. In Association for the Advancement of Artificial Intelligence, 2015. Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. Building a discourse-tagged corpus in the framework of rhetorical structure theory. In Current and new directions in discourse and dialogue, 2003. Patrick J Carroll, Jason R Young, and Michael S Guertin. Visual analysis of cartoons: A view from the far side. In Eye movements and visual cognition. Springer, 1992. Nathanael Chambers and Dan Jurafsky. Unsupervised learning of narrative schemas and their participants. In Proceedings of the Association for Computational Linguistics, 2009. 124 Nathanael Chambers and Daniel Jurafsky. Unsupervised learning of narrative event chains. In Proceedings of the Association for Computational Linguistics, 2008. Jonathan Chang, Jordan Boyd-Graber, and David M Blei. Connections between the lines: augmenting social networks with text. In Knowledge Discovery and Data Mining, 2009a. Jonathan Chang, Sean Gerrish, Chong Wang, Jordan L Boyd-Graber, and David M Blei. Reading tea leaves: How humans interpret topic models. In Proceedings of Advances in Neural Information Processing Systems, 2009b. Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In British Machine Vision Conference, 2014. Snigdha Chaturvedi, Shashank Srivastava, Hal Daume´ III, and Chris Dyer. Modeling dynamic relationships between characters in literary novels. In Association for the Advancement of Artificial Intelligence, 2016. Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Proceedings of Empirical Methods in Natural Language Processing, 2014. J. Cognard-Black, M. Goldthwaite, and M. Nestle. Books That Cook: The Making of a Literary Meal. NYU Press, 2014. ISBN 9781479830213. URL https://books.google.com/books?id= 5IETCgAAQBAJ. Neil Cohn. The limits of time and transitions: challenges to theories of sequential image comprehension. Studies in Comics, 1(1), 2010. E. J. Crowley, O. M. Parkhi, and A. Zisserman. Face painting: querying art with photos. In British Machine Vision Conference, 2015. Elliot Crowley and Andrew Zisserman. The state of the art: Object retrieval in paintings using discriminative regions. In British Machine Vision Conference, 2014. Rajarshi Das, Manzil Zaheer, and Chris Dyer. Gaussian lda for topic models with word embeddings. 
In Proceedings of the Association for Computational Linguistics, 2015. Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William W Cohen, and Ruslan Salakhutdinov. Gated-attention readers for text comprehension. In Proceedings of the Association for Computational Linguistics, 2017. John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011. Will Eisner. Comics & Sequential Art. Poorhouse Press, 1990. Michael Elad and Michal Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12), 2006. Micha Elsner. Character-based kernels for novelistic plot structure. In Proceedings of the European Chapter of the Association for Computational Linguistics, 2012. David K Elson, Nicholas Dames, and Kathleen R McKeown. Extracting social networks from literary fiction. In Proceedings of the Association for Computational Linguistics, 2010. Mark Everingham, Josef Sivic, and Andrew Zisserman. Hello! my name is... Buffy” – automatic naming of characters in TV video. In Proceedings of the British Machine Vision Conference, 2006. Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open question answering over curated and extracted knowledge bases. In Knowledge Discovery and Data Mining, 2014. 125 Manaal Faruqui, Yulia Tsvetkov, Dani Yogatama, Chris Dyer, and Noah Smith. Sparse overcomplete word vector representations. In Proceedings of the Association for Computational Linguistics, 2015. Vanessa Wei Feng and Graeme Hirst. Text-level discourse parsing with rich linguistic features. In Proceedings of the Association for Computational Linguistics, 2012. Joseph L Fleiss. Measuring nominal scale agreement among many raters. Psychological bulletin, 76(5), 1971. Lucie Flekova and Iryna Gurevych. Personality profiling of fictional characters using sense-level links between lexical resources. In Proceedings of Empirical Methods in Natural Language Processing, 2015. T.C. Foster. How to Read Literature Like a Professor. HarperCollins, 2009. ISBN 9780061804069. URL https://books.google.com/books?id=PKeOyeZdVKwC. Tom Foulsham, Dean Wybrow, and Neil Cohn. Reading without words: Eye movements in the comprehension of comic strips. Applied Cognitive Psychology, 30, 2016. Stefan L Frank. Uncertainty reduction as a measure of cognitive load in sentence comprehension. Topics in Cognitive Science, 5(3), 2013. Alona Fyshe, Leila Wehbe, Partha P Talukdar, Brian Murphy, and Tom M Mitchell. A compositional and interpretable semantic space. In Conference of the North American Chapter of the Association for Computational Linguistics, 2015. Christoph Goller and Andreas Kuchler. Learning task-dependent distributed representations by backprop- agation through structure. In Neural Networks, 1996., IEEE International Conference on, volume 1, 1996. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of Advances in Neural Information Processing Systems, 2014. Michael D Gordin, Helen Tilley, and Gyan Prakash. Utopia/dystopia: conditions of historical possibility. Princeton University Press, 2010. Ron Goulart. Comic Book Encyclopedia: The Ultimate Guide to Characters, Graphic Novels, Writers, and Artists in the Comic Book Universe. HarperCollins, 2004. Amit Goyal, Ellen Riloff, and Hal Daume´ III. A computational model for plot units. 
Computational Intelligence Journal, 29(3), 2013. Amit Gruber, Yair Weiss, and Michal Rosen-Zvi. Hidden topic markov models. In Proceedings of Artificial Intelligence and Statistics, 2007. Cle´ment Gue´rin, Christophe Rigaud, Antoine Mercier, Farid Ammar-Boudjelal, Karell Bertet, Alain Bouju, Jean-Christophe Burie, Georges Louis, Jean-Marc Ogier, and Arnaud Revel. eBDtheque: A representative database of comics. In International Conference on Document Analysis and Recognition, 2013. Anupam Guha, Mohit Iyyer, and Jordan Boyd-Graber. A distorted skull lies in the bottom center: Identifying paintings from text descriptions. In NAACL Human-Computer Question Answering Workshop, 2016. John Hale. Uncertainty about the rest of the sentence. Cognitive Science, 30(4), 2006. Michael Halliday and Ruqaiya Hasan. Cohesion in English. 1976. Sanda Harabagiu, Andrew Hickl, John Lehmann, and Dan Moldovan. Experiments with interactive question- answering. In Proceedings of the Association for Computational Linguistics, 2005. 126 Monica-Laura Haurilet, Makarand Tapaswi, Ziad Al-Halah, and Rainer Stiefelhagen. Naming TV characters by watching and analyzing dialogs. In IEEE Winter Conference on Applications of Computer Vision, 2016. Marti A Hearst. Texttiling: Segmenting text into multi-paragraph subtopic passages. Computational linguistics, 23(1), 1997. Karl Moritz Hermann, Edward Grefenstette, and Phil Blunsom. ”not not bad” is not ”bad”: A distributional account of negation. Proceedings of the ACL Workshop on Continuous Vector Space Models and their Compositionality, 2013. Karl Moritz Hermann, Dipanjan Das, Jason Weston, and Kuzman Ganchev. Semantic frame identification with distributed word representations. In Proceedings of the Association for Computational Linguistics, 2014. Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012. Anh Khoi Ngo Ho, Jean-Christophe Burie, and Jean-Marc Ogier. Panel and speech balloon extraction from comic books. In IAPR International Workshop on Document Analysis Systems, 2012. Jerry R Hobbs. Coherence and coreference. Cognitive science, 3(1), 1979. Sepp Hochreiter and Ju¨rgen Schmidhuber. Long short-term memory. Neural computation, 1997. Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, et al. Visual storytelling. In Conference of the North American Chapter of the Association for Computational Linguistics, 2016a. Ting-Hao (Kenneth) Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross B. Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. Visual storytelling. In Conference of the North American Chapter of the Association for Computational Linguistics, 2016b. Aapo Hyva¨rinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural networks, 13(4), 2000. Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daume´ III. A neural network for factoid question answering over paragraphs. In Proceedings of Empirical Methods in Natural Language Processing, 2014a. Mohit Iyyer, Peter Enns, Jordan Boyd-Graber, and Philip Resnik. Political ideology detection using recursive neural networks. 
In Proceedings of the Association for Computational Linguistics, 2014b. Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daume´ III. Deep unordered composition rivals syntactic methods for text classification. In Proceedings of the Association for Computational Linguistics, 2015. Mohit Iyyer, Anupam Guha, Snigdha Chaturvedi, Jordan Boyd-Graber, and Hal Daume´ III. Feuding families and former friends: Unsupervised learning for dynamic fictional relationships. In Conference of the North American Chapter of the Association for Computational Linguistics, 2016. Mohit Iyyer, Varun Manjunatha, Anupam Guha, Yogarshi Vyas, Jordan Boyd-Graber, Hal Daume´ III, and Larry Davis. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives. In Computer Vision and Pattern Recognition, 2017a. 127 Mohit Iyyer, Wen tau Yih, and Ming-Wei Chang. Search-based neural structured learning for sequential question answering. In Proceedings of the Association for Computational Linguistics, 2017b. Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1), 2016. Yangfeng Ji and Jacob Eisenstein. Representation learning for text-level discourse parsing. In Proceedings of the Association for Computational Linguistics, 2014. Matt L. Jockers. Macroanalysis: Digital Methods and Literary History. Topics in the Digital Humanities. University of Illinois Press, 2013. ISBN 9780252094767. URL http://books.google.com/ books?id=mPOdxQgpOSUC. Andrej Karpathy and Fei-Fei Li. Deep visual-semantic alignments for generating image descriptions. In Computer Vision and Pattern Recognition, 2015. Diane Kelly and Jimmy Lin. Overview of the trec 2006 ciQA task. In ACM SIGIR Forum, volume 41, 2007. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2014. Samu Kovanen and Kiyoharu Aizawa. A layered method for determining manga text bubble reading order. In International Conference on Image Processing, 2015. Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, Michael Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. In International Journal of Computer Vision, 2016. Vinodh Krishnan and Jacob Eisenstein. “You’re Mr. Lebowski, I’m The Dude”: Inducing address term formality in signed social networks. In Conference of the North American Chapter of the Association for Computational Linguistics, 2015. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Proceedings of Advances in Neural Information Processing Systems, 2012. Angeliki Lazaridou, Elia Bruni, and Marco Baroni. Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world. In Proceedings of the Association for Computational Linguistics, 2014. Quoc V Le and Tomas Mikolov. Distributed representations of sentences and documents. In Proceedings of the International Conference of Machine Learning, 2014. Wendy G Lehnert. Plot units and narrative summarization. Cognitive Science, 5(4), 1981. Roger Levy. Expectation-based syntactic comprehension. Cognition, 106(3), 2008. Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 
A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of the Association for Computational Linguistics, 2015. Luyuan Li, Yongtao Wang, Zhi Tang, and Liangcai Gao. Automatic comic page segmentation based on polygon detection. Multimedia Tools and Applications, 69(1), 2014. Chen Liang, Jonathan Berant, Quoc Le, Kenneth D. Forbus, and Ni Lao. Neural symbolic machines: Learning semantic parsers on Freebase with weak supervision. In Annual Meeting of the Association for Computational Linguistics (ACL), 2017. 128 Percy Liang. Learning executable semantic parsers for natural language understanding. Communications of the ACM, 59(9), 2016. Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dolla´r, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, 2014. Reginald Long, Panupong Pasupat, and Percy Liang. Simpler context-dependent logical forms via model projections. In Proceedings of the Association for Computational Linguistics, 2016. Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. Deep learning with dynamic computation graphs. In Proceedings of the International Conference on Learning Representations, 2017. Jiasen Lu, Jianwei Yang, Dhruv Batra, and Devi Parikh. Hierarchical question-image co-attention for visual question answering, 2016. Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Computer Vision and Pattern Recognition, 2015. William C Mann and Sandra A Thompson. Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse, 8, 1988. Philip Massey, Patrick Xia, David Bamman, and Noah A Smith. Annotating character relationships in literary texts. arXiv:1512.00728, 2015. Yusuke Matsui. Challenge for manga processing: Sketch-based manga retrieval. In Proceedings of the 23rd Annual ACM Conference on Multimedia, 2015. Yusuke Matsui, Kota Ito, Yuji Aramaki, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. arXiv preprint arXiv:1510.04389, 2015. Jon D Mcauliffe and David M Blei. Supervised topic models. In Proceedings of Advances in Neural Information Processing Systems, 2008. Scott McCloud. Understanding Comics. HarperCollins, 1994. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013. Eleni Miltsakaki, Rashmi Prasad, Aravind K Joshi, and Bonnie L Webber. The penn discourse treebank. In International Language Resources and Evaluation, 2004. Guy M Morton. A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Co, 1966. Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. A corpus and cloze evaluation for deeper understanding of com- monsense stories. In Conference of the North American Chapter of the Association for Computational Linguistics, 2016. Arvind Neelakantan, Quoc Le, Martin Abadi, Andrew McCallum, and Dario Amodei. Learning a natural language interface with neural programmer. In Proceedings of the International Conference on Learning Representations, 2017. 
129 Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980, 2017. Viet-An Nguyen, Jordan Boyd-Graber, Philip Resnik, Deborah Cai, Jennifer Midberry, and Yuanxin Wang. Modeling topic control to detect influence in conversations using nonparametric topic models. Machine Learning, 95, 2014. Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23), 1997. J Walker Orr, Prasad Tadepalli, Janardhan Rao Doppa, Xiaoli Fern, and Thomas G Dietterich. Learning scripts as hidden markov models. In Association for the Advancement of Artificial Intelligence, 2014. Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, and Tom M. Mitchell. Zero-shot learning with semantic output codes. In Proceedings of Advances in Neural Information Processing Systems, 2009. Xufang Pang, Ying Cao, Rynson W. H. Lau, and Antoni B. Chan. A robust panel extraction method for manga. In Proceedings of the ACM International Conference on Multimedia, 2014a. Xufang Pang, Ying Cao, Rynson WH Lau, and Antoni B Chan. A robust panel extraction method for manga. In Proceedings of the ACM International Conference on Multimedia, 2014b. Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the International Conference of Machine Learning, 2013. Rebecca J Passonneau and Diane J Litman. Discourse segmentation by human and automated means. Computational Linguistics, 23(1), 1997. P. Pasupat and P. Liang. Zero-shot entity extraction from web pages. In Proceedings of the Association for Computational Linguistics, 2014. Panupong Pasupat and Percy Liang. Compositional semantic parsing on semi-structured tables. In Proceed- ings of the Association for Computational Linguistics, 2015. Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing, 2014. Jordan B Pollack. Recursive distributed representations. Artificial Intelligence, 1990. Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of Advances in Neural Information Processing Systems, 2015. Christophe Rigaud. Segmentation and indexation of complex objects in comic book images. PhD thesis, University of La Rochelle, France, 2014. Christophe Rigaud, Jean-Christophe Burie, Jean-Marc Ogier, Dimosthenis Karatzas, and Joost Van de Weijer. An active contour model for speech balloon detection in comics. In International Conference on Document Analysis and Recognition, 2013. Christophe Rigaud, Cle´ment Gue´rin, Dimosthenis Karatzas, Jean-Christophe Burie, and Jean-Marc Ogier. Knowledge-driven understanding of images in comic books. International Journal on Document Analysis and Recognition, 18(3), 2015. 130 D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, 1986. Gerard Salton and Michael J McGill. 
Introduction to modern information retrieval. 1986. Roger Schank and Robert Abelson. Scripts, Plans, Goals and Understanding: An Inquiry into Human Knowledge Structures. L. Erlbaum, 1977. Ray Smith. An overview of the tesseract ocr engine. In International Conference on Document Analysis and Recognition, 2007. Richard Socher, Eric H. Huang, Jeffrey Pennington, Andrew Y. Ng, and Christopher D. Manning. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In Proceedings of Advances in Neural Information Processing Systems, 2011a. Richard Socher, Jeffrey Pennington, Eric H. Huang, Andrew Y. Ng, and Christopher D. Manning. Semi- Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of Empirical Methods in Natural Language Processing, 2011b. Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of Empirical Methods in Natural Language Processing, 2013. Richard Socher, Quoc V Le, Christopher D Manning, and Andrew Y Ng. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2014. Radu Soricut and Daniel Marcu. Sentence level discourse parsing using syntactic and lexical information. In Conference of the North American Chapter of the Association for Computational Linguistics, 2003. Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 2014. Shashank Srivastava, Snigdha Chaturvedi, and Tom Mitchell. Inferring interpersonal relations in narrative summaries. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, 2016. Edmund M. Talley, David Newman, David Mimno, Bruce W. Herr, Hanna M. Wallach, Gully A. P. C. Burns, A. G. Miriam Leenders, and Andrew McCallum. Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6), 2011. Takamasa Tanaka, Kenji Shoji, Fubito Toyama, and Juichi Miyamichi. Layout analysis of tree-structured scene frames in comic images. In International Joint Conference on Artificial Intelligence, 2007. Duyu Tang, Bing Qin, and Ting Liu. Document modeling with gated recurrent neural network for sentiment classification. In Proceedings of Empirical Methods in Natural Language Processing, 2015. Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In Proceedings of Advances in Neural Information Processing Systems, 2004. Wilson L Taylor. Cloze procedure: a new tool for measuring readability. Journalism and Mass Communica- tion Quarterly, 30(4), 1953. Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. Journal of machine learning research, 6(Sep):1453–1484, 2005. 131 Nicolas Usunier, David Buffoni, and Patrick Gallinari. Ranking with ordered weighted pairwise classification. In Proceedings of the International Conference of Machine Learning, 2009. Hardik Vala, David Jurgens, Andrew Piper, and Derek Ruths. Mr. bennet, his coachman, and the archbishop walk into a bar but only one of them gets recognized: On the difficulty of detecting characters in literary texts. In Proceedings of Empirical Methods in Natural Language Processing, 2015. 
Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of Machine Learning Research, 2008. Karen van Hoek. Anaphora and Conceptual Structure. Cognitive Theory of Language and Culture Series. University of Chicago Press, 1997. Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In Computer Vision and Pattern Recognition, 2015. Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78 (10), 1990. Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In International Joint Conference on Artificial Intelligence, 2011. Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992. Caiming Xiong, Stephen Merity, and Richard Socher. Dynamic memory networks for visual and textual question answering. In Proceedings of the International Conference of Machine Learning, 2016. Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference of Machine Learning, 2015. Wen-tau Yih, Ming-Wei Chang, Xiaodong He, and Jianfeng Gao. Semantic parsing via staged query graph generation: Question answering with knowledge base. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 1321–1331, July 2015. Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. The value of semantic parse labeling for knowledge base question answering. In Annual Meeting of the Association for Computational Linguistics (ACL), pages 201–206, Berlin, Germany, August 2016. Ilinca Zarifopol-Johnston. To kill a text: the dialogic fiction of Hugo, Dickens, and Zola. University of Delaware Press, 1995. Luke S Zettlemoyer and Michael Collins. Learning context-dependent mappings from sentences to logical form. In Proceedings of the Association for Computational Linguistics, 2009. Bolei Zhou, Yuandong Tian, Sainbayar Sukhbaatar, Arthur Szlam, and Rob Fergus. Simple baseline for visual question answering. arXiv preprint arXiv:1512.02167, 2015. C. Lawrence Zitnick, Ramakrishna Vedantam, and Devi Parikh. Adopting abstract images for semantic scene understanding. IEEE Trans. Pattern Anal. Mach. Intell., 38(4):627–638, 2016. 132