ABSTRACT Title of dissertation: TEACHING MACHINES TO ASK USEFUL CLARIFICATION QUESTIONS Sudha Rao Doctor of Philosophy, 2018 Dissertation directed by: Professor Hal Daumé III Computer Science, University of Maryland Inquiry is fundamental to communication, and machines cannot effectively collaborate with humans unless they can ask questions. Asking questions is also a natural way for machines to express uncertainty, a task of increasing importance in an automated society. In the field of natural language processing, despite decades of work on question answering, there is relatively little work in question asking. More- over, most of the previous work has focused on generating reading comprehension style questions which are answerable from the provided text. The goal of my disser- tation work, on the other hand, is to understand how can we teach machines to ask clarification questions that point at the missing information in a text. Primarily, we focus on two scenarios where we find such question asking to be useful: (1) clarifi- cation questions on posts found in community-driven technical support forums such as StackExchange (2) clarification questions on descriptions of products in e-retail platforms such as Amazon. In this dissertation we claim that, given large amounts of previously asked questions in various contexts (within a particular scenario), we can build machine learning models that can ask useful questions in a new unseen context (within the same scenario). In order to validate this hypothesis, we firstly create two large datasets of context paired with clarification question (and answer) for the two scenar- ios of technical support and e-retail by automatically extracting these information from available datadumps of StackExchange and Amazon. Given these datasets, in our first line of research, we build a machine learning model that first extracts a set of candidate clarification questions and then ranks them such that a more useful question would be higher up in the ranking. Our model is inspired by the idea of expected value of perfect information: a good question is one whose expected answer will be useful. We hypothesize that by explicitly modeling the value added by an answer to a given context, our model can learn to identify more useful questions. We evaluate our model against expert human judgments on the StackExchange dataset and demonstrate significant improvements over controlled baselines. In our second line of research, we build a machine learning model that learns to generate a new clarification question from scratch, instead of ranking previously seen questions. We hypothesize that we can train our model to generate good clarification questions by incorporating the usefulness of an answer to the clarification question into the recent sequence-to-sequence based neural network approaches. We develop a Generative Adversarial Network (GAN) where the generator is a sequence-to- sequence model and the discriminator is a utility function that models the value of updating the context with the answer to the clarification question. We evaluate our model on our two datasets of StackExchange and Amazon, using both automatic metrics and human judgments of usefulness, specificity and relevance, showing that our approach outperforms both a retrieval-based model and ablations that exclude the utility model and the adversarial training. We observe that our question generation model generates questions that range a wide spectrum of specificity to the given context. 
We argue that generating questions at a desired level of specificity (to a given context) can be useful in many scenarios. In our last line of research we, therefore, build a question generation model which given a context and a level of specificity (generic or specific), generates a question at that level of specificity. We hypothesize that by providing the level of specificity of the question to our model during training time, it can learn patterns in the question that indicate the level of specificity and use those to generate questions at a desired level of specificity. To automatically label the large number of questions in our training data with the level of specificity, we train a binary classifier which given a context and a question, predicts whether the question is specific (to the context) or generic. We demonstrate the effectiveness of our specificity-controlled question generation model by evaluating it on the Amazon dataset using human judgements. TEACHING MACHINES TO ASK USEFUL CLARIFICATIONS QUESTIONS by Sudha Rao Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2018 Advisory Committee: Professor Hal Daumé III, Chair/Advisor Professor Philip Resnik Professor Marine Carpuat Professor Jordan Boyd-Graber Professor David Jacobs Professor Lucy Vanderwende ©c Copyright by Sudha Rao 2018 Dedication To my loving amma (mom), appa (dad) and akka (sister). ii Acknowledgments I have been very fortunate to have got the opportunity to pursue my PhD at University of Maryland. I really enjoyed my time at UMD and I would like to thank a number of people at UMD without whom my work would not have been possible. First of, I would like to thank my wonderful advisor Hal Daumé III who guided me throughout my PhD and provided me with valuable advice. Our lengthy discussions helped me think through my ideas and motivated me to explore interesting new avenues of research. He taught me how to write papers including how to motivate the work and be thorough in explaining the technical details. I also learnt from him how to design experiments and ask the right research questions when evaluating proposed models. I am also thankful to Hal for all his career advice and for encouraging me to always aim for the best. I am also very grateful to Philip Resnik who I worked very closely with during the first two years of my PhD. His constant emphasis on thinking about the big picture helped me put my work in perspective. His course on Computational Linguistics II was my most favorite course at UMD. His enthusiasm for the material was very contagious and helped foster my interest in NLP. I am also thankful to him for guiding me through my career choices and helping me build some useful connections in the NLP community. I am also very thankful to all my other committee members: Jordan Boyd- Graber, Marine Carpuat, David Jacobs and Lucy Vanderwende. Their valuable feedback on my thesis helped me improve my draft. A special thanks to Lucy Van- derwende for agreeing to be an external member on my committee. Her expertise in iii topics closely related to my work helped me broaden my research ideas. I was very fortunate to be a part of the amazing Computational Linguistics and Information Processing (CLIP) lab at UMD. 
The weekly CLIP talks, reading groups, paper clin- ics and the opportunity to have one-on-one meetings with the speakers of the CLIP talks, have been extremely crucial to my growth as a researcher. A big thank you to all the faculty members of the CLIP lab for taking the initiative and organizing all these for the benefit of the students in the CLIP lab. I would also like to thank all the other CLIP lab faculty members: Naomi Feldman, Doug Oard, Jimmy Lin, Vanessa Frias-Martinez and Louiqa Raschid. I am also thankful to the Language Science Community for providing me with the platform to discuss my research ideas in an interdisciplinary atmosphere. Presenting my work at the Language Science Lunch Talks helped me hone the skills of describing my work to a non-NLP audience. I would also like to thank LSC for organizing the yearly Language Science Day and Winter Storm where I enjoyed meeting all the other language enthusiasts at UMD. Finally, I would also like to thank Jennifer Story, Joe Webster, Janice Perrone and Tom Hurst at UMD for making all the administrative tasks easy. I would also like to take this opportunity to thank all my internship mentors. Thank you Daniel Marcu and Kevin Knight for providing me with my first research internship opportunity at ISI. Through this internship, I got my exposure to the new area of semantic parsing. I would also like to thank my Microsoft Research internship mentor Paul Mineiro. During this internship, I got my first hands-on experience with deep learning technology that later on turned out to be the dominant approach in my thesis. I would also like to thank my Grammarly internship mentor Joel Tetreault iv for teaching me how to set concrete goals to ensure a successful research project in a short duration of an internship. Finally, I would like to give my immense thanks to Rajiv Gandhi who was the most important mentor in my undergraduate life. I would not have thought about pursuing a PhD if not for his encouragement and guidance. My time in the CLIP lab would not have been enjoyable if not for the amazing students at the CLIP lab. Firstly, I would like to thank Allyson Ettinger, Yo- garshi Vyas, Xing Niu and Trista Cao who have not only been great lab mates but also amazing collaborators. Special thanks to my senior lab members Snigdha Chaturvedi, He He and Mohit Iyyer who always inspired me to aim for more by being incredible researchers and by providing me with timely support and guidance. I also thank Hal’s other students Amr Sharaf, Kiante Brantley, Khanh Nguyen and Shi Feng for providing me timely feedback on my paper drafts and presentations. I also thank all the other students in the CLIP lab who I have enjoyed having several interactions: Junhui Li, Yuening Hu, Ke Wu, Hadi Amiri, Amittai Axelrod, Ahmed Elgohary, Rashmi Sankepally, Pedro Rodriguez, Denis Peskov, Yulu Wang, Suraj Nair, Marianna Martindale, Hua He, Fenfei Gao, Viet-An Nguyen, Ke Zhai, Thang Nguyen, Ning Gao, Mossaab Bagdouri, Jinfeng Rao, Anupam Guha, Weiwei Yang, Joe Barrow, Petra Galuscakova, Jiaul Paik, Alvin Grissom II, John Morgan, Sweta Agarwal and Pranav Goel. Last but not the least, I would like to thank my close friends and family. Thank you Meethu Malu, Manaswi Saha, Pallabi Ghosh, Nidhi Shah, Jay Ghurye and Mary DePascale for being amazing roommates. Because of you all I had a v home away from my home. 
I would also like to thank all my other friends at UMD: Amit Chavan, Kartik Nayak, Bhaskar Ramasubramanian, Soham De, Man- ish Purohit, Anshul Sawant, Zamira Daw, Ramakrishna Padmanabhan and Ladan Najafizadeh. My PhD life would not have been fun without you guys. I thank my parents immensely for believing in me and encouraging me throughout my PhD. They supported me wholeheartedly even though this meant me staying away from them for years together. I would also like to thank my dear sister and brother-in-law for supporting me and my dearest little nephew and niece for helping me keep my calm through their loaded cuteness. I also thank my in-laws for their support and love throughout my PhD. Finally, I would like to thank my loving husband Amit Rao who supported me and continues to support me accomplish my dreams and goals. Thank you all. vi Table of Contents Dedication ii Acknowledgements iii List of Tables x List of Figures xii 1 Introduction 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Specific Scenarios Considered in this Dissertation . . . . . . . . . . . 3 1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 1.4 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4.1 Dataset Creation . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.4.2 Question Ranking Model . . . . . . . . . . . . . . . . . . . . . 10 1.4.3 Question Generation Model . . . . . . . . . . . . . . . . . . . 11 1.4.4 Specificity-Controlled Question Generation Model . . . . . . . 14 1.4.5 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . 15 2 Background 17 2.1 Definition and Importance of Clarification Questions . . . . . . . . . 17 2.2 Different Types of Question Generation Work . . . . . . . . . . . . . 19 2.2.1 Reading Comprehension Question Generation . . . . . . . . . 19 2.2.2 Question Generation in Dialogue . . . . . . . . . . . . . . . . 20 2.2.3 Question Generation for Text Understanding . . . . . . . . . . 22 2.2.4 Visual Question Generation . . . . . . . . . . . . . . . . . . . 24 2.2.5 Question Refinement to Help Question-Answering . . . . . . . 25 2.3 Approaches to Question Generation . . . . . . . . . . . . . . . . . . . 26 2.3.1 Syntactic Rule based Methods . . . . . . . . . . . . . . . . . . 26 2.3.2 Neural Network based Methods . . . . . . . . . . . . . . . . . 28 2.4 Relevant Neural Network Models . . . . . . . . . . . . . . . . . . . . 30 2.4.1 Feedforward neural network . . . . . . . . . . . . . . . . . . . 30 2.4.2 Recurrent neural network . . . . . . . . . . . . . . . . . . . . 32 vii 2.4.3 Sequence-to-sequence neural network . . . . . . . . . . . . . . 33 2.5 Generating Text with Stylistic Variations . . . . . . . . . . . . . . . . 35 3 Dataset Creation 38 3.1 StackExchange Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 38 3.2 Analysis of StackExchange Dataset . . . . . . . . . . . . . . . . . . . 42 3.3 Amazon Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4 Question Ranking Model 47 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2 Model description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.2.1 Question & answer candidate generator . . . . . . . . . . . . . 52 4.2.2 Answer modeling . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.2.3 Utility calculator . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.2.4 Our joint neural network model . . . . . . . . . . . . . . . . . 
57 4.3 Evaluation design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.3.1 Annotation scheme . . . . . . . . . . . . . . . . . . . . . . . . 61 4.3.2 Annotation analysis . . . . . . . . . . . . . . . . . . . . . . . . 62 4.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.4.1 Baseline methods . . . . . . . . . . . . . . . . . . . . . . . . . 63 4.4.2 Implementation details . . . . . . . . . . . . . . . . . . . . . . 66 4.4.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.4.3.1 Evaluating against expert annotations . . . . . . . . 68 4.4.3.2 Evaluating against the original question . . . . . . . 69 4.4.3.3 Excluding the original question . . . . . . . . . . . . 69 4.5 Example outputs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5 Question Generation Model 75 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 5.2 Training a Clarification Question Generator . . . . . . . . . . . . . . 76 5.2.1 Sequence-to-sequence Model for Question Generation . . . . . 79 5.2.2 Training the Generator to Optimize Utility . . . . . . . . . 80 5.2.3 Estimating a Utility Function from Historical Data . . . . . 82 5.2.4 Utility GAN for Clarification Question Generation . . . . . 82 5.2.5 Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.3.1 Baselines and Ablated Models . . . . . . . . . . . . . . . . . . 86 5.3.2 Experimental Details . . . . . . . . . . . . . . . . . . . . . . . 87 5.3.3 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . 89 5.3.3.1 Automatic Metrics . . . . . . . . . . . . . . . . . . . 89 5.3.3.2 Human Judgements . . . . . . . . . . . . . . . . . . 89 5.3.4 Automatic Metric Results . . . . . . . . . . . . . . . . . . . . 92 5.3.5 Human Judgements Analysis . . . . . . . . . . . . . . . . . . . 93 5.3.6 Analysis of System Outputs on Amazon Dataset . . . . . . . 95 viii 5.3.7 Analysis of System Outputs on Stack Exchange Dataset . . . 98 5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99 6 Specificity-Controlled Question Generation Model 103 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.3 Annotating Questions with Specificity Level . . . . . . . . . . . . . . 109 6.3.1 Annotation Design . . . . . . . . . . . . . . . . . . . . . . . . 109 6.3.2 Getting Specificity Levels from Annotations . . . . . . . . . . 110 6.4 Model for Automatically Predicting Specificity Level . . . . . . . . . 112 6.5 Specificity-Controlled Question Generation Model . . . . . . . . . . . 116 6.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 6.6.1 Specificity Classifier Results . . . . . . . . . . . . . . . . . . . 118 6.6.2 Question Generation Results . . . . . . . . . . . . . . . . . . . 120 6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 7 Conclusion 124 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 7.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 7.2.1 Using Multi-modal Context . . . . . . . . . . . . . . . . . . . 125 7.2.2 Using External Knowledge Sources . . . . . . . . . . . . . . . 
126 7.2.3 Interactive Search Queries . . . . . . . . . . . . . . . . . . . . 128 7.2.4 Question Asking in Writing Assistance . . . . . . . . . . . . . 128 7.2.5 Towards Intelligent Dialogue Agents . . . . . . . . . . . . . . 129 7.2.6 Question Asking to Help Build Reasoning . . . . . . . . . . . 130 7.2.7 Generalization Beyond Large Datasets . . . . . . . . . . . . . 131 A Crowdsourcing Annotation Details 133 A.1 Question Ranking Task Evaluation . . . . . . . . . . . . . . . . . . . 133 A.2 Question Generation Task Evaluation . . . . . . . . . . . . . . . . . . 136 A.3 Specificity Labeling Task . . . . . . . . . . . . . . . . . . . . . . . . . 136 Bibliography 148 ix List of Tables 3.1 Table above shows the sizes of the train, tune and test split of our dataset for three domains. . . . . . . . . . . . . . . . . . . . . . . . . 42 3.2 Likelihood of a post getting answered with and without a clarification question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1 Model performances on 500 samples when evaluated against the union of the “best” annotations (B1 ∪ B2), intersection of the “valid” an- notations (V 1 ∩ V 2) and the original question paired with the post in the dataset. The difference between the bold and the non-bold numbers is statistically significant with p < 0.05 as calculated using bootstrap test. p@k is the precision of the k questions ranked highest by the model and MAP is the mean average precision of the ranking predicted by the model. . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.2 Model performances on 500 samples when evaluated against the union of the “best” annotations (B1 ∪ B2) and intersection of the “valid” annotations (V 1 ∩ V 2), with the original question excluded. The difference between all numbers except the random and bag-of-ngrams are statistically insignificant. . . . . . . . . . . . . . . . . . . . . . . . 70 4.3 Example of human annotation from the askubuntu domain of our dataset. The questions are sorted by expected utility, given in the first column. The “best” annotation is marked with black ticks and the “valid”’ annotations are marked with grey ticks . . . . . . . . . 73 4.4 Examples of human annotation from the unix and superuser domain of our dataset. The questions are sorted by expected utility, given in the first column. The “best” annotation is marked with black ticks and the “valid”’ annotations are marked with grey ticks . . . . . 74 5.1 Sample product description from amazon.com paired with a clarifi- cation question and answer. . . . . . . . . . . . . . . . . . . . . . . . 77 5.2 Sample post from stackexchange.com paired with a clarification ques- tion and answer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 5.3 Inter-annotator agreement on the five criteria used in human-based evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91 x 5.4 Diversity as measured by the proportion of unique trigrams in model outputs. Bleu and Meteor scores using up to 10 references for the Amazon dataset and up to six references for the StackEx- change dataset. Numbers in bold are the highest among the models. All results for Amazon are on the entire test set whereas for StackEx- change they are on the 500 instances of the test set that have multiple references. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.5 Results of human judgments on model generated questions on 300 sample Home & Kitchen product descriptions. 
The options described in § 5.3.3 are converted to corresponding numeric range (see supple- mentary material). The difference between the bold and the non- bold numbers is statistically significant with p <0.05. Reference is excluded in the significance calculation. . . . . . . . . . . . . . . . . . 94 5.6 Example outputs from each of the systems for two product descrip- tions along with the usefulness and the specificity score given by hu- man annotators. Descriptions of scores are in the supplementary material. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 5.7 Example outputs from each of the systems for three product descrip- tions from the Home & Kitchen category of the Amazon dataset. . . 101 5.8 Example outputs from each of the systems for three posts of the Stack Exchange dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.1 Average specificity classifier accuracy under 10 fold cross validation on train set and test set using different feature sets. * denotes new features not present in the model by Louis and Nenkova (2011). . . . 118 6.2 Diversity as measured by the proportion of unique trigrams in model outputs. Bleu and Meteor scores are calculated using an average of 6 references under generic setting and using an average of 3 references under specific setting. The highest numbers within a column is in bold (except for diversity under generic setting where the lowest number is bold). . . . . . . . . . . . . . . . . . . . . . . . 119 6.3 Example outputs from each of the systems for a single product de- scription. g indicates generic token whereas s indicates specific token. 123 xi List of Figures 1.1 Sample post paired with a clarification question on StackExchange, an online question-answering forum. . . . . . . . . . . . . . . . . . . . 5 1.2 Sample product description paired with a clarification question on Amazon, an online shopping platform. . . . . . . . . . . . . . . . . . 6 2.1 An example of a reading comprehension passage and a question whose answer can be found in the given passage from Heilman (2011). . . . 19 2.2 An example interaction between a user and a system in the context of travel booking where they system asks questions to fill a set of predefined slots. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3 An example of an argument followed by questions that encourage further explanation of the argument from Liu et al. (2010) . . . . . . 23 2.4 Example of a system asking series of questions to simplify the original user query from Artzi and Zettlemoyer (2011) . . . . . . . . . . . . . 23 2.5 Example from the Visual Question Generation task (Mostafazadeh et al., 2016) and the Image Grounded Conversation task (Mostafazadeh et al., 2017). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 2.6 A fully connected feedforward neural network with an input layer, two hidden layers (with four hidden units each) and an output layer. 32 2.7 Recurrent neural network operating over the input sequence x one word at a time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 2.8 Sequence-to-sequence learning model which takes in an input se- quence and generates an output sequence one word at a time. . . . . 35 3.1 Example of a post on askubuntu.com where a user asked a clarifi- cation question in the comments section of the post following which the author of the post edited the post adding the missing information pointed out by the clarification question. . . . . . . . . . . . . . . . . 
39 3.2 Example of a post on askubuntu.com where a user asked a clarifica- tion question in the comments section of the post and the author of the post answered the question as a subsequent comment. . . . . . . . 41 3.3 Example of a product description on amazon.com followed by a clar- ification question and an answer to the question. . . . . . . . . . . . 46 xii 4.1 A post on an online Q & A forum “askubuntu.com” is updated to fill the missing information pointed out by the question comment. . . . . 48 4.2 We formulate our ranking problem as, given a post, first extract a set of ten candidate questions and then rank them such that a more useful question would be higher up in the ranking . . . . . . . . . . . 49 4.3 The behavior of our model during test time: Given a post p, we retrieve 10 posts similar to post p using Lucene. The questions asked to those 10 posts are our question candidates Q and the edits made to the posts in response to the questions (or the author’s response to the question in the comments section) are our answer candidates A. For each question candidate qi, we generate an answer representation F (p, qi) and calculate how close is the answer candidate aj to our answer representation F (p, qi). We then calculate the utility of the post p if it were updated with the answer aj. Finally, we rank the candidate questionsQ by their expected utility given the post p (Eq 4.1). 53 4.4 Training of our answer generator. Given a post pi and its question qi, we generate an answer representation that is not only close to its original answer ai, but also close to one of its candidate answers aj if the candidate question qj is close to the original question qi. . . . . . 56 4.5 Left: Fans computed using a feedfoward neural network over post LSTM p̄ and question LSTM q̄ representations and â computed using average word embeddings over words in the answer. Right: Futil computed using a feedforward neural network over post LSTM p̄, question LSTM q̄ and answer LSTM ā representations. . . . . . . . . 59 4.6 Our LSTM architecture on a post pi. The input layer consists of pre- trained word embeddings of the words in the post which is fed into a single hidden layer. The output ok of each of the hidden states is averaged together to get our neural representation p̄i . . . . . . . . . 60 4.7 Distribution of the count of questions in the intersection of the “valid” annotations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.1 Overview of our GAN-based clarification question generation model. 78 5.2 Results of human judgements on the specificity criteria. . . . . . . . . 96 5.3 Results of human judgements on the usefulness criteria. . . . . . . . . 97 6.1 Sample product description from amazon.com paired with a generic and a specific clarification question. . . . . . . . . . . . . . . . . . . . 104 6.2 Specificity-controlled question generation model. . . . . . . . . . . . . 106 7.1 An example of product description on amazon.com paired with the image of the product. . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 7.2 Example of a question generation model that uses a knowledge base containing attributes of an operating system (or attributes of a toaster) to ask a relevant clarification question. . . . . . . . . . . . . . . . . . 127 xiii 7.3 An example of a writing assistance tool which given a content, iden- tifies the missing information and asks a question about it. . . . . . 
128 7.4 An example conversation with a robot where the robot asks questions to resolve its uncertainty. . . . . . . . . . . . . . . . . . . . . . . . . . 130 7.5 An example scenario where a robot is reading a passage and asking questions to a human to build an understanding of the world. . . . . 131 A.1 Example of the interface shown to annotators on UpWork for anno- tating “best” and “valid” questions, given a post. . . . . . . . . . . . 139 A.2 Task overview shown to annotators on Figure-Eight for the task of evaluating model generated questions. . . . . . . . . . . . . . . . . . . 140 A.3 Instructions shown to annotators on Figure-Eight for the task of eval- uating model generated questions. . . . . . . . . . . . . . . . . . . . . 141 A.4 Rules and tips shown to annotators on Figure-Eight for the task of evaluating model generated questions. . . . . . . . . . . . . . . . . . . 142 A.5 Example annotations shown to annotators on Figure-Eight for the task of evaluating model generated questions. . . . . . . . . . . . . . 143 A.6 Interface shown to the annotators on Figure-Eight for the task of evaluating model generated questions. . . . . . . . . . . . . . . . . . . 144 A.7 Instructions shown to the annotators for the task of comparing the specificity of two questions asked about a product on amazon.com . . 145 A.8 Rules and Tips shown to the annotators for the task of comparing the specificity of two questions asked about a product on amazon.com .145 A.9 Example shown to the annotators for the task of comparing the speci- ficity of two questions asked about a product on amazon.com . . . . . 146 A.10 Interface shown to the annotators for the task of comparing the speci- ficity of two questions asked about a product on amazon.com . . . . . 147 xiv Chapter 1: Introduction 1.1 Motivation An overarching goal of the natural language processing community is to de- velop techniques that would enable machines to process naturally occurring text as accurately as humans do. However, as humans, we may not always understand each other. In pragmatics, Grice’s theory of conversational implicatures (Grice, 1975) says that there is a difference between what someone says and what someone ‘impli- cates’ by uttering a sentence. What someone says is determined by the conventional meaning of the sentence uttered and contextual processes of disambiguation; what she implicates is associated with the existence of some rational principles and max- ims governing conversation. The Gricean maxims of conversation suggest speakers and listeners adhere to a Cooperative Principle where a speaker communicates in- formation that is as informative as required and not more. The speaker assumes a certain common ground or mutual information or shared knowledge with the lis- tener (Clark, 1981; Clark et al., 1991; Clark and Carlson, 1982). In case of gaps or mismatches in knowledge, the listener resorts to asking questions. Correction of such knowledge deficits has been identified as one of the key purposes of asking questions (Graesser et al., 2008). 1 With the advancements of artificial intelligence technologies, automated agents such as text-based or voice-based search engines, interactive robots, automated car navigation systems, etc are becoming increasingly common in our day-to-day lives. We frequently use these bots to search for information or accomplish certain tasks. However, often when a human user’s input to a bot is underspecified i.e. it is missing some information, the bot might fail in its task. 
One key reasons for such failures is the lack of common understanding between the human user and the bot. The human user has a certain understanding of their problem/request and often times she fails to convey the same understanding to the bot. In such a scenario, the bot can be much more useful if it could try to establish this common understanding by asking relevant questions. For example, if we search in a search engine “How long does it take to get a PhD”, then the search engine could in turn ask “In which field?” since the duration of the program would differ according to the field of study. Or if we instruct a robot “Please bring me my coffee mug from the kitchen” and if there are multiple mugs in the kitchen, the robot could in turn ask “What color is your coffee mug?” in order to distinguish our mug from the other mugs in the kitchen. If we wish to make such human-bot interactions as efficient as human-human interactions are, it is important that we teach machines to ask clarification questions when faced with uncertainty or knowledge gaps. We define “clarification question” as a question that asks for some missing information in a given context. In the field of natural language processing, however, despite decades of work on question answering, there has been little work in question asking. Moreover most of the previous work on generating questions has been on generating reading 2 comprehension style questions: given a text, write a question that one might find on a standardized test with the goal of assessing someone’s understanding of the text. Comprehension questions, by definition, are answerable from the provided text. Clarification questions, on the other hand, ask about information that is missing from the given text and hence is not answerable from the text. The goal of my thesis work is to explore how can a machine automatically generate clarification questions when faced with uncertainty or knowledge gaps. More concretely, we define our goal as given a context, we want to automatically generate a question whose answer can fill in the information missing from the given context. 1.2 Specific Scenarios Considered in this Dissertation Text generation has been studied extensively in the field of natural language processing. Tasks such as machine translation, summarization, dialogue generation have achieved varied degrees of success in this field. Most of the successful models have been machine learning models where the models learn from vast amounts of data. For instance, in the task of machine translation, models learn to translate from say French to English by having access to large number of French-English sentence pairs where the French sentence has been translated into English or vice- versa. The French sentence in this case can be considered as the input whereas the English sentence can be considered as the label or the output. A supervised machine learning model will then learn to predict the label (or the output) given the input. Motivated by these successes, in this thesis, we take a supervised learning 3 approach for our task of generating useful clarification questions given a context. Supervised learning approaches need access to large amounts of labelled data or large amounts of input-output pairs (in our setting). We, therefore, approach the problem of clarification question generation under two specific scenarios where we have access to abundant such input-output data online. 
Our first scenario is the generation of clarification questions during trou- bleshooting of technical issues. About a decade ago, if you faced a technical problem the only way to solve it would be to go to an expert. Due to the recent surge in the use of internet, most of the problem solving these days happen online on question answering (Q&A) forums where users post their problems and others provide assis- tance by replying to the posts. However, Asaduzzaman et al. (2013) observed that on StackExchange, which is one such community-driven problem solving platforms, posts often go unanswered for a long time because they are not clear enough i.e. they are missing some information. Consequently, other users ask clarification ques- tions to those posts so that they can better offer assistance to the original poster. For instance, in Figure 1.1, a user posts an issue she is facing while installing a certain software on Ubuntu operating system. Another user on the forum asks for the version of Ubuntu in the comment section of the post suggesting that the ver- sion information could be useful in debugging the issue and hence should have been included in the initial post. In this dissertation, we train a machine learning model that learns to automatically generate a useful clarification question given an under- specified post. We imagine a use case in which while a user is writing their post, a system generates a single (or a shortlist of) question(s) asking for information that 4 Figure 1.1: Sample post paired with a clarification question on StackExchange, an online question-answering forum. it thinks other users on the forum might need to provide a solution, thus enabling the original poster to immediately clarify their post, potentially leading to a much quicker resolution. Our second scenario is the generation of clarification questions during online shopping on e-retail platforms such as amazon.com. With the emergence of internet, people frequently resort to online shopping for buying different products. Often times the description of a product on these e-retail platforms could omit important information that a potential buyer might seek. For example, Figure 1.2 shows the description of a cookware set on amazon.com. A potential buyer asks “Are they ok for induction stove?” in the FAQ section pointing out that this information about the compatibility of the pan with induction stove tops is missing from the current product description. In this dissertation, we train a machine learning model that learns to automatically generate a useful clarification question given a product description. As in the previous scenario, we imagine a use case in which while a 5 Figure 1.2: Sample product description paired with a clarification question on Ama- zon, an online shopping platform. product seller is writing their initial product description, a system generates a single (or a shortlist of) question(s) asking for information that it thinks a potential buyer might need to make a more informed decision about the purchasing of the product, thus enabling the seller to immediately clarify their description. In future, one could also imagine building systems that can in turn answer these auto-generated questions from other similar product descriptions or product reviews. 1.3 Contributions Our first contribution is the creation of two clarification questions dataset. Most previous work on question generation took the approach of transforming state- ments into questions using syntactic rules. 
We instead take a novel approach where we investigate how can we use existing human written questions to train a machine learning model to in turn generate new questions. We hypothesize that given abun- dant amounts of such naturally occurring questions, we can build a machine learning 6 model that can learn useful patterns of question asking and generalize those to new unseen contexts. We build our first dataset for the scenario of technical support. We use existing StackExchange datadump to find posts on which people have asked clarification questions and the author of the post has subsequently answered the question to create our dataset of (context, question, answer) triples. We build our second dataset for the scenario of online shopping (e-retail). We use existing Ama- zon dataset to find product descriptions on which people have asked clarification questions and the product seller (or another user) has answered the question to create our dataset of (context, question, answer) triples. In both the aforementioned scenarios, similar contexts tend to reoccur fre- quently. For instance, under the StackExchange scenario, a post describing the issue with the installation of a certain software X on Ubuntu operating system might have similarities with another post describing the issue with the installation of a different software Y on Ubuntu. Therefore, a question such as “What version of Ubuntu are you using?” previously asked on a certain post could be useful for a new post as well. Similarly, under the Amazon scenario, a kitchen appliance such as toaster might share common features with another appliance such as a sandwich maker. Therefore, a question such as “How long is the cord?” asked about toaster, can be a useful question about sandwich maker as well. This motivates a learning approach that looks at questions asked previously to contexts that are similar to the given context and chooses a question from that candidate set that could be useful to the given context as well. Our second contribution is a novel question ranking model which first extracts a set of candidate questions from a pool of previously 7 asked questions based on the similarity with the given context and then ranks these questions in a way that a question more useful to the given context would be higher up in the ranking. A major limitation of the ranking approach is that it can only reuse the ques- tions already existing in a dataset. It cannot generalize to new unseen scenarios. For example, under the StackExchange scenario, if the previous posts only discuss issues faced when using Ubuntu operating system, the model will not be able to generate a question such as “What version of Windows are you using?”. We hypothesize that we can train a sequence-to-sequence learning model (Sutskever et al., 2014) to generate a clarification question one word at a time, given the context as the input. Our third contribution is a clarification question generation model trained to maximize an answer-based utility function. We use an approach similar to the more recent generative adversarial networks (Goodfellow et al., 2014) to train our model. We observe that humans ask clarification questions at different levels of speci- ficity. For instance, in Figure 1.1, the question “What version of Ubuntu are you using?” is a generic question i.e. it could be useful for many other posts. Whereas, a question such as “Does your bashrc file include the path to the library installa- tion?” is specific to the given post. 
In Figure 1.2, the question “Are they ok for induction stove?” is a question specific to the given product whereas the question “Is there a guarantee or warranty?” is a generic question. We hypothesize that we can guide a machine learning model to generate questions at a desired level of specificity by providing the level of specificity as a signal while training the model. 8 Our fourth contribution is a specificity-controlled question generation model which given a context and a level of specificity (specific or generic), generates a question at that level of specificity. 1.4 Roadmap Chapter 2 presents a background study where we first define “clarification questions” in more detail and discuss their importance. We describe some of the existing works on question generation in the natural language processing literature. We also review the different approaches to question generation including the tradi- tional syntax based methods and the more recent neural network based methods. The chapter also includes a brief review of the neural network models most relevant to this dissertation. Finally, related to our specificity-controlled question generation model, we discuss some of the recent works on generating text controlled for a given style. 1.4.1 Dataset Creation Chapter 3 describes our method for creating the StackExchange and the Ama- zon clarification questions dataset. We begin by discussing the importance of asking clarification questions in these two scenarios. We then describe in detail how we extract the (context, question, answer) triple from the raw data including the pre- processing steps. To create the dataset for the StackExchange scenario, we use the 9 publicly available StackExchange datadump. On StackExchange, users routinely ask clarification questions to post. The author of the post subsequently edits the post answering the question. We use the StackExchange’s edit history to extract the initial post as the “context”, the question asked in the comment section of the post as the “question” and the edit made to the post in response to the question as the “answer” to create our dataset of (context, question, answer) triples. To create the dataset for the Amazon scenario, we repurpose the formally created Amazon product review dataset (McAuley et al., 2015) and the Amazon question-answering dataset (McAuley and Yang, 2016). We extract the product description as the “context”, the question asked by a potential buyer in the FAQ section of the cor- responding product as the “question” and the response given by the seller (or an existing customer) to the question as the “answer” to create our dataset of (context, question, answer) triples. We also include some data analysis. 1.4.2 Question Ranking Model Chapter 4 introduces our novel question ranking model. In our learning model, we represent the words in the context, question and answer using word embeddings (Mikolov et al., 2013; Pennington et al., 2014) which correspond to vector represen- tations of words in some N-dimensional space in a way that words that are closer in meaning would be closer in the vector space. From these word level representations, we obtain the sentence level representation using recurrent neural networks (Hochre- 10 iter and Schmidhuber, 1997; Mikolov, 2010) which perform a series of non-linear transformations on the input word vectors guided by a task specific loss function. 
Such neural network models have recently proven to be effective for several natu- ral language processing tasks such as part-of-speech tagging (Santos and Zadrozny, 2014), dependency parsing (Chen and Manning, 2014), sentiment analysis (Glorot et al., 2011), etc. Our neural network model is inspired by the decision theoretic framework of expected value of perfect information (EVPI). EVPI is a measurement of the value of gathering information. We use EVPI to calculate which question is most likely to elicit an answer that would make the post more informative. Given a context and a set of candidate questions, we rank the questions by their EVPI value. In this chapter, we start by describing the notion of Expected Value of Per- fect Information (EVPI) and then discuss how we model our problem under the EVPI framework. We describe the three components of our model: question & an- swer generator, answer model and utility calculator, and describe the details of our neural network based representations. We discuss our human-based evaluation de- sign and conclude with the results of our experiments on the StackExchange dataset. 1.4.3 Question Generation Model Chapter 5 introduces our novel question generation model. Our question gen- eration model is built on the sequence-to-sequence approach that has proven effective 11 for several language generation tasks (Du et al., 2017; Serban et al., 2016b; Sutskever et al., 2014; Yin et al., 2016). Unfortunately, training a sequence-to-sequence model directly on context/question pairs yields generated questions that are highly generic. For instance, in the context of asking questions about home appliances, these mod- els frequently generate bland questions such “What are the dimensions?” or “Is it made in China?’,’ corroborating a common finding in dialog systems (Li et al., 2016b). Our goal is to be able to generate questions that are useful and specific. Inspired by the idea of expected value of perfect information, we build a model that uses the answer to the generated question to decide the usefulness of the question by measuring the value of updating the context with the answer. We construct a model that first generates a question given a context, and then generates a hypo- thetical answer to that question. Given this (context, question, answer) tuple, we train a utility calculator to estimate the usefulness of this question. We reinterpret the utility value as a reward in reinforcement learning setting and train our model to generate questions that will give us a high reward. Reinforcement learning is one of the learning paradigms under machine learn- ing which unlike supervised learning does not assume access to input-output pairs apriori. Given some input, the model makes predictions and gets a reward for that prediction from the environment. The goal of the model is to maximize its end reward. The use of such a reward based learning strategy relaxes the strong de- pendence on input-output pairs that is otherwise observed in a supervised learning strategy. This is especially attractive in our problem setting since we find that a given context can have multiple useful clarification questions. For instance, in Fig- 12 ure 1.2, a question such as “Is there a guarantee or warranty?” could be useful as well even though the question is very different from the one that the product was paired with in the dataset. 
The use of a reward based learning strategy allows the model to quantify the usefulness of a generated question by its utility value as opposed to its similarity to the question paired with the given context in the dataset. Further, we improve the utility calculator by training it along with our question generation model. We show that the utility calculator can be generalized using ideas for generative adversarial networks (Goodfellow et al., 2014) for text (Yu et al., 2017). A generative adversarial network is a training procedure for “generative” models that can be interpreted as a game between a generator and a discriminator. The goal of the generator is to generate data such that it can fool the discriminator; the goal of the discriminator is to be able to successfully distinguish between real and generated data. In the process of trying to fool the discriminator, the generator produces data that is as close as possible to the real data distribution. In our problem setting, the utility predictor plays the role of the “discriminator” and the question generator is the “generator” and we train our model end-to-end using adversarial training approach. We find that our adversarially-trained model generates questions that are more specific to the context. In this chapter, we begin by describing the sequence-to-sequence neural net- work framework on which we base our question generation model. We then discuss the limitations of such sequence-to-sequence models and motivate the use of utility- function based reward. Further, we describe the generative adversarial training paradigm and discuss how we reinterpret our utility predictor in this adversarial 13 training setting. Finally, we describe our automatic metric-based and human-based evaluation strategy and present results of our experiments on both the StackEx- change and the Amazon dataset. 1.4.4 Specificity-Controlled Question Generation Model Chapter 6 introduces our specificity-controlled question generation model. There has been previous work on generating text with specific stylistic constraints both at the lexical (Edmonds and Hirst, 2002; Inkpen and Hirst, 2006; Kamps et al., 2004; Reiter et al., 2005) and more recently at sentence level (Jhamtani et al., 2017; Xu et al., 2012; ?). Our model is primarily based on the idea of side constraints where the source is appended with an artificial token denoting the style in which we want the model to generate its target. This idea has been used before for control- ling politeness (Sennrich et al., 2016), voice (Yamagishi et al., 2016), and formality (Niu et al., 2017, 2018) in machine translation. In our setting, the side constraint corresponds to the level of specificity. We annotate a set of 3000 questions from the Amazon dataset with their level of specificity using crowdsourcing. Next, we train a model to automatically identify the level of specificity given the context and the question (Louis and Nenkova, 2011). We use this model to append the source context with the level of specificity of the target question. We finally retrain our previously described question generation model with the modified source. At test time, given a context and a level of specificity, our model generates a clarification question at that level of specificity. 14 In this chapter, we begin by describing our method for annotating the ques- tions with their level of specificity using crowdsourcing. 
We then describe our feature based learning model for automatically identifying the level of specificity of clarifi- cation questions. Finally, we describe how we integrate these specificity annotations as side constraints into our question generation model. We present the results of our experiments on the Amazon dataset. 1.4.5 Future Directions Chapter 7 concludes this thesis by summarizing our contributions and dis- cussing the shortcomings of our approaches. We also present some avenues of future work where teaching machines to ask useful questions would be helpful. For in- stance, in the context of goal oriented dialogue, teaching an agent to ask the right questions to a human can help the agent successfully solve a task. In the context of writing assistance, teaching machines to identify important gaps in the content and ask the right questions to the human writer can help the writer fill those informa- tional gaps. Another potential direction is the use of multi-modal inputs to guide machines to ask useful questions. For instance, in the context of robot navigation, teaching a robot to ask the right questions using both visual context (surrounding environment) and textual context (human interaction) can help the robot resolve its uncertainty and thus enable it to navigate more easily in a given environment. The skill of asking the right questions is an important yardstick of human intelligence and therefore teaching machines to ask useful questions can take us a step closer to 15 building intelligent artificial agents. 16 Chapter 2: Background 2.1 Definition and Importance of Clarification Questions We define “clarification question” as a question that asks about some informa- tion “X” that is currently missing from a given context but is essential for someone trying to solve a task or make a decision using the given context. Graesser et al. (1992) identify the four different purposes of questions as correction of knowledge deficits (e.g. information seeking question such as “What is the color of the coffee mug?”), monitoring common ground (e.g. “Are we meeting after lunch today?”), social coordination of action (e.g. “Can you please close the door behind you?”) and control of conversation and attention (e.g. “Hello Sir, how are you doing to- day?”). Our definition of clarification question aligns most with the first purpose: correction of knowledge deficits. We consider this definition of clarification ques- tions in the context of problem solving. This definition is subsumed by the broader definition of clarification questions which includes any questions whose purpose is to eliminate confusion, ambiguity or misunderstanding and seek additional essen- tial information. Clarification questions can be of two types: open questions and closed questions. Open clarifying questions are the ones that take the form of what, when, where, which, how and why questions (e.g. How are you installing this soft- 17 ware? ). Whereas, closed clarifying questions take the form of yes/no questions (e.g. Do you have Powerpoint installed in your computer? ). Clarification questions are sometimes also known as probing questions since they probe the participants of the discussion to give more information on what they said. Asking questions is considered to be central to learning, cognition and ed- ucation. 
Researchers in education and development psychology have found that teaching students to ask probing questions in a classroom setting can help foster their learning (Graesser and Person, 1994; Rosenshine et al., 1996). The art of asking good probing questions requires a deep understanding of the subject matter and the ability to identify what is the essential missing information. Therefore, learning to ask useful questions can help students develop essential skills such as reasoning, problem solving and knowledge building. Adults ask clarification ques- tions often when participating in discussions. Asking clarification questions helps the speaker and the listener establish a common ground which is required for an effective communication. With the advancements of artificial intelligence technologies, we find ourselves interacting more and more with automated agents in our daily lives. In order for these agents to be successfully communicating with humans, it is important that they are able to establish the same mutual common ground with humans. There- fore, it is important that these agents learn how to ask clarification questions to humans when they face with uncertainty. Learning to ask useful questions would also help these agents achieve the same kind of reasoning and understanding abilities as humans (Vanderwende, 2008). 18 Figure 2.1: An example of a reading comprehension passage and a question whose answer can be found in the given passage from Heilman (2011). 2.2 Different Types of Question Generation Work 2.2.1 Reading Comprehension Question Generation The task of question generation is defined as automatically generating a ques- tion, given a context. Most previous work on question generation has been on generating reading comprehension style questions: given text, generate a question whose answer can be found in the given text (Heilman, 2011; Olney et al., 2012; Rus et al., 2011; Vanderwende, 2008). For instance, in Figure 2.1, given the passage highlighted in green, the task is to generate a question such as the one highlighted in red with the goal of assessing someone’s understanding of the given passage. Automatically generating such reading comprehension questions can be helpful in creating standardized tests. In this dissertation, on the other hand, our goal is to generate questions whose answer cannot be found in the given text. Therefore, the challenge in our work is not limited to identifying relevant information from a 19 given text, but requires a broader understanding of the subject at hand and asking questions that can identify important missing information in a given text. 2.2.2 Question Generation in Dialogue Outside reading comprehension, the task of question generation has been stud- ied the most under the context of task oriented dialogue (Grosz and Sidner, 1986). In a task oriented dialogue, a system interacts with a human with the purpose of accomplishing a given task. For instance, consider the task of flight booking. A system interacting with a human user would ask the user a set of questions that would enable the system to book a flight for the human user. Under the context of such task oriented dialogue, the intent of teaching a system to ask questions is to fill some predefined slots (Bobrow et al., 1977; Goddeau et al., 1996; Lemon et al., 2006; Williams et al., 2013; Young et al., 2013). For instance, in travel booking, the slots would include origin city, origin time, airline etc (refer Figure 2.2). 
Correspondingly, the system would generate questions such as "What time do you want to leave?", "Which airline would you prefer?", etc., that would help fill those predefined slots. In contrast, in our work we do not assume access to such predefined slots a priori. The goal is to identify these missing slots implicitly and ask a question about them.

Figure 2.2: An example interaction between a user and a system in the context of travel booking where the system asks questions to fill a set of predefined slots.

Another use case of question generation in dialogue is to resolve ambiguity. For instance, in spoken dialogue, due to error prone automatic speech recognition (ASR) systems, clarifying the intent of the user becomes important. Clark (1996) and Allwood (2000) argue that the aim of clarification questions in human-human dialogue is to resolve misunderstanding at the following levels: securing attention, hearing an utterance, meaning of an utterance and deciding which action is appropriate. Most spoken dialogue systems ask generic clarification questions such as "What did you say?" or "Can you please repeat?" when faced with uncertainty. Stoyanchev et al. (2014), on the other hand, develop a model with the aim of generating more targeted clarification questions. For instance, consider the interaction below:

A: When did the problems with [power] start?
B: The problem with what?
A: Power

Speaker B asks a targeted clarification question instead of merely saying "Please repeat". They present an approach for generating more natural clarification questions using rules based on human behavior. Our work is similar to this work in that our goal is also to generate more natural clarification questions. However, in our work, the ambiguity in the original intent is not because of a failure of the ASR system but because of a piece of information that is missing from the given context. Hence, we aim to generate clarification questions that point at the missing information instead of those that point at information that is unclear in the given context. Following Stoyanchev et al. (2014), there have been other similar works such as recognizing intention through clarification dialogue (Trott et al., 2016) and entity disambiguation through clarification dialogue (Coden et al., 2015). Our work is similar to these in that we also aim to generate questions to better understand the original intent. But our work aims to resolve such ambiguity at a more general level by asking for missing information instead of specifically disambiguating an intent or an entity.

2.2.3 Question Generation for Text Understanding

Liu et al. (2010) propose a question generation model that generates trigger questions as a form of support for students' learning through writing. For instance, if a student is writing a related work section, then their system would generate questions that would help the student augment their writing with supporting arguments. Figure 2.3 shows an example use case of this system where, when the author writes an argument, the system generates questions that encourage the author to provide more reasoning for the argument. Our work is similar to this work in that our aim is also to augment the given context with additional informational content. However, the intent of generating questions in our scenario is to resolve uncertainty or to fill informational gaps rather than to help the author improve their understanding of the original text.
Figure 2.3: An example of an argument followed by questions that encourage further explanation of the argument, from Liu et al. (2010).

Figure 2.4: Example of a system asking a series of questions to simplify the original user query, from Artzi and Zettlemoyer (2011).

Artzi and Zettlemoyer (2011) use human-generated clarification questions to drive a semantic parser, where the clarification questions are aimed towards simplifying a user query. For instance, consider the user query shown in Figure 2.4. In order to simplify the complex query, the system in turn asks the user follow-up questions that help the system parse the original user query more easily. Our work departs from this work in that we generate questions to fill some missing information in a given text instead of generating questions that reiterate something that was already stated before.

2.2.4 Visual Question Generation

Most previous work at the intersection of language and image processing has been in caption generation where, given an image, the goal is to generate a caption that explains the image. For instance, given an image such as the top image in Figure 2.5, the goal would be to generate a caption such as "A man and a woman standing next to a fallen motorcycle". However, recently Mostafazadeh et al. (2016) introduced the visual question generation task where the goal is to generate natural and engaging questions about an image. For instance, for the top image in Figure 2.5, the goal is to generate questions such as "Was anyone injured?", similar to what a human would think about when they look at this image. Somewhat similar to clarification questions in our work, these questions do not ask about something that is already present in the image but rather ask about something that can be inferred from the given image. Following this work, Mostafazadeh et al. (2017) introduced an extension of this task called the Image Grounded Conversation task where they use both the image and some initial textual context to generate a natural follow-up question and a response to that question. Our work departs from these works in that, given a context, we assume there is a goal to be accomplished using the given information (which is more specific than, say, the broader goal of image understanding suggested perhaps by Mostafazadeh et al. (2016)). Therefore, the questions generated by our work aim at asking for information that can help someone achieve that goal faster.

Figure 2.5: Example from the Visual Question Generation task (Mostafazadeh et al., 2016) and the Image Grounded Conversation task (Mostafazadeh et al., 2017).

2.2.5 Question Refinement to Help Question-Answering

The task of question-answering can be defined as: given a question, retrieve (or generate) an answer to the question from a document (or a set of documents) or a database. Previous work in question-answering finds that retrieving the correct answer can largely depend on the way the question is asked. Therefore, there has been work on refining a given question with the aim of improving the accuracy of a question-answering system. For instance, the keywords to questions (K2Q) system (Zheng et al., 2011) generates a list of candidate questions and refinement words, given a set of input keywords, to help a user ask a better question. Figueroa and Neumann (2013) rank different paraphrases of a query for effective search on forums. Romeo et al.
(2016) develop a neural network based model for ranking questions on forums with the intent of retrieving other similar questions. Buck et al. (2017) propose an active question answering model where they build an agent that learns to reformulate the question to be asked to a question-answering system so as to elicit the best possible answers. As a future direction, one could imagine building a system complementary to our work which can automatically answer the questions generated by our system with the help of previous related contextual information.

2.3 Approaches to Question Generation

In this section, we describe the major approaches to question generation explored by the natural language processing community.

2.3.1 Syntactic Rule based Methods

Given that most previous work on question generation has been on reading comprehension style question generation, the task of question generation then turns out to be: given a sentence (or a text), transform the sentence into a question. For instance, given the statement "John met Sally", such a system would generate "Who met Sally?", "Who did John meet?" and "Did John meet Sally?". One way to achieve this would be to identify named entities or adjunct roles in the statement and map them to the appropriate wh-question. For instance, the sentence "Albert Einstein developed the theory of relativity." may be transformed into the question "Who developed the theory of relativity?" by mapping the Person named entity "Albert Einstein" to the question type "Who". Enumerating rules for such wh-movement based transformations can sometimes be challenging, especially in the English language. For instance, the sentence "James Madison, following Thomas Jefferson, was elected as the 4th president of United States." should be transformed into "Following Thomas Jefferson, who was elected as the 4th president of United States?" instead of the awkward transformation "Who, following Thomas Jefferson, was elected as the 4th president of United States?"

Heilman (2011) proposes a three step approach for factoid question generation where they first extract a set of factual statements from complex input texts, transform the factual statements into candidate questions and then rank them such that a better question is higher up in the ranking. Their system uses semantic entailment and presupposition for the extraction of sets of simplified factual statements from embedded constructions in complex input sentences. Given the simplified statements, they identify the answer phrases that may be targets for wh-movement and convert them into question phrases. Lastly, they use a feature-based linear regression model to rank the candidate questions.

Rus et al. (2010, 2011) introduced the question generation shared task where the task is defined as generating a question from a paragraph, or from a sentence, such that the answer to the question can be found in the corresponding paragraph or sentence. The systems submitted to these tasks mainly used handcrafted rules and features for generating questions (Ali et al., 2010; Kalady et al., 2010). Under template based methods, Chen (2009) generates questions from a knowledge structure by filling templates of the form "Why/How did ...?". Olney et al. (2012) generate questions from a knowledge representation modeled as a concept map. Labutov et al. (2015) generate high-level question templates by crowdsourcing and, given a text segment, rank the question templates that are relevant.
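As a rough illustration of the named-entity-to-wh-word mapping described earlier in this subsection, the sketch below turns a simple declarative sentence with a person subject into a "Who" question. It assumes spaCy (with its small English model installed) purely as an example NER toolkit; actual systems such as Heilman (2011) rely on much richer syntactic transformations.

```python
import spacy  # assumed available, with the "en_core_web_sm" model installed

nlp = spacy.load("en_core_web_sm")

def person_subject_to_who_question(sentence):
    """Rough sketch: if the sentence starts with a PERSON entity, replace it with "Who"."""
    doc = nlp(sentence)
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.start == 0:  # sentence-initial person subject
            rest = doc[ent.end:].text.rstrip(".")
            return f"Who {rest}?"
    return None

print(person_subject_to_who_question("Albert Einstein developed the theory of relativity."))
# -> "Who developed the theory of relativity?"
```

Even this toy rule already breaks on the "James Madison" example above, which is exactly the brittleness that motivates the learned approaches discussed next.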
The crowdsourcing method of Labutov et al. (2015) for collecting data, however, yields significantly less data than we collect using our method.

2.3.2 Neural Network based Methods

Sequence-to-sequence neural network models (explained in detail in §2.4) have had significant success at a variety of text generation tasks, including machine translation (Bahdanau et al., 2015; Luong et al., 2015), summarization (Nallapati et al., 2016), dialog (Bordes and Weston, 2016; Li et al., 2016a; Serban et al., 2016b, 2017), textual style transfer (Jhamtani et al., 2017; Kabbara and Cheung, 2016; Rao and Tetreault, 2018) and question answering (Serban et al., 2016b; Yin et al., 2016). The key idea behind these sequence-to-sequence approaches is that, given large amounts of input-output sequence pairs, the model learns internal representations such that at test time, given an input sequence, it generates the appropriate output sequence.

Recently, there has been work on generating reading comprehension style questions using such neural network models. Serban et al. (2016a) created a large (30 million) factoid question-answering dataset by transforming facts in Freebase into natural language questions. Their question generation model was inspired by the well-known attention-based encoder-decoder model (Luong et al., 2015) used for machine translation. Duan et al. (2017) extract a large number of question-answer pairs from community question answering forums and use them to train an attention-based sequence-to-sequence learning approach to generate challenging questions for the reading comprehension task. They find that their approach outperforms previous rule-based question generation approaches when evaluated using automatic metrics and human judgments. Du et al. (2017) propose an attention-based encoder-decoder model for generating questions from text passages and show that humans find the questions generated by their model to be more natural and more difficult to answer compared to rule-based systems.

There has also been work on using neural networks for building question generation models that in turn assist question answering. Yuan et al. (2017) use a sequence-to-sequence learning approach to generate natural language questions from documents, conditioned on answers. Their question generation model maximizes a reward which is defined by the performance on a downstream question answering task. Sachan and Xing (2018) propose a self-training method for jointly learning to ask and answer questions. Their model is also based on sequence-to-sequence learning with soft attention. Tang et al. (2018) use a generative adversarial network (GAN) based approach for jointly learning the tasks of question answering and question generation. Under visual question generation, Mostafazadeh et al. (2016) propose a neural network based approach for question generation where they process the image input using a convolutional neural network (CNN) and the text input using a recurrent neural network (RNN). Li et al. (2018) propose to jointly train the two tasks of visual question generation and visual question answering using recurrent neural networks.

In our work, we use a sequence-to-sequence based neural network to generate a clarification question, given a textual context. In Chapter 5, we describe our question generation model where we begin with a maximum-likelihood based training approach followed by a reinforcement learning based training.
Our final model uses a generative adversarial training approach to train a sequence-to-sequence based neural network model.

2.4 Relevant Neural Network Models

Applying machine learning algorithms to natural language data requires transforming text into a numeric representation as a first step. Until recently, the dominant approach to learning such representations has been the use of hand-crafted features that are developed based on the task at hand. Deep learning or neural network modeling (Goodfellow et al., 2016) allows us to automatically learn representations of text without requiring feature engineering. In this section, we give an overview of the neural network models used in this dissertation.

2.4.1 Feedforward neural network

Feedforward neural networks or multilayer perceptrons are functions that perform a series of nonlinear transformations on a given input vector to obtain an output. The term feedforward comes from the fact that in these models, information flows from the input into intermediate computations and finally to the output. There are no feedback connections in which the output is fed back into the model. In our work, we use a feedforward neural network to compute a value between 0 and 1, given an input vector.

In Figure 2.6, the input layer consists of the input vector (x = {x_1, x_2, ..., x_n}) of n dimensions, the hidden layers consist of hidden units (h_i) and the output layer consists of a single output unit y. We use a fully connected feedforward neural network consisting of K hidden layers where each hidden unit h_i in layer l_k is connected to each of the units in the next hidden layer l_{k+1}. Each of the connections corresponds to a nonlinear transformation such as a tanh:

h_i^k = tanh(w_i^k · o^{k-1} + b_i^k)    (2.1)
o_i^k = tanh(h_i^k)    (2.2)
y = sigmoid(o_1^K)    (2.3)

where w_i^k = {w_{i1}^k, w_{i2}^k, ..., w_{ir_k}^k}, o^k = {o_1^k, o_2^k, ..., o_{r_k}^k}, and
w_{ij}^k : weight for hidden unit h_j^k in layer l_k from incoming hidden unit h_i^{k-1} in layer l_{k-1}
b_i^k : bias for hidden unit i in layer l_k
h_i^k : hidden unit i in layer l_k
o_i^k : output of hidden unit i in layer l_k
r_k : number of hidden units in layer l_k

Figure 2.6: A fully connected feedforward neural network with an input layer, two hidden layers (with four hidden units each) and an output layer.

This network is trained (i.e. the weights w and the biases b are learned) using backpropagation to minimize a loss such as the cross-entropy loss between all (x, y) pairs in the training data.
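As a small, hedged illustration of Eqs 2.1–2.3, the sketch below computes such a network's output with plain NumPy. The weight and bias lists are assumed to be given, and each hidden layer applies a single tanh (collapsing Eqs 2.1–2.2 into one step); it is not the dissertation's actual implementation.

```python
import numpy as np

def feedforward(x, hidden_weights, hidden_biases, w_out, b_out):
    """Minimal sketch of the fully connected network of Section 2.4.1.

    Each hidden layer applies a tanh nonlinearity and the single output unit
    applies a sigmoid, so the result lies between 0 and 1 (Eq 2.3)."""
    o = np.asarray(x, dtype=float)
    for W, b in zip(hidden_weights, hidden_biases):
        o = np.tanh(W @ o + b)          # output of one hidden layer
    z = w_out @ o + b_out               # single output unit
    return 1.0 / (1.0 + np.exp(-z))     # value between 0 and 1

# illustrative call with random parameters: 5-dim input, two hidden layers of 4 units
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 5)), rng.normal(size=(4, 4))]
bs = [np.zeros(4), np.zeros(4)]
y = feedforward(rng.normal(size=5), Ws, bs, rng.normal(size=4), 0.0)
```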
2.4.2 Recurrent neural network

The feedforward neural network described before cannot make use of the sequential information present in language. It processes each word in a sentence independently without considering the dependencies between those words. However, in language, words in a sentence are related to each other. For instance, to predict the next word in a sentence, we need to look at the previous words. Recurrent neural networks (RNNs) allow us to capture these dependencies (Hopfield, 1982; LeCun et al., 1990). Given an input sequence (x_1, x_2, x_3, ..., x_n), an RNN reads the input from left to right and computes a hidden state h_t at each timestep t. The hidden state is computed using both the input at the current timestep x_t and the hidden state from the previous timestep h_{t-1}:

h_t = σ_h(W_h x_t + U_h h_{t-1} + b_h)    (2.4)
o_t = σ_y(W_y h_t + b_y)    (2.5)

where
x_t : input vector
h_t : hidden layer vector
o_t : output vector
W_h, U_h, W_y : weight matrices (parameters)
b_h, b_y : biases (parameters)
σ_h, σ_y : nonlinear activation functions such as tanh

The model is trained using a variant of backpropagation called backpropagation through time. RNNs suffer from the issue of vanishing gradients when the input sequences are very long. Long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997) are a variant of RNNs that try to overcome this issue by having an extended memory which allows them to remember inputs over a long period of time.

Figure 2.7: Recurrent neural network operating over the input sequence x one word at a time.

2.4.3 Sequence-to-sequence neural network

We now describe the sequence-to-sequence learning model (Sutskever et al., 2014). Given an input sequence x = (x_1, x_2, ..., x_N), this model generates an output sequence y = (y_1, y_2, ..., y_T). The architecture of this model is an encoder-decoder with attention. The encoder is a recurrent neural network (RNN) operating over the input word embeddings to compute a source representation S. The decoder uses this source representation to generate the target sequence one word at a time:

p(y | S) = p(y_1, y_2, ..., y_T | S) = ∏_{t=1}^{T} p(y_t | y_1, y_2, ..., y_{t-1}, S)    (2.6)

In the above equation, the chain rule permits the calculation of the joint distribution of the output token probabilities using the product of the individual output token probabilities. The predicted token y_t is the token in the vocabulary that is assigned the highest probability using a softmax function. The standard training objective for the sequence-to-sequence model is to maximize the log-likelihood of all (x, y) pairs in the training data D.

Figure 2.8: Sequence-to-sequence learning model which takes in an input sequence and generates an output sequence one word at a time.
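To make the factorization in Eq 2.6 concrete, the following sketch performs greedy decoding under it. The `decoder_step` callable is a stand-in for a learned decoder (with attention) and is an assumption of this sketch, not part of the models described later in this dissertation.

```python
def greedy_decode(source_repr, decoder_step, start_token="<s>", end_token="</s>", max_len=40):
    """Greedily generate a sequence under the factorization of Eq 2.6.

    decoder_step(prefix, source_repr) is assumed to return a dict mapping each
    vocabulary token to p(y_t | y_1, ..., y_{t-1}, S)."""
    prefix = [start_token]
    for _ in range(max_len):
        probs = decoder_step(prefix, source_repr)   # distribution over next tokens
        next_token = max(probs, key=probs.get)      # highest-probability token
        if next_token == end_token:
            break
        prefix.append(next_token)
    return prefix[1:]  # generated tokens, without the start symbol
```

In practice the decoder would be an LSTM with attention over the encoder states, and beam search is often used in place of the greedy choice above.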
2.5 Generating Text with Stylistic Variations

In the field of natural language processing, the task of automatically generating text in a particular style has been studied both using parallel data and without using parallel data. Under style transfer using parallel data, Sheikha and Inkpen (2011) collect pairs of formal and informal words and phrases from different sources and use a natural language generation system to generate informal and formal texts by replacing lexical items based on user preferences. Xu et al. (2012) were among the first to treat style transfer as a sequence to sequence task. They generate a parallel corpus of 30K sentence pairs by scraping the modern translations of Shakespeare plays and train a phrase-based machine translation system to translate from modern English to Shakespearean English. More recently, Jhamtani et al. (2017) show that a copy-mechanism enriched sequence-to-sequence neural model outperforms Xu et al. (2012) on the same set. In text simplification, the availability of parallel data extracted from English Wikipedia and Simple Wikipedia (Zhu et al., 2010) led to the application of phrase-based machine translation (Wubben et al., 2012) and, more recently, neural network based machine translation (Wang et al., 2016) models.

Under style transfer without using parallel data, Hu et al. (2017) control the sentiment and the tense of the generated text by learning a disentangled latent representation in a neural generative model. Ficler and Goldberg (2017) control several linguistic style aspects simultaneously by conditioning a recurrent neural network language model on specific style (professional, personal, length) and content (theme, sentiment) parameters. There has also been work on controlling style in neural machine translation. Sennrich et al. (2016) control the politeness of the translated text via side constraints, reporting a BLEU improvement of 3.2 points. Niu et al. (2017) control the level of formality of machine translation output by selecting phrases of the requisite formality level from the k-best list during decoding. They find that the best BLEU scores are obtained when the level of formality given as input to the machine translation system matches the nature of the text being translated. In the field of text simplification, more recently, Xu et al. (2016) learn large-scale paraphrase rules using bilingual texts whereas Kajiwara and Komachi (2016) build a monolingual parallel corpus using sentence similarity based on alignment between word embeddings.

In our work, we use a semi-supervised approach to generating text in a given style. When training our model, we append to the source a special token indicative of the style of the target sentence. These tokens are embedded into the source sentence representation and control target sequence generation via the attention mechanism. Sennrich et al. (2016) similarly append a special token to the source text to distinguish between the familiar (Latin Tu) and the polite (Latin Vos) second person pronoun in the German output. Johnson et al. (2017) and Niu et al. (2018) concatenate parallel data of various language directions and mark the source with the desired output language to perform multilingual or bi-directional NMT. Kobus et al. (2017) and Chu et al. (2017) add domain tags for domain adaptation in neural machine translation. Mima et al. (1997) improve rule-based machine translation by using extra-linguistic information such as the speaker's role and gender. Lewis et al. (2015) and Niu and Carpuat (2016) equate style with domain, and train conversational machine translation systems by selecting in-domain (i.e. conversation-like) training data. Similarly, Wintner et al. (2017) and Michel and Neubig (2018) take an adaptation approach to personalize machine translation with gender-specific or speaker-specific data.

In summary, in this chapter we present a background study where we first define clarification questions and state their importance in human communication and therefore in human-computer interactions. We discuss previous work in the general area of question generation and briefly explain the major previous approaches to question generation. We also introduce the major neural network models that we use in our work. Finally, we discuss previous work related to text generation with stylistic variations.

Chapter 3: Dataset Creation

3.1 StackExchange Dataset

StackExchange is a network of online question answering websites about varied topics like Academia, the Ubuntu operating system, LaTeX, etc. The sites are modeled after StackOverflow, a popular platform used for asking and answering questions on a wide range of topics in computer programming. On this platform, users frequently post issues they are facing with a particular topic and other users on the forum help resolve the issue. For instance, in Figure 3.1, a user posts an issue they are facing with installing an application on the Ubuntu operating system.
Another user then asks a clarification question in the comments section asking for the version of Ubuntu, suggesting that this information is important for resolving the issue. The author subsequently comes back and edits the original post, adding the version information.

Figure 3.1: Example of a post on askubuntu.com where a user asked a clarification question in the comments section of the post, following which the author of the post edited the post adding the missing information pointed out by the clarification question.

The data dump of StackExchange contains timestamped information about the posts, comments on the posts and the history of the revisions made to each post. We use this data dump to create our dataset of (post, question, answer) triples, where the post is the initial unedited post, the question is the comment containing a question and the answer is either the edit made to the post after the question or the author's response to the question in the comments section. (We use data from StackExchange; per license cc-by-sa 3.0, the data is "intended to be shared and remixed" with attribution.)

Extract posts: We use the post histories to identify posts that have been updated by their authors. We use the timestamp information to retrieve the initial unedited version of each post.

Extract questions: For each such initial version of a post, we use the timestamp information of its comments to identify the first comment made to the post. If the comment contains a question mark '?', we truncate the comment up to the question mark '?' to retrieve the question part of the comment.

Filtering out questions: We find that about 7% of the questions are rhetorical questions that indirectly suggest a solution to the post, e.g. "have you considered installing X?". We do a manual analysis of these non-clarification questions and hand-craft a few rules to remove them. We filter out questions that indirectly suggest a solution by ignoring questions that start with one of these phrases: "have you", "did you try", "can you try" or "could you try". We also ignore questions that contain one of the following words: 'duplicate', 'upvote', 'downvote', 'vote', 'related', 'upvoted', 'downvoted' or 'edit'. We ignore questions that contain more than 20 tokens. Questions often start with "@username" when they are directed at a specific user. In these cases, we remove the initial part of the question corresponding to "@username".

Extract answers: We extract the answer to a clarification question in the following two ways:

1. Edited post: Authors tend to respond to a clarification question by editing their original post and adding the missing information. In order to account for edits made for other reasons like stylistic updates and grammatical corrections, we consider only those edits that are longer than four words. Authors can make multiple edits to a post in response to multiple clarification questions. To identify the edit corresponding to the given question comment, we choose the edit closest in time following the question.

2. Response to the question: Authors also respond to clarification questions as subsequent comments in the comments section (see Figure 3.2). We extract the first comment by the author following the clarification question as the answer to the question.

Figure 3.2: Example of a post on askubuntu.com where a user asked a clarification question in the comments section of the post and the author of the post answered the question as a subsequent comment.

In cases where both of the above methods yield an answer, we pick the one that is most semantically similar to the question, measuring similarity as the cosine similarity between the average word embeddings of the question and the answer.
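A minimal sketch of this tie-break is given below; the embedding lookup and the tokenized inputs are assumptions of the sketch, not a description of the released extraction code.

```python
import numpy as np

def pick_answer(question_tokens, candidate_answers, word_vectors):
    """Choose the candidate answer whose average word embedding is most similar
    (by cosine similarity) to that of the question.

    candidate_answers: list of token lists (e.g. the post edit and the comment reply).
    word_vectors: mapping from token to embedding vector (e.g. pretrained GloVe)."""
    dim = len(next(iter(word_vectors.values())))

    def avg_vec(tokens):
        vecs = [word_vectors[t] for t in tokens if t in word_vectors]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def cosine(u, v):
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        return float(u @ v) / denom if denom else 0.0

    q_vec = avg_vec(question_tokens)
    return max(candidate_answers, key=lambda ans: cosine(q_vec, avg_vec(ans)))
```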
          Train    Tune    Test
askubuntu 19,944   2,493   2,493
unix      10,882   1,360   1,360
superuser 30,852   3,857   3,856

Table 3.1: Sizes of the train, tune and test splits of our dataset for the three domains.

We extract a total of 77,097 (post, question, answer) triples across three domains in StackExchange (Table 3.1). Although StackExchange consists of many sites, we choose the ones above because: a) the data dumps available for them were moderately big in size, enough to train a model on; b) these domains contain clarification questions that are generic enough to be useful for many different posts; and finally c) the three domains were close enough that we could combine them and train on a larger dataset.

3.2 Analysis of StackExchange Dataset

How often are extracted questions clarifications? A natural question about our data creation process is how often the extracted question is in fact a clarification question. We sample a set of 1000 questions from our dataset and design a crowdsourced task on Figure-Eight (www.figureeight.com) where, given a question, we ask annotators to choose whether the question was: (a) asking for more information; (b) providing an answer or a suggestion; or (c) neither. We collect three annotations per question. We find that 91% of the questions were marked with option (a), 7% with option (b) and 2% with option (c). These numbers suggest that a large portion of the extracted questions are indeed "clarification questions". Additionally, we analyze the questions marked as "providing a solution" and find that the majority of these started with one of the following phrases: "have you", "did you try", "can you try", "could you try". We preprocess our dataset to remove all such instances.

How useful are clarification questions? A clarification question is useful if it helps in generating an answer for a given post. Imagine a scenario in which a post goes unanswered for some time. Following this, a clarification question gets asked on this post and then the post gets an answer. Such a scenario helps showcase the usefulness of clarification questions. We estimate this usefulness by calculating the following two probabilities for posts that have not received an answer within a week:

Pr(A|CQ) = #(A|CQ) / (#(A|CQ) + #(¬A|CQ))
Pr(A|¬CQ) = #(A|¬CQ) / (#(A|¬CQ) + #(¬A|¬CQ))

where:
#(A|CQ) : # answered posts with a clarification question
#(¬A|CQ) : # unanswered posts with a clarification question
#(A|¬CQ) : # answered posts without a clarification question
#(¬A|¬CQ) : # unanswered posts without a clarification question

Table 3.2 shows these probabilities for the three data domains. We can see that, overall, the likelihood of a post getting an answer with a clarification question is higher than the likelihood of a post getting an answer without a clarification question.

Yes/No clarification questions: We argue in the introduction of Chapter 4 that asking a question like "What version of Ubuntu do you have?" is more useful than asking a more specific question that might yield a Yes/No answer. This raises the question of how many clarification questions in our dataset are Yes/No questions. We manually inspect 100 randomly selected clarification questions in our dataset and find that 13 of them were Yes/No questions. This suggests that users on these forums tend to ask questions that are generic enough to elicit a useful answer rather than overly specific questions.
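As a brief aside, the usefulness estimate defined above is just a pair of conditional relative frequencies over posts that went unanswered for a week; a minimal sketch (data structure assumed):

```python
def usefulness_probabilities(posts):
    """posts: iterable of (has_clarification_question, got_answer) boolean pairs,
    one per post that had no answer within a week.
    Returns (Pr(A|CQ), Pr(A|not CQ)) as defined in Section 3.2."""
    counts = {(True, True): 0, (True, False): 0, (False, True): 0, (False, False): 0}
    for has_cq, answered in posts:
        counts[(has_cq, answered)] += 1
    pr_a_cq = counts[(True, True)] / (counts[(True, True)] + counts[(True, False)])
    pr_a_nocq = counts[(False, True)] / (counts[(False, True)] + counts[(False, False)])
    return pr_a_cq, pr_a_nocq
```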
Multiple clarification questions: On analysis, we find that 35%–40% of the posts get asked multiple clarification questions. We include only the first clarification question to a post in our dataset, since identifying whether the following questions are clarifications or part of a dialogue is non-trivial.

           askubuntu  unix  superuser
Pr(A|CQ)      0.82    0.85    0.45
Pr(A|¬CQ)     0.77    0.80    0.34

Table 3.2: Likelihood of a post getting answered with and without a clarification question.

3.3 Amazon Dataset

Amazon (amazon.com) is an online shopping platform where product sellers post descriptions of their products and users buy them online. Often, when the given description is missing some important information, users ask questions in the frequently-asked-questions section of the product. For instance, Figure 3.3 shows the description of a cookware set under the "Home & Kitchen" category of amazon.com and a clarification question that asks if the cookware set is induction safe (i.e. works on an induction stove).

McAuley and Yang (2016) introduced the Amazon question-answering dataset where each instance consists of a question asked about a product on amazon.com combined with other information (product ID, question type "Yes/No", answer type, answer and answer time). We extract the product ID, question and answer from this dataset. To obtain the description of the product, we use the Amazon reviews dataset (McAuley et al., 2015) which includes product ID and product description. We consider at most 10 questions for each product. This dataset includes several different product categories. We choose the Home and Kitchen category since it contains a high number of questions. This dataset consists of 19,119 training, 2,435 validation and 2,305 test examples, and each product description contains between 3 and 10 questions (average: 7).

Figure 3.3: Example of a product description on amazon.com followed by a clarification question and an answer to the question.

Chapter 4: Question Ranking Model

4.1 Introduction

In this chapter we describe our model for ranking clarification questions. A principal goal of asking questions is to fill information gaps, typically through clarification questions. We take the perspective that a good question is one whose likely answer will be useful. Consider the exchange in Figure 4.1, in which an initial poster (who we call "Terry") asks for help configuring environment variables. This post is underspecified and a responder ("Parker") asks a clarifying question (a) below, but could alternatively have asked (b) or (c):

(a) What version of Ubuntu do you have?
(b) What is the make of your wifi card?
(c) Are you running Ubuntu 14.10 kernel 4.4.0-59-generic on an x86_64 architecture?

Parker should not ask (b) because an answer is unlikely to be useful; they should not ask (c) because it is too specific and an answer like "No" or "I do not know" gives little help. Parker's question (a) is much better: it is both likely to be useful, and is plausibly answerable by Terry.

Figure 4.1: A post on an online Q & A forum "askubuntu.com" is updated to fill the missing information pointed out by the question comment.

In this work, we design a model to rank a candidate set of clarification questions by their usefulness to the given post. We imagine a use case (more discussion in §4.6)
in which, while Terry is writing their post, a system suggests a shortlist of questions asking for information that it thinks people like Parker might need to provide a solution, thus enabling Terry to immediately clarify their post, potentially leading to a much quicker resolution.

To develop our model we take inspiration from the decision-theoretic framework of the Expected Value of Perfect Information (EVPI) (Avriel and Williams, 1970), a measure of the value of gathering additional information. In our setting, we use EVPI to calculate which question is most likely to elicit an answer that would make the post more informative (§4.2). Formally, for an input post p, we want to choose a question q that maximizes E_{a|p,q}[U(p + a)], where a is a hypothetical answer and U is a function measuring the utility of post p if a were to be added to it. To achieve this, we construct two models: (1) an answer model, which estimates P[a | p, q], the likelihood of receiving answer a if one were to ask question q on post p (§4.2.2); (2) a utility calculator, U(p), which measures the utility of the post (§4.2.3). Given these two models, at prediction time we search over a shortlist of possible questions for the one that maximizes the EVPI.

Figure 4.2: We formulate our ranking problem as: given a post, first extract a set of ten candidate questions and then rank them such that a more useful question would be higher up in the ranking.

We formulate this task as a ranking problem where, given a post and a list of candidate questions, the task is to rank the questions such that a more useful question would be higher up in the ranking (refer Figure 4.2). The candidate list includes the "original" question asked to the post and nine other questions that we extract from posts that are similar to the given post. (Henceforth we refer to the question paired with the post as the "original" question.) Note that this setting is different from the distractor-based setting popularly used in dialogue (Lowe et al., 2015) in that the nine other questions can include a good question.

We train our answer model and our utility calculator jointly based on (p, q, a) triples that we extract from StackExchange (§3.1), using its edit history (Figure 4.1). In the figure, the initial post fails to state what version of Ubuntu is being run. In response to Parker's question in the comments section, Terry, the author of the post, edits the post to answer Parker's clarification question. Terry might also choose to answer the clarification question in a subsequent comment. We extract the initial post p, the question posted in the comments section q, and the edit to the original post or the comment following the clarification question comment as the answer a to form our (p, q, a) triples.

We evaluate our models using human judgments that we collect on Upwork (https://www.upwork.com). We ask annotators to select what they thought was the single best question to ask, and additionally mark as "valid" any other questions that they thought would be okay to ask (§4.3). We evaluate models both on the task of returning the original clarification question and also on the task of picking any of the candidate clarification questions marked as good by experts. We find that our EVPI model outperforms the baseline models when evaluated against expert human annotations. We include a few examples of human annotations along with our model performance on them in §4.5. We have released our dataset of ∼77K (p, q, a) triples and the expert annotations on 500 triples to help facilitate further research in this task (https://github.com/raosudha89/ranking_clarification_questions).
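Before formalizing the model, note that the ranking criterion can be read directly off the expectation above. The sketch below is only an illustration: `answer_prob` and `utility` are placeholders for the learned answer model and utility calculator described in §4.2, not the actual implementation.

```python
def evpi_rank(post, candidate_questions, candidate_answers, answer_prob, utility):
    """Rank candidate questions by their expected value of perfect information.

    answer_prob(post, q, a): stand-in for the answer model's estimate of P[a | post, q]
    utility(post, a):        stand-in for the utility calculator's estimate of U(post + a)
    """
    def evpi(q):
        return sum(answer_prob(post, q, a) * utility(post, a)
                   for a in candidate_answers)

    return sorted(candidate_questions, key=evpi, reverse=True)  # most useful first
```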
4.2 Model description

We build a neural network model inspired by the theory of expected value of perfect information (EVPI). EVPI is a measure of: if I were to acquire information X, how useful would that be to me? However, because we have not acquired X yet, we have to take this quantity in expectation over all possible X, weighted by each X's likelihood. In our setting, for any given question q_i that we can ask, there is a set A of possible answers that could be given. For each possible answer a_j ∈ A, there is some probability of getting that answer, and some utility if that were the answer we got. The value of this question q_i is the expected utility, over all possible answers:

EVPI(q_i | p) = Σ_{a_j ∈ A} P[a_j | p, q_i] U(p + a_j)    (4.1)

In Eq 4.1, p is the post, q_i is a potential question from a set of candidate questions Q and a_j is a potential answer from a set of candidate answers A. Here, P[a_j | p, q_i] measures the probability of getting an answer a_j given an initial post p and a clarifying question q_i, and U(p + a_j) is a utility function that measures how much more complete p would be if it were augmented with answer a_j. The modeling question then is how to model: 1. the probability distribution P[a_j | p, q_i] and 2. the utility function U(p + a_j). In our work, we represent both using neural networks over the appropriate inputs. We train the parameters of the two models jointly to minimize a joint loss defined such that an answer that has a higher potential of increasing the utility of a post gets a higher probability.

Figure 4.3 describes the behavior of our model at test time. Given a post p, we generate a set of candidate questions and a set of candidate answers (§4.2.1). Given a post p and a question candidate q_i, we calculate how likely this question is to be answered using one of our answer candidates a_j (§4.2.2). Given a post p and an answer candidate a_j, we calculate the utility of the updated post, i.e. U(p + a_j) (§4.2.3). We compose these modules into a joint neural network that we optimize end-to-end over our data (§4.2.4).

4.2.1 Question & answer candidate generator

Given a post p, our first step is to generate a set of question and answer candidates. One way that humans learn to ask questions is by looking at how others ask questions in a similar situation. Using this intuition, we generate question candidates for a given post by identifying posts similar to the given post and then looking at the questions asked to those posts. For identifying similar posts, we use Lucene (https://lucene.apache.org/), a library extensively used in information retrieval for extracting documents relevant to a given query from a pool of documents. Lucene implements a variant of the term frequency-inverse document frequency (TF-IDF) model to score the extracted documents according to their relevance to the query. We use Lucene to find the top 10 posts most similar to a given post from our dataset (§3.1). We consider the questions asked to these 10 posts as our set of question candidates Q and the edits made to the posts in response to the questions as our set of answer candidates A.
Figure 4.3: The behavior of our model at test time. Given a post p, we retrieve 10 posts similar to post p using Lucene. The questions asked to those 10 posts are our question candidates Q and the edits made to the posts in response to the questions (or the author's response to the question in the comments section) are our answer candidates A. For each question candidate q_i, we generate an answer representation F_ans(p, q_i) and calculate how close each answer candidate a_j is to this answer representation. We then calculate the utility of the post p if it were updated with the answer a_j. Finally, we rank the candidate questions Q by their expected utility given the post p (Eq 4.1).

Since the top-most similar candidate extracted by Lucene is always the original post itself, the original question and answer paired with the post is always one of the candidates in Q and A. §3.1 describes in detail the process of extracting the (post, question, answer) triples from the StackExchange datadump.

4.2.2 Answer modeling

Given a post p and a question candidate q_i, our second step is to calculate how likely this question is to be answered using one of our answer candidates a_j. We first generate an answer representation by combining the neural representations of the post and the question using a function F_ans(p̄, q̄_i) (details in §4.2.4). Given such a representation, we measure the distance between this answer representation and one of the answer candidates a_j using the function below:

dist(F_ans(p̄, q̄_i), â_j) = 1 − cos_sim(F_ans(p̄, q̄_i), â_j)

The likelihood of an answer candidate a_j being the answer to a question q_i on post p is finally calculated by combining this distance with the cosine similarity between the question q_i and the question q_j paired with the answer candidate a_j:

P[a_j | p, q_i] = exp(−dist(F_ans(p̄, q̄_i), â_j)) · cos_sim(q̂_i, q̂_j)    (4.2)

where â_j, q̂_i and q̂_j are the average word vectors of a_j, q_i and q_j respectively (details in §4.2.4) and cos_sim is the cosine similarity between the two input vectors.

We model our answer generator using the following intuition: a question can be asked in several different ways. For example, in Figure 4.1, the question "What version of Ubuntu do you have?" can be asked in other ways like "What version of operating system are you using?", "Version of OS?", etc. Additionally, for a given post and a question, there can be several different answers to that question. For instance, "Ubuntu 14.04 LTS", "Ubuntu 12.0" and "Ubuntu 9.0" are all valid answers. To generate an answer representation capturing these generalizations, we train our answer generator on our triples dataset (§3.1) using the loss function below:

loss_ans(p̄_i, q̄_i, â_i, Q_i) = dist(F_ans(p̄_i, q̄_i), â_i) + Σ_{j ∈ Q_i} dist(F_ans(p̄_i, q̄_i), â_j) · cos_sim(q̂_i, q̂_j)    (4.3)

where â and q̂ are the average word vectors of a and q respectively (details in §4.2.4) and cos_sim is the cosine similarity between the two input vectors. This loss function can be explained using the example in Figure 4.4. Question q_i is the question paired with the given post p_i. In Eq 4.3, the first term forces the function F_ans(p̄_i, q̄_i) to generate an answer representation as close as possible to the correct answer a_i. Now, a question can be asked in several different ways. Let Q_i be the set of candidate questions for post p_i, retrieved from the dataset using Lucene (§4.2.1). Suppose a question candidate q_j is very similar to the correct question q_i (i.e. cos_sim(q̂_i, q̂_j) is close to one).
Then the second term forces the answer representation F_ans(p̄_i, q̄_i) to be close to the answer a_j corresponding to the question q_j as well. Thus in Figure 4.4, the answer representation will be close to a_j (since q_j is similar to q_i), but not necessarily close to a_k (since q_k is dissimilar to q_i). This is similar to the idea of co-occurrence smoothing (Essen and Steinbiss, 1992; Resnik, 1993), a method which combines prediction information of distinct words based on their distributional similarity in order to smooth language models.

Figure 4.4: Training of our answer generator. Given a post p_i and its question q_i, we generate an answer representation that is not only close to its original answer a_i, but also close to one of its candidate answers a_j if the candidate question q_j is close to the original question q_i.

4.2.3 Utility calculator

Given a post p and an answer candidate a_j, the third step is to calculate the utility of the updated post, i.e. U(p + a_j). As expressed in Eq 4.1, this utility function measures how useful it would be if a given post p were augmented with an answer a_j paired with a different question q_j in the candidate set. Although, theoretically, the utility of the updated post can be calculated using only the given post (p) and the candidate answer (a_j), empirically we find that our neural EVPI model performs better when the candidate question (q_j) paired with the candidate answer is a part of the utility function. We attribute this to the fact that much information about whether an answer increases the utility of a post is also contained in the question asked to the post.

We train our utility calculator using our dataset of (p, q, a) triples (§3.1). We label all the (p_i, q_i, a_i) triples from our dataset with label y = 1. To get negative samples, we make use of the answer candidates generated using Lucene as described in §4.2.1. For each a_j ∈ A_i, where A_i is the set of answer candidates for post p_i, we label the triple (p_i, q_j, a_j) with label y = 0, except for when a_j = a_i. Thus, for each post p_i in our triples dataset, we have one positive sample and nine negative samples. This idea of using implicit negative evidence for training is similar to the notion of contrastive estimation (Smith and Eisner, 2005). It should be noted that this is a noisy labelling scheme, since a candidate question other than the one originally paired with the post can oftentimes be a good question to ask of the post (§4.3). However, since we do not have annotations for such other good questions at train time, we assume such a labelling.

Given a post p_i and an answer a_j paired with the question q_j, we combine their neural representations using a function F_util(p̄_i, q̄_j, ā_j) (details in §4.2.4). The utility of the updated post is then defined as U(p_i + a_j) = σ(F_util(p̄_i, q̄_j, ā_j)), where σ is the sigmoid function. We want this utility to be close to 1 for all the positively labelled (p, q, a) triples and close to 0 for all the negatively labelled (p, q, a) triples. We therefore define our loss using the binary cross-entropy formulation below:

loss_util(y_i, p̄_i, q̄_j, ā_j) = −[ y_i log(σ(F_util(p̄_i, q̄_j, ā_j))) + (1 − y_i) log(1 − σ(F_util(p̄_i, q̄_j, ā_j))) ]    (4.4)
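As a small illustration of the labelling scheme just described (the data structures and names are assumptions of this sketch):

```python
def utility_training_examples(post, original_answer, candidates):
    """Build labelled (post, question, answer, y) examples for the utility calculator.

    candidates: list of (q_j, a_j) pairs retrieved via Lucene (Section 4.2.1),
    which includes the pair originally asked on the post."""
    examples = []
    for q_j, a_j in candidates:
        y = 1 if a_j == original_answer else 0   # one positive, nine negative samples
        examples.append((post, q_j, a_j, y))
    return examples
```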
4.2.4 Our joint neural network model

Our fundamental representation is based on recurrent neural networks over word embeddings. We obtain the word embeddings using the GloVe (Pennington et al., 2014) model trained on the entire datadump of StackExchange. In Eq 4.2 and Eq 4.3, the average word vector representations q̂ and â are obtained by averaging the GloVe word embeddings for all words in the question and the answer respectively. Given an initial post p, we generate a post neural representation p̄ using a post LSTM (long short-term memory architecture) (Hochreiter and Schmidhuber, 1997). The input layer consists of word embeddings of the words in the post, which are fed into a single hidden layer. The outputs of the hidden states are averaged together to get our neural representation p̄. Similarly, given a question q and an answer a, we generate the neural representations q̄ and ā using a question LSTM and an answer LSTM respectively. We define the function F_ans in our answer model as a feedforward neural network with five hidden layers on the inputs p̄ and q̄, as shown in Figure 4.5. Likewise, we define the function F_util in our utility calculator as a feedforward neural network with five hidden layers on the inputs p̄, q̄ and ā.

We train the parameters of the three LSTMs corresponding to p, q and a, and the parameters of the two feedforward neural networks, jointly to minimize the sum of the loss of our answer model (Eq 4.3) and our utility calculator (Eq 4.4) over our entire dataset:

Σ_i ( loss_ans(p̄_i, q̄_i, â_i, Q_i) + Σ_j loss_util(y_i, p̄_i, q̄_j, ā_j) )    (4.5)

Given such an estimate P[a_j | p, q_i] of an answer and a utility U(p + a_j) of the updated post, we rank the candidate questions by their value as calculated using Eq 4.1. The remaining question, then, is how to get data that enables us to train our answer model and our utility calculator. Given data, the training becomes a multitask learning problem, where we learn simultaneously to predict utility and to estimate the probability of answers.

Figure 4.5: Left: F_ans computed using a feedforward neural network over the post LSTM p̄ and question LSTM q̄ representations, and â computed using average word embeddings over words in the answer. Right: F_util computed using a feedforward neural network over the post LSTM p̄, question LSTM q̄ and answer LSTM ā representations.

Figure 4.6: Our LSTM architecture on a post p_i. The input layer consists of pre-trained word embeddings of the words in the post, which are fed into a single hidden layer. The output o_k of each of the hidden states is averaged together to get our neural representation p̄_i.

4.3 Evaluation design

We define our task as: given a post p and a set of candidate clarification questions Q, rank the questions according to their usefulness to the post. Since the candidate set includes the original question q that was asked to the post p, one possible approach to evaluation would be to look at how often the original question is ranked higher up in the ranking predicted by a model. However, there are two problems with this approach: 1) Our dataset creation process is noisy. The original question paired with the post may not be a useful question, e.g. "are you seriously asking this question?", "do you mind making that an answer?". (Data analysis in Chapter 3 suggests 9% of the questions are not useful.) 2) The nine other questions in the candidate set are obtained by looking at questions asked to posts that are similar to the given post. (Note that this setting is different from the distractor-based setting popularly used in dialogue (Lowe et al., 2015), where the distractor candidates are chosen randomly from the corpus.) This greatly increases the possibility of some other question(s) being more useful than the original question paired with the post. This motivates an evaluation design that does not rely solely on the original question but also uses human judgments. We randomly choose a total of
500 examples from the test sets of the three domains, proportional to their train set sizes (askubuntu: 160, unix: 90 and superuser: 250), to construct our evaluation set.

4.3.1 Annotation scheme

Due to the technical nature of the posts in our dataset, identifying useful questions requires technical experts. We recruit 10 such experts on Upwork who have prior experience in Unix based operating system administration. As a training process, we first ask the annotators to annotate a sample of 5 examples and provide them with feedback and additional guidance. We also ask annotators to rate their confidence on a scale of {1: Educated guess, 2: Pretty sure, 3: Quite sure}. The confidence on 17% of the annotations was rated as low, 47% as medium and 37% as high.

We provide the annotators with a post and a randomized list of the ten question candidates obtained using Lucene (§4.2.1) and ask them to select a single "best" (B) question to ask, and additionally mark as "valid" (V) other questions that they thought would be okay to ask in the context of the original post. We enforce that the "best" question always be marked as a "valid" question. We group the 10 annotators into 5 pairs and assign the same 100 examples to the two annotators in a pair.

4.3.2 Annotation analysis

We calculate the inter-annotator agreement on the "best" and the "valid" annotations using Cohen's Kappa. When calculating the agreement on the "best" in the strict sense, we get a low agreement of 0.15. However, when we relax this to the case where the question marked as "best" by one annotator is marked as "valid" by the other, we get an agreement of 0.87. The agreement on the "valid" annotations, on the other hand, was higher: 0.58. We calculate this agreement on the binary judgment of whether a question was marked as valid by the annotator.

Given these annotations, we calculate how often the original question is marked as "best" or "valid" by the two annotators. We find that 72% of the time one of the annotators marks the original as the "best", whereas only 20% of the time both annotators mark it as the "best", suggesting against an evaluation based solely on the original question. On the other hand, 88% of the time one of the two annotators marks it as a "valid" question, confirming the noise in our training data (76% of the time both annotators mark it as "valid"). Figure 4.7 shows the distribution of the counts of questions in the intersection of the "valid" annotations (blue legend). We see that about 85% of the posts have more than 2 valid questions and 50% have more than 3 valid questions. The figure also shows the distribution of the counts when the original question is removed from the intersection (red legend). Even in this set, we find that about 60% of the posts have more than two valid questions. These numbers suggest that the candidate set of questions retrieved using Lucene (§4.2.1) very often contains useful clarification questions.

Figure 4.7: Distribution of the count of questions in the intersection of the "valid" annotations.

4.4 Experimental results

Our primary research questions that we evaluate experimentally are:

1. Does a neural architecture with learned representations improve upon a simple bag-of-ngrams baseline?
2. Does the expected value of perfect information (EVPI) formalism provide leverage over a similarly expressive feedforward network?

3. Are answers useful in identifying the right question?

4. How do the models perform when evaluated on the candidate questions excluding the original?

4.4.1 Baseline methods

We compare our model with the following baselines:

                  B1 ∪ B2                    V1 ∩ V2                    Original
Model             p@1   p@3   p@5   MAP      p@1   p@3   p@5   MAP      p@1
Random            17.5  17.5  17.5  35.2     26.4  26.4  26.4  42.1     10.0
Bag-of-ngrams     19.4  19.4  18.7  34.4     25.6  27.6  27.5  42.7     10.7
Community QA      23.1  21.2  20.0  40.2     33.6  30.8  29.1  47.0     18.5
Neural (p, q)     21.9  20.9  19.5  39.2     31.6  30.0  28.9  45.5     15.4
Neural (p, a)     24.1  23.5  20.6  41.4     32.3  31.5  29.0  46.5     18.8
Neural (p, q, a)  25.2  22.7  21.3  42.5     34.4  31.8  30.1  47.7     20.5
EVPI              27.7  23.4  21.5  43.6     36.1  32.2  30.5  49.2     21.4

Table 4.1: Model performances on 500 samples when evaluated against the union of the "best" annotations (B1 ∪ B2), the intersection of the "valid" annotations (V1 ∩ V2) and the original question paired with the post in the dataset. The difference between the bold and the non-bold numbers is statistically significant with p < 0.05 as calculated using a bootstrap test. p@k is the precision of the k questions ranked highest by the model and MAP is the mean average precision of the ranking predicted by the model.

Random: Given a post, we randomly permute its set of 10 candidate questions uniformly. We take the average over 1000 random permutations.

Bag-of-ngrams: Given a post and a set of 10 question and answer candidates, we construct a bag-of-ngrams representation for the post, question and answer. We train the baseline on all the positive and negative candidate triples (same as in our utility calculator (§4.2.3)) to minimize hinge loss on misclassification error, using cross-product features between each of (p, q), (q, a) and (p, a). We tune the ngram length and choose n=3, which performs best on the tune set. The question candidates are finally ranked according to their predictions for the positive label.

Community QA: The recent SemEval2017 Community Question-Answering (CQA) task (Nakov et al., 2017) included a subtask for ranking a set of comments according to their relevance to a given post in the Qatar Living forum (http://www.qatarliving.com/forum). Nandi et al. (2017), winners of this subtask, developed a logistic regression model using features based on string similarity, word embeddings, etc. We train this model on all the positively and negatively labelled (p, q) pairs in our dataset (same as in our utility calculator (§4.2.3), but without a). We use a subset of their features relevant to our task. Details are in §4.4.2.

Neural baselines: We construct the following neural baselines based on the LSTM representations of their inputs (as described in §4.2.4):

1. Neural(p, q): Input is the concatenation of p̄ and q̄.

2. Neural(p, a): Input is the concatenation of p̄ and ā.

3. Neural(p, q, a): Input is the concatenation of p̄, q̄ and ā.

Given these inputs, we construct a fully connected feedforward neural network with 10 hidden layers and train it to minimize the binary cross-entropy across all positive and negative candidate triples (same as in our utility calculator (§4.2.3)). We use 10 hidden layers (double the number used in our EVPI model) to ensure that the improvement of our EVPI model is not merely because of an increased number of parameters.
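For concreteness, the sketch below shows the shape of such a baseline scorer in PyTorch; the dimensions, names and the toy training step are illustrative assumptions, not the dissertation's actual code.

```python
import torch
import torch.nn as nn

class NeuralPQABaseline(nn.Module):
    """Sketch of the Neural(p, q, a) baseline: concatenate the post, question and
    answer representations and score them with a deep feedforward network."""
    def __init__(self, dim=200, num_hidden_layers=10):
        super().__init__()
        layers = [nn.Linear(3 * dim, dim), nn.ReLU()]
        for _ in range(num_hidden_layers - 1):
            layers += [nn.Linear(dim, dim), nn.ReLU()]
        layers.append(nn.Linear(dim, 1))
        self.scorer = nn.Sequential(*layers)

    def forward(self, p_bar, q_bar, a_bar):
        x = torch.cat([p_bar, q_bar, a_bar], dim=-1)
        return torch.sigmoid(self.scorer(x)).squeeze(-1)

# illustrative training step on a batch of pre-computed LSTM representations
model = NeuralPQABaseline()
loss_fn = nn.BCELoss()
p_bar, q_bar, a_bar = (torch.randn(32, 200) for _ in range(3))
labels = torch.randint(0, 2, (32,)).float()   # 1 = original (p, q, a) triple, 0 = negative sample
loss = loss_fn(model(p_bar, q_bar, a_bar), labels)
loss.backward()
```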
The major difference between the neural baselines and our EVPI model is in the loss function: the EVPI model is trained to minimize the joint loss between the answer model (defined on Fans(p, q) in Eq 4.3) and the utility calculator (defined on Futil(p, q, a) in Eq 4.4) whereas the neural baselines are trained to minimize the loss directly on F (p, q), F (p, a) or F (p, q, a). 4.4.2 Implementation details Preprocessing: We tokenize the raw text in our post, question and answer using the NLTK tokenizer. We restrict the post to its first 300 tokens and the question and answer to first 40 tokens. In our work, we choose these token lengths based on the average lengths of posts and questions in the dataset. However, it is an open research question as to how would changing these token lengths influence the model predictions. 66 Word embedding model: Each post, question and answer in our dataset is rep- resented using embeddings. To generate these embeddings, we train 200 dimensional word embeddings using GloVe on the 3 billion token datadump of StackExchange. Since the total number of tokens in the datadump is large, we use an unusually large threshold frequency of 100 to create a vocabulary of 250,000 tokens. All tokens with a frequency of less than 100 in our dataset get assigned an ‘UNK’ token. Model hyperparameters: The hidden layers in all the neural models are of size 200. We use ReLU non-linearity as our activation function between the hidden layers. We use a batch size of 128. We train the models for up to 14 epochs and at test time we use the predictions of the epoch where the performance on the tune set is the best. Community QA baseline: We use the implementation provided by the winning team of the SemEval2017 Community Question-Answering (cQA) subtask 3.9 Their original model contains six feature groups: string similarity features, word embed- ding features, topic modeling features, keyword features, meta data features and dialogue identification features. Since we do not have information about the latter three features in our dataset, we use only the first three features and train a logistic regression model to obtain the confidence scores on the positive labels. 9https://github.com/TitasNandi/cQARank 67 4.4.3 Results 4.4.3.1 Evaluating against expert annotations We first describe the results of the different models when evaluated against the expert annotations we collect on 500 samples (§4.3). Since the annotators had a low agreement on a single best, we evaluate against the union of the “best” annotations (B1 ∪ B2 in Table 4.1) and against the intersection of the “valid” annotations (V 1 ∩ V 2 in Table 4.1). Among non-neural baselines, we find that the bag-of-ngrams baseline performs slightly better than random but worse than all the other models. The Community QA baseline, on the other hand, performs better than the neural baseline (Neural (p, q)), both of which are trained without using the answers. The neural baselines with answers (Neural(p, q, a) and Neural(p, a)) outperform the neural baseline with- out answers (Neural(p, q)), showing that answer helps in selecting the right question. More importantly, EVPI outperforms the Neural (p, q, a) baseline across most metrics. Both models use the same information regarding the true question and answer and are trained using the same number of model parameters.10 However, the EVPI model, unlike the neural baseline, additionally makes use of alternate question and answer candidates to compute its loss function. 
This shows that when the candidate set consists of questions similar to the original question, summing over their utilities gives us a boost. 10We use 10 hidden layers in the feedforward network of the neural baseline and five hidden layers each in the two feedforward networks Fans and Futil of the EVPI model. 68 We can interpret the absolute numbers obtained by our best (EVPI) model in a real world setting as follows: Given 10 candidate questions obtained from Lucene, around 28% of the time, the top ranked question is the best question whereas around 36% of the time, the top ranked question is a valid question. Likewise, around 23% of the time, the top three questions are the best questions whereas around 32% of the time, the top three questions are valid questions. Although these absolute numbers are relatively low, in this work, we set the baseline for this novel task and hope that this work will encourage future work in this space. 4.4.3.2 Evaluating against the original question The last column in Table 4.1 shows the results when evaluated against the original question paired with the post. The bag-of-ngrams baseline performs similar to random, unlike when evaluated against human judgments. The Community QA baseline again outperforms Neural(p, q) model and comes very close to the Neural (p, a) model. As before, the neural baselines that make use of the answer outperform the one that does not use the answer and the EVPI model performs significantly better than Neural(p, q, a). 4.4.3.3 Excluding the original question In the preceding analysis, we considered a setting in which the “ground truth” original question was in the candidate set Q. While this is a common evaluation 69 B1 ∪B2 V 1 ∩ V 2 Model p@1 p@3 p@5 MAP p@1 p@3 p@5 MAP Random 17.4 17.5 17.5 26.7 26.3 26.4 26.4 37.0 Bag-of-ngrams 16.3 18.9 17.5 25.2 26.7 28.3 26.8 37.3 Community QA 22.6 20.6 18.6 29.3 30.2 29.4 27.4 38.5 Neural (p,q) 20.6 20.1 18.7 27.8 29.0 29.0 27.8 38.9 Neural (p,a) 22.6 20.1 18.3 28.9 30.5 28.6 26.3 37.9 Neural (p,q,a) 22.2 21.1 19.9 28.5 29.7 29.7 28.0 38.7 EVPI 23.7 21.2 19.4 29.1 31.0 30.0 28.4 39.6 Table 4.2: Model performances on 500 samples when evaluated against the union of the “best” annotations (B1 ∪ B2) and intersection of the “valid” annotations (V 1∩V 2), with the original question excluded. The difference between all numbers except the random and bag-of-ngrams are statistically insignificant. framework in dialog response selection (Lowe et al., 2015), it is overly optimistic. We, therefore, evaluate against the “best” and the “valid” annotations on the nine other question candidates. We find that the neural models beat the non-neural baselines. However, the differences between all the neural models are statistically insignificant. Results are shown in Table 4.2 70 4.5 Example outputs To understand the behavior of our EVPI model, we have included three exam- ple outputs in Table 4.4 one each from the three domains in our dataset. The first example is a case where the EVPI model predicts both the “best” and the “valid” questions higher in its ranking. The original poster is facing some issue they call the “suspend resume” issue. The post is unclear on what problem the poster is facing. Hence the “best” question asks for that information. In the second example, the model predicts one of the “valid” questions higher up in its ranking but fails to predict the “best” question. 
The model predicts “why would you need this” with very high probability likely because it is a very generic question, unlike the question marked as “best” by the annotator which is too specific. In the third example, the model again predicts a very generic question which is also marked as “valid” by the annotator. These examples suggest that the model is good at correctly predicting generic questions, but not at predicting very specific questions. 4.6 Conclusion In this chapter we describe a novel model for the task of ranking clarification questions. Our model integrates well-known deep network architectures with the classic notion of expected value of perfect information, which effectively models a pragmatic choice on the part of the questioner: how do I imagine the other party would answer if I were to ask this question. Such pragmatic principles have 71 recently been shown to be useful in other tasks as well (Andreas and Klein, 2016; Golland et al., 2010; Orita et al., 2015; Smith et al., 2013). One can naturally extend our EVPI approach to a full reinforcement learning approach to handle multi-turn conversations. Our results show that the EVPI model is a promising formalism for the ques- tion generation task. In order to move to a full system that can help users like Terry write better posts, the model needs to be able to generalize. For instance, if our model has access to posts in the training data that only discuss Ubuntu operating system, then our ranking model will never be able to generate a question such as “What version of Windows are you using?” even if it has seen questions such as “What version of Ubuntu are you using?”. Another issue with our ranking model is that it relies on Lucene to retrieve a good initial set of candidate questions. In order to be able to exploit the usefulness of our model to the fullest, we therefore move from question ranking to a question generation task setup where given a con- text, we develop a model to generate a question from scratch. In our next chapter, we describe our question generation model that is based on sequence-to-sequence neural network models that have recently proven to be effective for several language generation tasks (Serban et al., 2016b; Sutskever et al., 2014; Yin et al., 2016). 72 Title: Ubuntu 15.10 instant resume from suspend Post: I have an ASUS desktop PC that I decided to install Ubuntu onto. I have used Linux before, specifically for 3 years in High School. I have never encountered suspend resume issues on Linux before. It appears that my PC is resuming from suspend on Ubuntu 15.10 I am not sure what is causing this, but my hardware is as follows: Intel Core i5 4460 @ 3.2 GHz 2 TB Toshiba 7200 RPM disk 8 GB DDR3 RAM Corsair CX 500 Power Supply AMD Radeon R9 270X Graphics - 4 Gigs ASUS Motherboard for OEM builds VIA technologies USB 3.0 Hub Realtek Network Adapter Any help is greatly appreciated. I haven’t worked with Linux in over a year, and as I plan to pursue a career in Comp Science (specifically through internshipsl) and this is a problem, as I don’t want to drive the power bill up. (Even though I don’t pay it, my parents do.) 0.87 does suspend - resume work as expected ? 0.71 what , specifically , is the problem you want help with ? 0.70 the suspend problem exits only if a virtual machines is running ? 0.67 is the pasted workaround still working for you ? 0.57 just wondering if you got a solution for this ? 0.50 we *could* try a workaround , with a keyboard shortcut . would that interest you ? 
0.49 did you restart the systemd daemon after the changes ‘sudo restart systemd-logind‘ ? 0.49 does running ‘sudo modprobe -r psmouse ; sleep 1 ; sudo modprobe psmouse‘ enable the touchpad ? 0.49 2 to 5 minutes ? 0.49 does it work from the menu or not ? Table 4.3: Example of human annotation from the askubuntu domain of our dataset. The questions are sorted by expected utility, given in the first column. The “best” annotation is marked with black ticks and the “valid”’ annotations are marked with grey ticks . 73 Title: Frozen Linux Recovery Without SysReq Post: RHEL system has run out of memory and is now frozen. The SysReq commands are not working, so I am not even sure that /proc/sys/kernel/sysrq is set to 1. Is there any other ”safe” way I can reboot w/out power cycling? 0.91 why would you need this ? 0.77 maybe you need to use your ‘fn‘ key when pressing print screen ? 0.59 do you have sudo rights on this computer ? 0.55 are you sure sysrq is enabled on your machine ? 0.52 did you look carefully at the logs when you rebooted after it hung ? 0.51 i assume you have data open which needs to be saved ? 0.50 define “ frozen ” . did it panic ? or did something else happen ? 0.50 maybe you need to use your ‘fn‘ key when pressing print screen ? 0.50 tried ctrl + alt + f2 ? 0.49 does the script process 1 iteration successfully ? 0.49 laptop or desktop ? Title: How to flash a USB drive?. Post: I have a 8 GB Sandisk USB drive. Recently it became write somehow. So I searched in Google and I tried to remove the write protection through almost all the methods I found. Unfortunately nothing worked. So I decided to try some other ways. Some said that flashing the USB drive will solve the problem. But I don’t know how. So how can it be done ? 1.01 what file system was the drive using ? 1.00 was it 16gb before or it has been 16mb from the first day you used it ? 0.74 which os are you using ? which file system is used by your pen drive ? 0.64 what operation system you use ? 0.51 can you narrow ’a hp usb down ’ ? 0.50 could the device be simply broken ? 0.50 does it work properly on any other pc ? 0.50 usb is an interface , not a storage device . was it a flash drive or a portable disk ? 0.49 does usb flash drive tester have anything useful to say about the drive ? 0.49 your drive became writeable ? or read-only ? Table 4.4: Examples of human annotation from the unix and superuser domain of our dataset. The questions are sorted by expected utility, given in the first column. The “best” annotation is marked with black ticks and the “valid”’ annotations are marked with grey ticks . 74 Chapter 5: Question Generation Model 5.1 Introduction In this chapter, we describe our clarification question generation model which given a context, generates a question one word at a time. Our clarification question generation model builds on the sequence-to-sequence approach that has proven ef- fective for several language generation tasks (Du et al., 2017; Serban et al., 2016b; Sutskever et al., 2014; Yin et al., 2016). Unfortunately, training a sequence-to- sequence model directly on (context, question) pairs yields questions that are highly generic1, corroborating a common finding in dialog systems (Li et al., 2016b). Our goal is to be able to generate clarification questions that are useful and specific. 
To achieve this, we begin with a recent observation of Rao and Daumé III (2018), who considered the task of question reranking: a good clarification question is the one whose answer has a high utility, which they defined as the likelihood that this question would lead to an answer that will make the context more complete (§5.2.3). Inspired by this, we construct a question generation model that first gen- erates a question given a context, and then generates a hypothetical answer to that 1For instance, under home appliances, frequently asking “Is it made in China?” or “What are the dimensions?” 75 question. Given this (context, question, answer) triple, we train a utility calculator to estimate the usefulness of this question. We then show that this utility calculator can be generalized using ideas for generative adversarial networks Goodfellow et al. (2014) for text Yu et al. (2017), wherein the utility calculator plays the role of the “discriminator” and the question generator is the “generator” (§ 5.2.2), which we train using the Mixer algorithm Ranzato et al. (2015). We evaluate our approach on two question generation datasets. The first is the Stack Exchange dataset (Table 5.2) where given a post, we train a model to generate a clarification question that points at missing information that could be potentially useful to someone trying to resolve the issue in the post. The second is the Amazon dataset (Table 5.1) where given a product description, we train a model to generate a clarification question that points at missing information that a potentially buyer might find useful. Using both automatic metrics and human evaluation, we demonstrate that although all models generate questions that are relevant to the context at hand, our adversarially-trained model generates more useful and specific questions than all the baseline models. 5.2 Training a Clarification Question Generator Our goal is to build a model that, given a context, can generate an appro- priate clarification question. Our dataset consists of (context, question, answer) triples where the context is an initial textual context, question is the clarification question that asks about some missing information in the context and answer is the 76 Product T-fal Nonstick Cookware Set, title 18 pieces, Red Product Easy non-stick 18pc set includes every description piece for your everyday meals. Exceptionally durable dishwasher safe cookware for easy clean up. Durable non-stick interior. Oven safe up to 350.F/177.C Question Are they induction compatible? Answer They are aluminium so the answer is NO. Table 5.1: Sample product description from amazon.com paired with a clarification question and answer. Title Wifi keeps dropping on 5Ghz network Post Recently my wireless has been very iffy at my university. I notice that I am connected to a 5Ghz network, while I am usually connected to a 2.4Ghz everywhere else (where everything works just fine). Sometimes it reconnects, but often I have to run ‘sudo service network-manager restart‘. Is it possible a kernel update has caused this? Question what is the make of your wifi card ? Answer intel corporation wireless 7260 ( rev 73 ) Table 5.2: Sample post from stackexchange.com paired with a clarification question and answer. answer to the clarification question (details in ??). Representationally, our question generator is a standard sequence-to-sequence model with attention (§ 5.2.1). The learning problem is: how to train the sequence-to-sequence model to generate good clarification questions. 
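Before walking through each component, the following Python sketch outlines one training step at a high level. Every object here (question_generator, answer_generator, utility_calculator and their methods) is a hypothetical stand-in for the components defined in § 5.2.1 through § 5.2.5, not actual code from this work.

```python
# Illustrative outline of one training step of the pipeline described in this chapter.
# All names below are hypothetical placeholders for the component models.
def train_one_batch(contexts, true_questions,
                    question_generator, answer_generator, utility_calculator):
    # 1. The question generator proposes a clarification question for each context.
    generated_questions = question_generator.generate(contexts)

    # 2. A pretrained answer generator produces a hypothetical answer to each question.
    generated_answers = answer_generator.generate(contexts, generated_questions)

    # 3. The Utility calculator scores how useful the (question, answer) pair is for the
    #    context; the score is used as the reward for the Mixer update of the generator.
    rewards = utility_calculator.score(contexts, generated_questions, generated_answers)
    question_generator.mixer_update(contexts, generated_questions, rewards)

    # 4. Adversarial step: the Utility calculator (the discriminator) is also updated to
    #    separate (context, true question, generated answer) triples from
    #    (context, generated question, generated answer) triples.
    answers_for_true = answer_generator.generate(contexts, true_questions)
    utility_calculator.discriminator_update(
        positives=(contexts, true_questions, answers_for_true),
        negatives=(contexts, generated_questions, generated_answers))
```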
An overview of our training setup is shown in Figure 5.1. Given a context, our question generator, which is a sequence-to-sequence model, outputs a question. In order to evaluate the usefulness of this question, we then have a second sequence-to-sequence model called the "answer generator" that generates a hypothetical answer based on the context and the question (§ 5.2.5). This (context, generated question, generated answer) triple is fed into a Utility calculator, whose initial goal is to estimate the probability that this (question, answer) pair is useful in this context (§ 5.2.3). This Utility is treated as a reward, which is used to update the question generator using the Mixer algorithm (Ranzato et al., 2015) (§ 5.2.2). Finally, we reinterpret the answer-generator-plus-utility-calculator component as a discriminator for differentiating between (context, true question, generated answer) triples and (context, generated question, generated answer) triples, and optimize the generator for this adversarial objective using Mixer (§ 5.2.4).
Figure 5.1: Overview of our GAN-based clarification question generation model.
5.2.1 Sequence-to-sequence Model for Question Generation
We use a standard attention-based sequence-to-sequence model (Luong et al., 2015) for our question generator. Given an input sequence (context) c = (c_1, c_2, ..., c_N), this model generates an output sequence (question) q = (q_1, q_2, ..., q_T). The architecture of this model is an encoder-decoder with attention. The encoder is a recurrent neural network (RNN) operating over the input word embeddings to compute a source context representation c̃. The decoder uses this source representation to generate the target sequence one word at a time:

p(q \mid c) = \prod_{t=1}^{T} p(q_t \mid q_1, q_2, ..., q_{t-1}, \tilde{c}_t) = \prod_{t=1}^{T} \mathrm{softmax}(W_s \tilde{h}_t), \quad \text{where } \tilde{h}_t = \tanh(W_c [\tilde{c}_t ; h_t])   (5.1)

In Eq 5.1, \tilde{h}_t is the attentional hidden state of the RNN at time t, obtained by concatenating the target hidden state h_t and the source-side context vector \tilde{c}_t; W_s is a linear transformation that maps \tilde{h}_t to an output vocabulary-sized vector, and W_s and W_c are parameters of the model. The predicted token q_t is the token in the vocabulary that is assigned the highest probability by the softmax function. The standard training objective for a sequence-to-sequence model is to maximize the log-likelihood of all (c, q) pairs in the training data D, which is equivalent to minimizing the loss

L_{mle}(D) = - \sum_{(c,q) \in D} \sum_{t=1}^{T} \log p(q_t \mid q_1, q_2, ..., q_{t-1}, c)   (5.2)

Each attentional hidden state \tilde{h}_t depends on a distinct input context vector \tilde{c}_t computed using a global attention mechanism over the input hidden states as:

\tilde{c}_t = \sum_{n=1}^{N} a_{nt} h_n   (5.3)

a_{nt} = \mathrm{align}(h_n, h_t) = \exp\left(h_t^{\top} W_a h_n\right) \Big/ \sum_{n'} \exp\left(h_t^{\top} W_a h_{n'}\right)   (5.4)

The attention weights a_{nt} are calculated based on the alignment score between the source hidden state h_n and the current target hidden state h_t.
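The global attention computation of Eqs 5.3 and 5.4 can be sketched in a few lines. The snippet below is a minimal illustrative sketch for a single example under assumed tensor shapes, not the implementation used in this work.

```python
# Assumed sketch of the global attention step (Eqs 5.3-5.4): score every encoder
# hidden state against the current decoder state, normalize with a softmax, and
# build the context vector as the weighted sum of encoder states.
import torch.nn.functional as F

def global_attention(encoder_states, decoder_state, W_a):
    """
    encoder_states: (N, hidden)       -- h_1 ... h_N for one example
    decoder_state:  (hidden,)         -- current target hidden state h_t
    W_a:            (hidden, hidden)  -- bilinear attention parameter
    Returns the context vector c~_t and the attention weights a_t.
    """
    # alignment scores: each entry equals h_t^T W_a h_n (numerator of Eq 5.4)
    scores = encoder_states @ W_a.T @ decoder_state   # (N,)
    weights = F.softmax(scores, dim=0)                # normalize over source positions
    context = weights @ encoder_states                # (hidden,), Eq 5.3
    return context, weights

# The attentional hidden state of Eq 5.1 would then be
#   h~_t = tanh(W_c @ concat(context, decoder_state)).
```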
5.2.2 Training the Generator to Optimize Utility
Training sequence-to-sequence models for the task of clarification question generation (with context as input and question as output) using the maximum likelihood objective unfortunately leads to the generation of highly generic questions, such as "What are the dimensions?" when asking questions about home appliances.
Recently, Rao and Daumé III (2018) observed that the usefulness of a question can be better measured as the utility that would be obtained if the context were updated with the answer to the proposed question. Following this observation, we first use a pretrained answer generator (§ 5.2.5) to generate an answer given a context and a question. We then use a pretrained Utility calculator (§ 5.2.3) to predict the likelihood that the generated answer would increase the utility of the context by adding useful information to it. Finally, we train our question generator to optimize this Utility-based reward.
Similar to metrics like Bleu and Rouge, this Utility calculator operates on discrete text outputs, which makes optimization difficult due to non-differentiability. A successful recent approach that deals with this non-differentiability while also retaining some advantages of maximum likelihood training is the Mixed Incremental Cross-Entropy Reinforce (Mixer) algorithm (Ranzato et al., 2015). In Mixer, the overall loss L is differentiated as in Reinforce (Williams, 1992):

L(\theta) = -\mathbb{E}_{q^s \sim p_\theta}\, r(q^s) \; ; \qquad \nabla_\theta L(\theta) = -\mathbb{E}_{q^s \sim p_\theta}\, r(q^s) \nabla_\theta \log p_\theta(q^s)   (5.5)

where q^s is a random output sample according to the model p_\theta and \theta are the parameters of the network. The expected gradient is then approximated using a single sample q^s = (q^s_1, q^s_2, ..., q^s_T) from the model distribution p_\theta. In Reinforce, the policy is initialized randomly, which can cause long convergence times. To solve this, Mixer starts by optimizing maximum likelihood for the initial \Delta time steps, and slowly shifts to optimizing the expected reward from Eq 5.5 for the remaining (T - \Delta) time steps. In our model, for the initial \Delta time steps we minimize L_{mle}, and for the remaining steps we minimize the following Utility-based loss:

L_{max\text{-}utility} = -(r(q^p) - r(q^b)) \sum_{t=1}^{T} \log p(q_t \mid q_1, ..., q_{t-1}, c_t)   (5.6)

where r(q^p) is the Utility-based reward on the predicted question and r(q^b) is a baseline reward introduced to reduce the high variance otherwise observed when using Reinforce. To estimate this baseline reward, we take the idea from the self-critical training approach (Rennie et al., 2017), where the baseline is estimated using the reward obtained by the current model under greedy decoding at test time. We find that this approach to baseline estimation stabilizes our model better than the approach used in Mixer.
5.2.3 Estimating a Utility Function from Historical Data
Given a (context, question, answer) triple, in the previous chapter we introduced a utility calculator Utility(c, q, a) to calculate the value of updating a context c with the answer a to a clarification question q. The idea behind this utility calculator is to estimate the probability that an answer would be a meaningful addition to a context, and to treat this as a binary classification problem where the positive instances are the true (context, question, answer) triples in the dataset whereas the negative instances are contexts paired with a random (question, answer) pair from the dataset. The model first embeds the words in the context c, then uses an LSTM (long short-term memory) network (Hochreiter and Schmidhuber, 1997) to generate a neural representation c̄ of the context by averaging the outputs of the hidden states. Similarly, we obtain neural representations q̄ and ā of q and a respectively using question and answer LSTM models. Finally, a feedforward neural network FUtility(c̄, q̄, ā) predicts the usefulness of the question.
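This Utility score is what supplies the reward r(·) in the Mixer update of § 5.2.2. As an illustrative sketch (the generator and utility objects and their methods are hypothetical wrappers, not the actual training code), the reward-based portion of the loss in Eq 5.6 with the self-critical baseline could be computed as follows:

```python
# Assumed sketch of the Utility-based update (Eq 5.6): the reward is the Utility score
# of a sampled question, baselined by the Utility score of the greedily decoded question.
# Only the reward-driven part of Mixer is shown; the first Delta steps use the MLE loss.
def max_utility_loss(generator, utility, context):
    """
    generator: hypothetical seq2seq wrapper exposing sample() / greedy() / log_prob()
    utility:   hypothetical callable that scores (context, question) via the answer
               generator plus Utility calculator; returns a tensor per example
    """
    sampled_q = generator.sample(context)     # q^p: sampled prediction
    greedy_q = generator.greedy(context)      # q^b: greedy (baseline) prediction

    reward = utility(context, sampled_q)      # r(q^p)
    baseline = utility(context, greedy_q)     # r(q^b)
    advantage = (reward - baseline).detach()  # no gradient flows through the reward

    # sum_t log p(q_t | q_<t, context) for the sampled question
    log_prob = generator.log_prob(context, sampled_q)

    # Eq 5.6: scale the sequence log-likelihood by the baselined reward
    return -(advantage * log_prob).mean()
```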
5.2.4 Utility GAN for Clarification Question Generation
The Utility function trained on true vs random samples from real data (as described in the previous section) can be a weak reward signal for questions generated by a model due to the large discrepancy between the true data and the model's outputs. In order to strengthen the reward signal, we reinterpret the Utility function (coupled with the answer generator) as a discriminator in an adversarial learning setting. That is, instead of taking the Utility calculator to be a fixed model that outputs the expected quality of a question/answer pair, we additionally optimize it to distinguish between true question/answer pairs and model-generated ones. This reinterpretation turns our model into a form of a generative adversarial network (GAN) (Goodfellow et al., 2014).
A GAN is a training procedure for "generative" models that can be interpreted as a game between a generator and a discriminator. The generator is an arbitrary model g ∈ G that produces outputs (in our case, questions). The discriminator is another model d ∈ D that attempts to classify between true outputs and model-generated outputs. The goal of the generator is to generate data such that it can fool the discriminator; the goal of the discriminator is to be able to successfully distinguish between real and generated data. In the process of trying to fool the discriminator, the generator produces data that is as close as possible to the real data distribution. Generically, the GAN objective is:

L_{GAN}(\mathcal{D}, \mathcal{G}) = \max_{d \in \mathcal{D}} \min_{g \in \mathcal{G}} \mathbb{E}_{x \sim \hat{p}} \log d(x) + \mathbb{E}_{z \sim p_z} \log\big(1 - d(g(z))\big)   (5.7)

where x is sampled from the true data distribution \hat{p}, and z is sampled from a prior p_z defined on input noise variables.
Although GANs have been successfully used for image tasks, training GANs for text generation is challenging due to the discrete nature of outputs in text. The discrete outputs from the generator make it difficult to pass the gradient update from the discriminator to the generator. Recently, Yu et al. (2017) proposed a sequence GAN model for text generation to overcome this issue. They treat their generator as an agent and use the discriminator as a reward function to update the generative model using reinforcement learning techniques. By modeling the generator as a stochastic policy and directly training the policy via policy gradient, they avoid the differentiation difficulty at the cost of a much harder optimization problem. Our GAN-based approach is inspired by this sequence GAN model with two main modifications: a) we use the Mixer algorithm as our generator (§ 5.2.2) instead of the policy gradient approach; and b) we use the Utility function (§ 5.2.3) as our discriminator instead of a convolutional neural network (CNN).
Theoretically, the discriminator should be trained using (context, true question, true answer) triples as positive instances and (context, generated question, generated answer) triples as the negative instances. However, we find that training a discriminator using such positive instances makes it very strong, since the generator would have to not only generate real-looking questions but also generate real-looking answers to fool the discriminator. Since our main goal is question generation and since we use answers only as latent variables, we instead use (context, true question, generated answer) triples as our positive instances, where we use the pretrained answer generator to get the generated answer for the true question.
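As an illustration of this choice of training instances, a hypothetical helper for assembling one discriminator batch might look as follows; the names and structure are assumptions for exposition, not the implementation used in this work.

```python
# Assumed sketch: positives pair the context with the TRUE question (and a generated
# answer), while negatives pair it with the MODEL-generated question (and a generated
# answer). The Utility discriminator is then updated to score positives above negatives.
def build_discriminator_batch(contexts, true_questions,
                              question_generator, answer_generator):
    generated_questions = question_generator.generate(contexts)

    positives = [
        (c, q_true, answer_generator.generate(c, q_true))
        for c, q_true in zip(contexts, true_questions)
    ]
    negatives = [
        (c, q_gen, answer_generator.generate(c, q_gen))
        for c, q_gen in zip(contexts, generated_questions)
    ]
    return positives, negatives
```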
Formally, our objective function is:

L_{GAN\text{-}U}(\mathcal{U}, \mathcal{M}) = \max_{u \in \mathcal{U}} \min_{m \in \mathcal{M}} \mathbb{E}_{q \sim \hat{p}} \log u(c, q, \mathcal{A}(c, q)) + \mathbb{E}_{c \sim \hat{p}} \log\big(1 - u(c, m(c), \mathcal{A}(c, m(c)))\big)   (5.8)

where \mathcal{U} is the Utility discriminator, \mathcal{M} is the Mixer generator, \hat{p} is our data of (context, question, answer) triples and \mathcal{A} is our answer generator.
5.2.5 Pretraining
Question Generator. We pretrain our question generator using the sequence-to-sequence model (§ 5.2.1) to maximize the log-likelihood of all (context, question) pairs in the training data. Parameters of this model are updated during adversarial training.
Answer Generator. We pretrain our answer generator using the sequence-to-sequence model (§ 5.2.1) to maximize the log-likelihood of all ([context+question], answer) pairs in the training data. Parameters of this model are kept fixed during the adversarial training.2
2We leave experimenting with updating the parameters of the answer generator during adversarial training to future work.
Discriminator. In our Utility GAN model (§ 5.2.4), the discriminator is trained to differentiate between true and generated questions. However, since we want to guide our Utility-based discriminator to also differentiate between true ("good") and random ("bad") questions, we pretrain our discriminator in the same way we trained our Utility calculator. For positive instances, we use a context and its true question and answer from the training data, and for negative instances, we use the same context but randomly sample a question from the training data (and use the answer paired with that random question).
5.3 Experimental Results
We base our experimental design on the following research questions:
1. Do generation models outperform simpler retrieval baselines?
2. Does optimizing the Utility reward improve over maximum likelihood training?
3. Does using adversarial training improve over optimizing the pretrained Utility?
4. How do the models perform when evaluated for nuances such as specificity and usefulness?
We evaluate our model on both the StackExchange and the Amazon datasets described in Chapter 3.
5.3.1 Baselines and Ablated Models
We compare three variants (ablations) of our proposed approach, together with an information retrieval baseline:
GAN-Utility is our full model, which uses the Utility-function-based GAN training (§ 5.2.4), including the Utility discriminator, a Mixer question generator and a sequence-to-sequence based answer generator.
Max-Utility is our reinforcement learning baseline: the pretrained question generator trained with the Utility reward as described in § 5.2.2, without the adversarial training.
MLE is the question generator model pretrained on (context, question) pairs using the maximum likelihood objective (§ 5.2.1).
Lucene Given a context, we use Lucene to retrieve the top 10 contexts that are most similar to the given context. We randomly choose a question from the 10 questions paired with these contexts to construct our Lucene baseline. For the Amazon dataset, we ignore questions asked about products of the same brand as the given product, since Amazon replicates questions across products of the same brand, which would allow the true question to be included in that set.
5.3.2 Experimental Details
In this section, we describe the details of our experimental setup. We preprocess all inputs (context, question and answers) using tokenization and lowercasing. We set the max length of the context to be 100, the question to be 20 and the answer to be 20.
We test with context lengths of 150 and 200 and find that the automatic metric results are similar to those with a context length of 100, but the experiments take much longer. Hence, we set the max context length to be 100 for all our experiments. Similarly, we find that increased question and answer lengths yield similar results with increased experimentation time.
Our sequence-to-sequence model (§ 5.2.1) operates on word embeddings which are pretrained on in-domain data using GloVe (Pennington et al., 2014). As frequently done in previous work on neural network modeling, we use embeddings of size 200 and a vocabulary with the cut-off frequency set to 10. During training, we use teacher forcing (Williams and Zipser, 1989). During test time, we use beam search decoding with beam size 5. We use two hidden layers for both the encoder and decoder recurrent neural network models, with the hidden unit size set to 100. We use a dropout of 0.5 and a learning rate of 0.0001. We use a batch size of 128. In the Mixer model, we start with ∆ = T and decrease it by 2 every epoch (we found decreasing ∆ to 0 to be ineffective for our task, hence we stop at 2). We pretrain the question generator and the answer generator for 100 epochs and run the Reinforce and the adversarial training for 8 epochs.
We would like to note here that our decisions on these hyperparameter settings have been influenced by the following previous works that have done a more systematic investigation of how these hyperparameters influence model predictions. Neishi et al. (2017) perform a detailed analysis of hyperparameter tuning of sequence-to-sequence models for the task of machine translation. Khandelwal et al. (2018) discuss how neural language models make use of context and find that these models are more sensitive to nearby contexts (up to 100 tokens) and less sensitive to tokens beyond that window. Qi et al. (2018) investigate the usefulness of using pretrained word embeddings and find that in the case of scarcity of in-domain data (such as low-resource machine translation), the use of pretrained word embeddings can be very effective.
5.3.3 Evaluation Metrics
We evaluate initially with several automated evaluation metrics, and then more substantially based on crowdsourced human judgments.
5.3.3.1 Automatic Metrics
Diversity, which calculates the proportion of unique trigrams in the output, as commonly used to evaluate dialogue generation (Li et al., 2016b). We report trigrams, but bigrams and unigrams follow similar trends.
Bleu (Papineni et al., 2002), which evaluates n-gram precision between a predicted sentence and reference sentences.
Meteor (Banerjee and Lavie, 2005), which is similar to Bleu but includes stemmed and synonym matches when measuring the similarity between the predicted sequence and the reference sequences.
5.3.3.2 Human Judgements
We use Figure-Eight (https://www.figure-eight.com), which is a crowdsourcing platform, to collect human judgements. Each question was annotated by five annotators. We paid crowdworkers 5 cents per judgment.
Below are the exact wordings of the questions we asked the annotators with the numeric scores corre- sponding to each option: Relevance: We ask ”Is the question on topic” and let workers choose from: 89 1: Yes 0: No Grammaticality: We ask ”Is the question grammatical?”, and let workers choose from: 1: Yes 0: No Seeking new information: We ask “Does the question ask for new information currently not included in the description?” and let workers choose from: 1: Yes 0: No Specificity: We ask ”How specific is the question?” and let workers choose from: 4: Specific pretty much only to this product (or same product from different man- ufacturer) 3: Specific to this and other very similar products 2: Generic enough to be applicable to many other products of this type 1: Generic enough to be applicable to any product under Home and Kitchen N/A (Not applicable): Question is not on topic OR is incomprehensible Usefulness: We ask “How useful is the question to a potential buyer (or a current user) of the product?” and let workers choose from: 4: Useful enough to be included in the product description 3: Useful to a large number of potential buyers (or current users) 2: Useful to a small number of potential buyers (or current users) 1: Useful only to the person asking the question 90 Criteria Agreement Relevance 0.92 Grammaticality 0.92 Seeking new information 0.84 Usefulness 0.65 Specificity 0.72 Table 5.3: Inter-annotator agreement on the five criteria used in human-based eval- uation. N/A (Not applicable): Question is not on topic OR is incomprehensible OR is not seeking new information Since the inter-annotator agreement on the usefulness criteria was low (refer to Table 5.3), in order to reduce the subjectivity involved in the fine grained annotation, we convert the range [1-4] to a more coarse binary range [0-1] by mapping the scores 4 and 3 to 1 and the scores 2 and 1 to 0. The inter annotator agreement on each of the above five criteria is shown in Table 5.3. Agreement on Relevance, Grammaticality and Seeking new information is high. This is not surprising given that these criteria are not very subjective. On the other hand, the agreement on usefulness and specificity is quite moderate since these judgments can be very subjective. 91 Amazon StackExchange Model Diversity Bleu Meteor Diversity Bleu Meteor Reference 0.6934 — — 0.7509 — — Lucene 0.6289 4.26 10.85 0.7453 1.63 7.96 MLE 0.1059 17.02 12.72 0.2183 3.49 8.49 Max-Utility 0.1214 16.77 12.69 0.2508 3.89 8.79 GAN-Utility 0.1296 15.20 12.82 0.2256 4.26 8.99 Table 5.4: Diversity as measured by the proportion of unique trigrams in model outputs. Bleu and Meteor scores using up to 10 references for the Amazon dataset and up to six references for the StackExchange dataset. Numbers in bold are the highest among the models. All results for Amazon are on the entire test set whereas for StackExchange they are on the 500 instances of the test set that have multiple references. 5.3.4 Automatic Metric Results Table 5.4 shows the results on the two datasets when evaluated according to automatic metrics. In the Amazon dataset, GAN-Utility outperforms both MLE and Max-Utility models on Diversity, suggesting that it produces more diverse outputs. Lucene, on the other hand, has the highest Diversity since it consists of human generated questions, which tend to be more diverse because they are much longer compared to 92 model generated questions. This comes at the cost of lower match with the reference as visible in the Bleu and Meteor scores. 
In terms of Bleu and Meteor, there is inconsistency. Although GAN-Utility outperforms all baselines according to Meteor, the fully ablated MLE model has a higher Bleu score. This is because Bleu score looks for exact n-gram matches and since MLE produces more generic outputs, it is much more likely that it will match one of 10 references compared to the specific/diverse outputs of GAN-Utility, since one of those ten is highly likely to itself be generic. In the StackExchange dataset GAN-Utility outperforms both MLE and Max- Utility models on both Bleu and Meteor. Unlike in the Amazon dataset, MLE does not outperform GAN-Utility in Bleu. This is because the MLE outputs in this dataset are not as generic as in the Amazon dataset due to the highly technical nature of contexts in StackExchange. As in the Amazon dataset, GAN-Utility outperforms MLE on Diversity. Interestingly, the Max-Utility ablation achieves a higher Diversity score than GAN-Utility. On manual analysis we find that Max- Utility produces longer outputs compared to GAN-Utility but at the cost of being less grammatical. 5.3.5 Human Judgements Analysis Table 5.5 shows the numeric results of human-based evaluation performed on the reference and the system outputs on 300 random samples from the test set 93 Model Relevant [0-1] Grammatical [0-1] New Info [0-1] Useful [0-1] Specific [0-4] Reference 0.96 0.99 0.93 0.72 3.38 Lucene 0.90 0.99 0.95 0.68 2.87 MLE 0.92 0.96 0.85 0.91 3.05 Max-Utility 0.93 0.96 0.88 0.91 3.29 GAN-Utility 0.94 0.96 0.87 0.96 3.52 Table 5.5: Results of human judgments on model generated questions on 300 sam- ple Home & Kitchen product descriptions. The options described in § 5.3.3 are converted to corresponding numeric range (see supplementary material). The dif- ference between the bold and the non-bold numbers is statistically significant with p <0.05. Reference is excluded in the significance calculation. of the Amazon dataset.3 Overall, these results show that the GAN-Utility model successfully generates the most useful and the most specific questions while being equally good at seeking new information. All approaches produce relevant and grammatical questions. All models are all equally good at seeking new information, but are weaker than Lucene, which performs better at seeking new information but at the cost of much lower specificity and lower usefulness. Our full model, GAN-Utility, performs significantly better at the usefulness criteria showing that the adversarial training approach generates more useful ques- 3We could not ask crowdworkers evaluate the StackExchange data due to its highly technical nature. 94 tions. Interestingly, all our models produce questions that are more useful than Lucene and Reference, largely because Lucene and Reference tend to ask questions that are more often useful only to the person asking the question, making them less useful for potential other buyers (see Figure 5.3). GAN-Utility also performs significantly better at generating questions that are more specific to the product (see details in Figure 5.2), which aligns with the higher Diversity score obtained by GAN-Utility under automatic metric evaluation. Table 5.6 contains example outputs from different models along with their use- fulness and specificity scores. MLE generates questions such as “is it waterproof?” and “what is the wattage?”, which are applicable to many other products. Whereas our GAN-Utility model generates more specific question such as “is this shower curtain mildew resistant?”. 
We provide further analysis of system outputs on both Amazon and Stack Exchange datasets in the next section. 5.3.6 Analysis of System Outputs on Amazon Dataset Table 5.7 shows the system generated questions for three product descriptions in the Amazon dataset. In the first example, the product is a shower curtain. The Reference question is specific and highly useful. Lucene, on the other hand, picks a moderately specific (“how to clean it?”) but useful question. MLE model generates a generic but useful “is it waterproof?”. Max-Utility generates comparatively a much longer question but in doing so loses out on relevance. This behavior of generating two unrelated 95 Figure 5.2: Results of human judgements on the specificity criteria. sentences is observed quite a few times in both Max-Utility and GAN-Utility models. This suggests that these models, in trying to be very specific, end up losing out on relevance. In the same example, GAN-Utility also generates a fairly long question which, although awkwardly phrase, is quite specific and useful. In the second example, the product is a Duvet Cover Set. Both Reference and Lucene questions here are examples of questions that are pretty much useful only to the person asking the question. We find many such questions in both Reference and Lucene outputs which is the main reason for the comparatively lower usefulness scores for their outputs. All three of our models generate irrelevant questions since 96 Figure 5.3: Results of human judgements on the usefulness criteria. the product description explicitly says that the set is full size. In the last example, the product is a set of mopping clothes. Reference ques- tion is quite specific but has low usefulness. Lucene picks an irrelevant question. MLE and Max-Utility generate highly specific and useful questions. GAN-Utility generates an ungrammatical question by repeating the last word many times. We observe this behavior quite a few times in the outputs of both Max-Utility and GAN-Utility models suggesting that our sequence-to-sequence models are not very good at maintaining long range dependencies. 97 Title Raining Cats and Dogs Vinyl Bathroom Shower Curtain Product This adorable shower curtain measures Description 70x72 inches and would make a great gift! Useful [1-4] Specific [1-4] Reference does the vinyl smells? 3 4 Lucene other than home sweet home , 2 4 what other sayings on the shower curtain ? MLE is it waterproof ? 4 2 Max-Utility is this shower curtain mildew ? N/A N/A GAN-Utility is this shower curtain mildew resistant ? 4 4 Title PURSONIC HF200 Pedestal Bladeless Fan & Humidifier All-in-one Product The first bladeless fan to incoporate a humidifier! , Description This product operates solely as a fan, a humidifier or both simultaneously. 5.5L tank lasts up to 12 hours. Useful [1-4] Specific [1-4] Reference i can not get the humidifier to work 1 2 Lucene does it come with the vent kit 3 3 MLE what is the wattage of this fan ? 4 2 Max-Utility is this battery operated ? 3 2 GAN-Utility does this fan have an automatic shut off ? 4 4 Table 5.6: Example outputs from each of the systems for two product descriptions along with the usefulness and the specificity score given by human annotators. De- scriptions of scores are in the supplementary material. 5.3.7 Analysis of System Outputs on Stack Exchange Dataset Table 5.8 includes system outputs for three posts from the Stack Exchange dataset. 
The first example is of a post where someone describes their issue of not being able to recover from their boot. Reference and Lucene questions are useful. MLE generates a generic question that is not very useful. Max-Utility generates a useful question but has slight ungrammaticality in it. GAN-Utility, on the other hand, generates a specific and an useful question. 98 In the second example, again Reference and Lucene questions are useful. MLE generates a generic question. Max-Utility and GAN-Utility both generate fairly specific question but contain unknown tokens. The Stack Exchange dataset contains several technical terms leading to a long tail in the vocabulary. Owing to this, we find that both Max-Utility and GAN-Utility models generate many instances of questions with unknown tokens. In the third example, the Reference question is very generic. Lucene asks a relevant question. MLE again generates a generic question. Both Max-Utility and GAN-Utility generate specific and relevant questions. 5.4 Conclusion In this chapter, we describe a novel approach to the problem of clarification question generation. Given a context, we use the observation from the previous chapter that the usefulness of a clarification question can be measured by the value of updating the context with an answer to the question. We use a sequence-to- sequence model to generate a question given a context and a second sequence-to- sequence model to generate an answer given the context and the question. Given the (context, predicted question, predicted answer) triple we calculator the utility of this triple and use it as a reward to retrain the question generator using reinforcement learning based Mixer model. Further, to improve upon the utility function, we reinterpret it as a discriminator in an adversarial setting and train both the utility function and the Mixer model in a minimax fashion. We find that our adversar- 99 ial training approach produces more diverse questions compared to both a model trained using maximum likelihood objective and a model trained using utility re- ward based reinforcement learning. A model that can generate diverse questions is useful since a model that only generates generic questions can be very repetitive and might not help elicit useful information about a given context. Motivated by this idea, in the next chapter, we look into how can we guide our clarification question generation model to generate more specific questions. 100 Title Chezmoi Collection 7-piece Chic Ruched Duvet Cover Set, Full Size (with Pillows) Product Create a world of romance with the elegant, Description and luxurious all white duvet cover set. Useful [1-4] Specific [1-4] Reference How long will it take to ship 1 1 this to my pin code? Lucene and can you use the duvet as is ? if not what 1 4 shall i purchase to put in it for winter or fall ? MLE what are the dimensions of the king size ? N/A N/A Max-Utility what are the dimensions of the king size ? N/A N/A GAN-Utility does the king size come with a duvet cover N/A N/A or do you have to buy a king duvet ? Title Microfiber 3-Pack, Pro-Clean Mopping Cloths for Braava Floor Mopping Robot Product Braavas textured Pro-Clean microfiber Description mopping cloths remove dirt and hair from your floors. The cloths can be washed and used hundreds of times They are compatible with all Braava models, including the Pro-Clean Reservoir Pad. Each cloth is easy to attach and remove from the magnetic cleaning pad. 
Useful [1-4] Specific [1-4] Reference do i have to use a new cloth every 2 4 time i want to clean my floor? $5/$6 seems expensive per clean Lucene do they remove pet odor ? N/A N/A MLE will these work with the scooba ? 3 3 Max-Utility do these cloths work on hardwood floors ? 3 4 GAN-Utility will this work with the scooba mop N/A N/A mop mop mop mop mop mop Table 5.7: Example outputs from each of the systems for three product descriptions from the Home & Kitchen category of the Amazon dataset. 101 Title how can i recover my boot ? Post since last week i am trying to recover my boot . after the last update for ubuntu 12.04 i lost it and nobody could help me , i used boot-repair but there are problems with dependences , which ca n’t be fix . i will be very grateful if somebody could help me . Reference what happens when you try to boot ? Lucene can you get into bios ? MLE how much ram do you have ? Max-Utility do you have a swap partition partition ? GAN-Utility what happens when you try to boot into safe mode ? Title packages have unmet dependencies when trying to install anything Post i ’m running ubuntu 14.04 lts . ive recently run into this problem when updating or installing , after trying a few solutions to no avail , but now i ’m having the same issue with steam trying to update , which i use quite a lot . ive looked through dozens of posts about similar issues and tried a lot of solutions and nothing seems to work. Reference sudo dpkg -reconfigure all ? ? Lucene if you use the graphical package manager , does n’t add the required packages automatically ? MLE how long did you wait ? Max-Utility can you post the output of ‘apt-cache policy UNK ? GAN-Utility can you post a screenshot of the output of ‘sudo apt-get install UNK Title full lubuntu installation on usb ( uefi capable ) Post i want to do a full lubuntu installation on a usb stick that can be booted in uefi mode. i do not want persistent live usb but a full lubuntu installation ( which happens to live on a usb stick ) and that can boot fromanyuefi-capable computer ... Reference hello and welcome on askubuntu . could you please clarify what you want ? Lucene so , ubuntu was installed to the pen drive ? MLE which version of ubuntu ? Max-Utility do you have a live cd or usb stick ? GAN-Utility what is the model of the usb stick ? Table 5.8: Example outputs from each of the systems for three posts of the Stack Exchange dataset. 102 Chapter 6: Specificity-Controlled Question Generation Model 6.1 Introduction In the last chapter, we saw how we can train a sequence-to-sequence neural network model to generate a useful question given an under-specified context. We used answer-based adversarial training strategy to train the sequence-to-sequence model. One of our key findings was that an adversarially trained model generates questions that are more specific to the context compared to a model trained using the traditional maximum-likelihood training objective. Generating questions with a desired level of specificity can be useful in many scenarios. For instance, consider an automated agent assisting a human in a technical issue through a dialogue. At the start of the conversation, we would want the automated agent to ask the human more generic questions in order to understand the general domain of the problem. Whereas, at a later stage of the conversation, we would want the agent to ask more specific questions to narrow down the problem. 
In the e-retail scenario considered in this dissertation, if the given description belongs to a product which is similar to several other products that currently exist in the dataset, then we might want our automated system to generate more specific questions (since we could easily generate generic questions for this product by retrieving the top-K 103 frequently asked questions in the dataset, for instance). On the other hand, if the given product belongs to a fairly new category, then we might want our system to generate more generic questions. In this chapter, therefore, we propose to build a model that given a context and a level of specificity (specific or generic), generates a question with that level of specificity. For instance, in Figure 6.1, given a product description (context) and a level of specificity as “”, our goal is to generate a question such as “Where was this manufactured?” which is applicable to many products on amazon.com. Whereas, given the same product description and the level of specificity as “”, we would like to generate a question that is more specific to the given product such as “Is this induction safe?” Figure 6.1: Sample product description from amazon.com paired with a generic and a specific clarification question. We take a semi-supervised approach to our problem of generating specificity controlled questions. Motivated by Sennrich et al. (2016), we build a question gen- eration model that incorporates the level of specificity as additional input signal 104 during training1. In our work, we hypothesize that at training time if we append the context (source) with the level of specificity of the question (target), then the model will learn how to generate questions that at a given level of specificity. In Figure 6.2, the question generation model is trained using context appended with specificity as input and question as the output. In order to do this training, we would need to label all the questions in our training data with their level of speci- ficity i.e. generic vs specific. Doing this labeling manually for the entire training dataset of approximately 150K questions would be too expensive. Hence, we train a supervised model that automatically labels a question (given a context) with its level of specificity to the given context. Figure 6.2 shows our specificity classifier trained using a relatively small set of questions manually annotated with their level of specificity. Our specificity classifier is inspired by the model introduced by Louis and Nenkova (2011) who train a binary classifier to automatically identify generic vs specific sentences in news articles. Their classifier is based on features that capture lexical and syntactic information, as well as specificity and word polarity. They use human annotators to manually annotate a set of sentences with generic/specific labels and train a binary classifier using a logistic regression model. Following their work, we use crowdsourcing to annotate a set of 3000 questions from the Amazon dataset with their level of specificity to the product description. We use this annotated data to train a binary classifier to predict the level of specificity of a question, given a context. We use some of the features introduced by Louis and 1Sennrich et al. (2016) refer to this as side constraints. 105 Figure 6.2: Specificity-controlled question generation model. Nenkova (2011) and introduce new features that are indicative of the specificity level of the question to train our binary classifier. 
We use our specificity classifier to append the context with the level of speci- ficity of the target question. We finally retrain the question generation model de- scribed in the previous chapter with the modified context. At test time, given a context appended with a level of specificity (generic or specific), our model gener- ates a clarification question at that level of specificity. 106 6.2 Related Work We consider specificity as a dimension of style. Sociolinguistics defines style as a set of linguistic variants with specific social meanings. Hovy (1987) argues that by varying the style of a text, people convey more information than is present in the literal meaning of the words. In order to build automated intelligent agents that can effectively communicate with humans, it is important that we teach these agents to recognize the various stylistic variations in human language and also teach them to generate language in a particular given style. In the field of natural language processing, there has been previous work on both identifying style and generating text in a given style. Under style identification, there has been work on detecting formality of a given text at the lexical level (Brooke and Hirst, 2014; Brooke et al., 2010; Lahiri et al., 2011; Pavlick and Nenkova, 2015), at the sentence level (Pavlick and Tetreault, 2016) and at the document level (Mosquera and Moreda, 2012; Peterson et al., 2011; Sheikha and Inkpen, 2010). Markowitz and Hancock (2016) studied writing styles in fraudulent papers whereas Feng et al. (2012) build models for deception detection. Koppel et al. (2002, 2009, 2011) develop machine learning models for authorship identification, where the style corresponds to the writing style of an author. Previous work most relevant to us is the work around detecting generic/specific distinctions of text. Reiter and Frank (2010) introduce a method for distinguishing between noun phrases that describes class of individuals (generic) versus those that refer to specific individuals. Mathew (2009) distinguish between sentences that relate to 107 specific event versus those that relate to general facts. Louis and Nenkova (2011) build a model to automatically identify general and specific sentences motivated by potential applications in summarization and writing feedback. Generating style-controlled text has been studied in three different settings be- fore: supervised learning, semi-supervised setting and unsupervised setting. Under supervised setting, Xu et al. (2012) develop a statistical machine translation based model for paraphrasing sentences into Shakespearean English whereas Jhamtani et al. (2017) develop a neural machine translation based model for the same task. Recently, we (Rao and Tetreault, 2018) developed models for automatically rewrit- ing sentences from informal to formal style and vice-versa. Under semi-supervised setting, Sennrich et al. (2016) develop models to control politeness of the generated text using side constraints where the source is appended with an artificial token de- noting the style in which we want the model to generate its target. Yamagishi et al. (2016) use a similar idea for controlling the voice of the generated text. Niu et al. (2017, 2018) control formality during translation. Under unsupervised setting, Hu et al. (2017) control the sentiment and the tense of the generated text by learning a disentangled latent representation in a neural generative model. 
Ficler and Goldberg (2017) control several linguistic style aspects simultaneously by conditioning a recurrent neural network language model on specific style (professional, personal, length) and content (theme, sentiment) parameters.

6.3 Annotating Questions with Specificity Level

The key idea behind the use of side constraints is to guide a model to generate text constrained by a certain linguistic phenomenon by training it on sentences that have been annotated with such constraints. In our scenario, the constraint is the level of specificity. More specifically, our input is the context and the output is the question as per the specified level of specificity. Hence, while training this model we need to append the source, i.e. the context, with the level of specificity of the target, i.e. the question. Given that our neural network based question generation model requires huge amounts of training data, annotating the entire training data (around 100K questions) with the level of specificity manually would be too time consuming and costly. Therefore, we take a machine learning approach to this problem where we annotate a subset of the training data using humans and train a machine learning model on this annotated data which learns to predict the level of specificity of a question given the context. In this section, we describe how we collect human annotations on the subset of the training data.

6.3.1 Annotation Design

We define our annotation task as: given a context and a clarification question, annotate whether the question is generic or specific to the given context. One obvious way to do this task would be to show the annotators the context and the question and ask them to choose between generic or specific. However, we found that doing this annotation task for a question (given a context) without knowing the other questions asked about that context is really hard and unintuitive. We found that an easier task would be to compare the level of specificity of two questions given a context. For instance, given the context in Figure 1.2, annotating the level of specificity of the question “Are they ok for induction stove?” in isolation is difficult. However, comparing the specificity level of this question with, say, another question “Where are they made?” is easier: we can say that the former question is more specific than the latter since the latter is applicable to a larger set of products. Hence, we design an annotation scheme where, given a context and two questions Question A and Question B, we ask annotators to compare the level of specificity of the two questions by choosing from the following options:

1. Question A is more specific
2. Question B is more specific
3. Both questions are at the same level of specificity

We use Figure-Eight to collect these annotations. Each pair of questions is annotated by five annotators. (We started with three annotators per pair of questions but obtained a low inter-annotator agreement and hence moved up to five annotators.)

6.3.2 Getting Specificity Levels from Annotations

The next step is to convert these comparisons into individual generic/specific labels for the questions. Given a context and the N questions asked about that context, we collect annotations such that each question is compared to K other questions in the set N. Each question pair (q_i, q_j) is annotated by five annotators.
The platform we use to collect annotations assigns a trust value to each of its annotators based on the number of annotations performed by the annotator and how well the annotator performed on the test questions. This trust value is between 0 and 1 and is similar to an inter-annotator agreement score. We calculate the specificity score for each question as

specificity_score(q_i) = (1/K) ∑_{j=1}^{K} (1/5) ∑_{a=1}^{5} t_a · d_a(i, j)

where t_a is the trust of the annotator a who annotated the question pair (q_i, q_j), and

d_a(i, j) = 1 if annotator a annotated q_i as more specific than q_j,
d_a(i, j) = −1 if annotator a annotated q_j as more specific than q_i,
d_a(i, j) = 0 if annotator a annotated q_i as being at the same level of specificity as q_j.

The specificity score calculated as above is a value between -1 and 1. Given this value, we set a threshold S: when the score for a question is less than S, we label it as generic, whereas when it is greater than or equal to S, we label it as specific. We set a global threshold of S = 0 for all contexts.

If we collect annotations such that each question is compared to every other question in the set N, then we could get a more accurate specificity score for a question. However, given that N can be as high as 10, collecting N(N−1)/2 annotations per context could be expensive. We, therefore, collect annotations such that each question is compared to two other questions in the set N. To ensure that this method is reliable, for 25 of the contexts, we collect annotations such that each question is compared to every other question in the set N. On this subset, we calculate the specificity scores of the questions using (N − 1) comparisons per question (S_all-comparisons) and the specificity scores of the questions using two comparisons per question (S_two-comparisons). In order to understand how much the specificity scores vary when they are calculated using these two different methods, we calculate the accuracy of the S_two-comparisons scores against the S_all-comparisons scores. We get an accuracy of 0.89, suggesting that, although the scores calculated using two comparisons can be noisy, they do not deviate too much from those obtained using all comparisons.
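The aggregation above can be implemented in a few lines. The sketch below is a hedged illustration, not the dissertation's code: it assumes the pairwise judgments are available as (i, j, trust, label) tuples and that a comparison of (q_i, q_j) also contributes, with flipped sign, to the score of q_j; the function name and data layout are invented for the example.

```python
# A sketch of the trust-weighted specificity score, under an assumed data layout:
# each annotation is (i, j, trust, label) with label in {"i", "j", "same"}
# indicating which of the two questions the annotator judged more specific.

from collections import defaultdict

def specificity_labels(annotations, threshold=0.0):
    """Aggregate pairwise judgments into per-question scores in [-1, 1],
    then threshold them into generic/specific labels (S = 0 by default)."""
    pair_votes = defaultdict(list)           # (i, j) -> trust-weighted votes
    for i, j, trust, label in annotations:
        d = {"i": 1.0, "j": -1.0, "same": 0.0}[label]
        pair_votes[(i, j)].append(trust * d)

    per_question = defaultdict(list)         # question -> per-pair averages
    for (i, j), votes in pair_votes.items():
        avg = sum(votes) / len(votes)         # average over the five annotators
        per_question[i].append(avg)           # q_i compared against q_j
        per_question[j].append(-avg)          # the same comparison, seen from q_j

    scores = {q: sum(v) / len(v) for q, v in per_question.items()}
    return {q: ("specific" if s >= threshold else "generic")
            for q, s in scores.items()}
```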
6.4 Model for Automatically Predicting Specificity Level

Given the specific/generic annotations on a subset of our training data, our next step is to train a machine learning model that can learn to predict the specificity level given a context and a question. Louis and Nenkova (2011) introduce a supervised classifier for automatically predicting whether a sentence in a summary is generic or specific. They define specificity as the level of detail present in a given sentence. The definition of specificity in our setting is how specific the question is to the given context. Their classifier is based on lexical and syntactic features. We use some of the features described in their work and introduce some new features relevant to our setting to create a similar classifier that predicts the level of specificity of a question given its context. The features used in our model are described below:

Question Length. Generic questions tend to be shorter in length compared to specific questions. For instance, “What are the dimensions?” and “What is the size of the pillow?” are shorter in length compared to questions like “Does this pillow have a zipper or does it come with a cover?”. We count the number of words in the question and use the count as a feature. Additionally, we use a part-of-speech tagger to tag the words in the question, count the number of nouns in the question, and use that count as a feature. These two features were used by Louis and Nenkova (2011) in their model as well.

Path in WordNet. Questions that are more specific to a context tend to have more specific words. Motivated by this idea, we compute the length of the path of every noun and verb in a question to the root of the WordNet (Miller et al., 1990) tree through hypernym relations. Longer paths indicate that the words are more specific. Similar to Louis and Nenkova (2011), we use the average, min and max values of these lengths as features.

Inverse Document Frequency. Another way to identify specific words is to calculate their inverse document frequency (IDF). The IDF of a given term is defined by the inverse of the number of documents that contain that term; more formally, IDF(w) = log(1 / count of documents containing w). In our setting, we consider a product description to be a document. So the IDF of a word in a given question is defined by the inverse of the count of product descriptions that contain that word. We calculate the IDF for every word in the question and include the maximum IDF, the minimum IDF and the average IDF as features. This feature is similar to the one used by Louis and Nenkova (2011) except that instead of calculating the document frequency over New York Times articles, we calculate the document frequency over product descriptions.

Syntax. Similar to Louis and Nenkova (2011), we find that the use of nouns, adjectives and cardinals are good indicators of specificity. For instance, more specific questions tend to use more proper nouns, adjectives and cardinals (numbers). We use a part-of-speech tagger to tag the words in the questions and include the counts of proper nouns (NNP), adjectives (ADJ) and cardinals (CD) as features.

Polarity. Louis and Nenkova (2011) find that word polarity can be a strong indicator of the level of specificity. For instance, strong opinions are indicative of generic sentences. To identify positive, negative and polar words, they use the General Inquirer and the MPQA Subjectivity lexicons. We find that these two lexicons, which mainly contain words frequently appearing in news articles, are less relevant for us due to the different nature of our dataset. Hence, we use the Linguistic Inquiry and Word Count (LIWC) lexicon (Pennebaker et al., 2001) instead (http://lit.eecs.umich.edu/~geoliwc/LIWC_Dictionary.htm). We use the dictionary category of words in the question as features. Specifically, we consider the following categories under cognitive processes: insight, causation, discrepancy, tentative, certainty, differentiation. For each of these categories, we count the number of words in the question that belong to that category and include that count as a feature.

Question bag-of-words. We define a vector of the size of the vocabulary over the words in all the questions of our train set. Given a question, we set all the word positions that are included in the question to one in the vector and set the remaining positions to zero. We include this vector as a feature. This is similar to the “lexical (words)” features used by Louis and Nenkova (2011).
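As a rough illustration of two of these feature groups, the sketch below computes IDF statistics over product descriptions and WordNet hypernym-path depths with NLTK. It is a sketch under assumptions, not our implementation: the variable names and helpers are invented for the example, and it uses the common log(N/df) IDF variant, which differs from the definition above only by the additive constant log N.

```python
# Hedged sketch of the IDF and WordNet-path features, using NLTK.
# Assumes `descriptions` is a list of token *sets*, one per product description.

import math
import nltk
from nltk.corpus import wordnet as wn

def idf_features(question_tokens, descriptions):
    """Max/min/avg IDF of the question words, with each product description
    treated as one document. Uses the common log(N/df) variant of IDF."""
    n_docs = len(descriptions)
    idfs = []
    for w in set(question_tokens):
        df = sum(1 for d in descriptions if w in d)
        if df > 0:
            idfs.append(math.log(n_docs / df))
    if not idfs:
        return (0.0, 0.0, 0.0)
    return (max(idfs), min(idfs), sum(idfs) / len(idfs))

def wordnet_path_features(question):
    """Max/min/avg hypernym-path length to the WordNet root for every noun
    and verb in the question; longer paths suggest more specific words."""
    depths = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(question)):
        if tag.startswith("NN") or tag.startswith("VB"):
            pos = wn.NOUN if tag.startswith("NN") else wn.VERB
            synsets = wn.synsets(word, pos=pos)
            if synsets:
                # shortest hypernym path of the first sense, as a simple proxy
                depths.append(min(len(p) for p in synsets[0].hypernym_paths()))
    if not depths:
        return (0.0, 0.0, 0.0)
    return (max(depths), min(depths), sum(depths) / len(depths))
```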
The features described above were adapted from Louis and Nenkova (2011). We now describe the new features we introduced specifically for our problem.

*Average word embeddings. We train GloVe (Pennington et al., 2014), a word embedding model, on all contexts and questions in our Amazon dataset. We compute an average over the word embeddings of all the words in the question (q̄) and include it as a feature. Likewise, we compute an average over the word embeddings of all the words in the context (c̄) and include it as a feature.

*Similarity to context using word embeddings. Louis and Nenkova (2011) define generic/specific based on the level of detail present in a sentence in isolation. In contrast, the specificity in our setting is measured by how specific the question is to the given context. Hence, we find the similarity between the question and the given context to be a useful indicator of specificity. We measure this similarity in two ways. In the first way, we measure the similarity between the context and the question in the vector semantic space. We compute an average over the word embeddings of all the words in the context (c̄). Similarly, we compute an average over the word embeddings of all the words in the question (q̄). We calculate the cosine similarity between c̄ and q̄ and use it as a feature.

*Similarity to context using WordNet. In the second way, we measure the similarity in the WordNet space. Resnik (1995) computes semantic similarity between word pairs by looking at the minimal path between the words in WordNet. Motivated by this idea, we look at the hypernym relation path of every word in the question and every word in the context and count the number of hypernyms that are common to the two paths. We do this for every word pair (wq, wc), where wq is a word in the question and wc is a word in the context, and use the aggregate count as a feature.

Given these features, we train a logistic regression model to make a binary prediction (-1: generic, 1: specific) given a context and a question. We use the Adam optimizer and an L2 regularizer.
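The sketch below illustrates the embedding-based similarity feature and one possible way to fit the classifier. It is only a sketch: `glove` is assumed to be a dictionary from words to vectors trained on the Amazon data, `featurize` is a hypothetical function stacking all the feature groups above, and scikit-learn's default solver is used for brevity even though our model is trained with Adam.

```python
# Hedged sketch of the embedding-similarity feature and classifier fitting.
# `glove` is assumed to map words to numpy vectors; `featurize` is hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

def avg_embedding(tokens, glove, dim=100):
    vecs = [glove[w] for w in tokens if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def context_similarity(question_tokens, context_tokens, glove):
    """Cosine similarity between the averaged question (q-bar) and
    context (c-bar) embeddings."""
    q_bar = avg_embedding(question_tokens, glove)
    c_bar = avg_embedding(context_tokens, glove)
    denom = np.linalg.norm(q_bar) * np.linalg.norm(c_bar)
    return float(q_bar @ c_bar / denom) if denom > 0 else 0.0

# Fitting the binary classifier on the annotated (context, question) pairs:
# X = np.stack([featurize(c, q, glove) for c, q in annotated_pairs])
# y = np.array(labels)                    # 1: specific, -1: generic
# clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X, y)
```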
6.5 Specificity-Controlled Question Generation Model

We use the specificity classifier described in the previous section to label all the questions in the training (and tune) data with generic/specific labels. We use these labels to append each context with the tag “<specific>” when the question paired with the context is labeled as specific and with the tag “<generic>” when the question paired with the context is labeled as generic. We use this specificity-annotated training data to train two specificity-controlled question generation models:

Specificity-MLE: Similar to the MLE model in the previous chapter, we train a sequence-to-sequence learning model (Sutskever et al., 2014) on (context+specificity, question) pairs using the maximum likelihood objective (§5.2.1).

Specificity-GAN-Utility: This is the full question generation model described in the previous chapter, which we train using (context+specificity, question) pairs instead of (context, question) pairs. We first pretrain a question generator on (context+specificity, question) pairs and an answer generator model on (context+specificity+question, answer) pairs using the maximum likelihood objective. We then fine-tune the question generator model using Utility-function-based GAN training (§5.2.4), including the Utility discriminator and a Mixer question generator.

At test time, we predict the specificity level of the target question using our specificity classifier and append the tag corresponding to that label to the context.

Features                                   Train Accuracy   Test Accuracy
Question length                            0.55             0.55
Path in WordNet                            0.63             0.64
Inverse Document Frequency                 0.58             0.57
Syntax                                     0.71             0.70
Polarity                                   0.65             0.65
Question bag-of-words                      0.80             0.71
*Average word embeddings                   0.66             0.64
*Similarity to context using embeddings    0.58             0.59
*Similarity to context using WordNet       0.57             0.55
All features                               0.79             0.73

Table 6.1: Average specificity classifier accuracy under 10-fold cross validation on the train set and test set using different feature sets. * denotes new features not present in the model by Louis and Nenkova (2011).

6.6 Experimental Results

6.6.1 Specificity Classifier Results

We randomly select 500 contexts from our Amazon dataset and collect specificity annotations on the questions asked about those contexts. Given that each context has six questions on average, we collect annotations on a total of 3310 questions. 2034 questions were annotated as generic and the remaining were annotated as specific.

                            Generic                         Specific
Model                       Diversity  Bleu    Meteor      Diversity  Bleu   Meteor
Reference                   0.6071     —       —           0.7474     —      —
Lucene                      0.6289     2.90    12.04       0.6289     1.76   6.96
MLE                         0.1201     12.61   13.29       0.1201     1.41   5.06
Max-Utility                 0.1299     12.17   14.06       0.1299     1.79   5.57
GAN-Utility                 0.1304     12.01   14.35       0.1304     2.69   6.12
Specificity-MLE             0.1023     12.61   13.53       0.1640     4.45   7.85
Specificity-GAN-Utility     0.1012     12.84   14.18       0.1357     2.95   6.08

Table 6.2: Diversity as measured by the proportion of unique trigrams in model outputs. Bleu and Meteor scores are calculated using an average of 6 references under the generic setting and an average of 3 references under the specific setting. The highest number within a column is in bold (except for diversity under the generic setting, where the lowest number is in bold).

Table 6.1 shows the results of our specificity classifier. We evaluate using 10-fold cross validation on our labelled set of 3310 questions. We perform a feature ablation where we evaluate the performance of our model using each of the feature sets separately. Similar to Louis and Nenkova (2011), we find that syntax and polarity are strong indicators of specificity whereas question length is a comparatively weak indicator, even though intuitively we might think length to be a strong indicator since specific questions tend to be longer. Under the specificity features, we find the path in WordNet feature to be more useful than the Inverse Document Frequency feature. Similar to Louis and Nenkova (2011), we find the question bag-of-words feature to be the most useful. Among the newly introduced features, we find the average word embeddings feature to be more useful than the features that calculate the similarity of the question to the context. Our best model is the one that uses all the features and attains an accuracy of 0.73 on the test set. In comparison, a baseline model that predicts the specificity label at random gets an accuracy of 0.58 on the test set.

6.6.2 Question Generation Results

Table 6.2 compares the performance of our specificity-controlled question generation models to the question generation models described in the previous chapter. We aim to evaluate how good these models are at generating questions at a given level of specificity. In our Amazon dataset, each context is paired with up to 10 reference questions. We use our specificity classifier to identify generic reference questions and specific reference questions. We then use our evaluation metrics Bleu and Meteor to compare the model outputs to the generic references and the specific references separately. We call these the generic and specific settings respectively. In the case of the Lucene, MLE, Max-Utility and GAN-Utility models, the same model output is compared to the references in the two cases. Whereas in the case of the Specificity-MLE and Specificity-GAN-Utility models, under the generic setting, the generic references are compared to the model output when the context is appended with the “<generic>” token, whereas under the specific setting, the specific references are compared to the model output when the context is appended with the “<specific>” token.
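As a hedged illustration of how this evaluation can be set up, the sketch below computes the trigram-diversity measure and corpus-level Bleu against the generic and specific reference sets using NLTK; the function names and data layout are assumptions made for the example, and Meteor is omitted for brevity.

```python
# Sketch of the evaluation setup: trigram diversity of model outputs and
# corpus BLEU against generic vs. specific reference sets, using NLTK.
# The data layout (lists of token lists) is an assumption for this example.

from nltk.translate.bleu_score import corpus_bleu

def trigram_diversity(outputs):
    """Proportion of unique trigrams across all model outputs."""
    trigrams = [tuple(toks[i:i + 3])
                for toks in outputs for i in range(len(toks) - 2)]
    return len(set(trigrams)) / len(trigrams) if trigrams else 0.0

def evaluate_setting(outputs, references):
    """`outputs`: one token list per context; `references`: for each context,
    the reference questions (token lists) labeled with this setting."""
    bleu = corpus_bleu(references, outputs)   # (list of reference lists, hypotheses)
    return {"diversity": trigram_diversity(outputs), "bleu": bleu}

# e.g. evaluate_setting(model_outputs, generic_refs) and
#      evaluate_setting(model_outputs, specific_refs)
```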
Diversity is measured using the proportion of unique trigrams in the model output. Under the generic setting, we find that given a context appended with the “<generic>” token, the specificity-controlled models (Specificity-MLE & Specificity-GAN-Utility) generate questions that are at a lower diversity than the other models. Whereas, under the specific setting, we find that given a context appended with the “<specific>” token, these models generate questions with a higher diversity compared to the other models. This shows that our specificity-controlled models are capable of generating questions of varied diversity, and thus varied specificity.

Under the specific setting, we find that the Specificity-MLE model generates questions that get much higher Bleu and Meteor scores against the specific reference questions compared to the other models. Under the generic setting, however, we find that the specificity-controlled models generate questions that obtain Bleu and Meteor scores similar to the other models. This suggests that the specificity-controlled models tend to be closer to the specific reference questions than to the generic reference questions. Interestingly, unlike the results from the previous chapter, a maximum-likelihood (MLE) training objective seems to be more effective for training a specificity-controlled question generation model than the more sophisticated GAN-Utility training objective.

Table 6.3 shows two example product descriptions and the questions generated by the different models. As can be seen, the specificity-controlled models generate more specific and more generic questions compared to the other models.

6.7 Conclusion

In this chapter, we described our specificity-controlled question generation model which, given a context and a level of specificity, generates a question at that desired level of specificity. We train a specificity classifier which, given a context and a question, can predict the level of specificity of the question to the context with 73% accuracy. We use this specificity classifier to automatically label all the questions in the training data of the question generation model described in the previous chapter. Further, we use the specificity label as an additional signal during the training of the question generation model described in the previous chapter. We use automatic metric based evaluation to show that our specificity-controlled question generation model can generate questions that are more generic or more specific to the given context, depending on the given input specificity level, in comparison to other models.

Title: Signature sleep renewfoam infused memory foam and independently encased coil mattress , 8-inch
Product Description: Undecided between a coil mattress and a memory foam mattress ? Why not experience the best of both worlds with the signature sleep 8” renewfoam coil mattress.
The gel infused memory foam and coolmax; outer cover are perfectly paired to provide a fresh and cool sleeping surface, while the independently encased coils eliminate motion disturbance. With the signature sleep renewfoam coil mattress, always wake up feeling refreshed, rejuvenated and renewed. Reference do you need a separate box springs to go with this mattress ? Lucene how long does this matress last ? MLE what is the weight limit for this mattress ? Max-Utility what is the weight limit for this mattress ? GAN-Utility what are the dimensions of the mattress pad pad ? Spec-MLE (g) does it come with a cover ? Spec-MLE (s) does this mattress come with a box spring ? Spec-GAN-Utility (g) what is the warranty on this mattress ? Spec-GAN-Utility (s) what is the density of the mattress ? Title new cutting blade knife for kitchenaid mixer meat grinder; fga food chopper Product New sharp design cutting blade for the white fga kitchenaid meat grinder & Description food chopper. This knife is much improved from the original style cutter that came with the grinder attachment. You will see the improved difference when using a true cutting blade when grinding meat or vegetables. Stainless steel part with lifetime no rust guarantee from butcher-baker. Making sausage with our kitchenaid meat grinder ? We have the stainless steel stuffer tubes also. Need replacment meat grinder discs? We have them also. Add these parts to your order now for combined shipping discounts. Reference does this fit an older model kitchenaid mixer-grinder attachment fga model or not ? some reviewers are saying it does not fit ? Lucene can anyone confirm the dimensions of the square hole ? MLE will this fit the ? Max-Utility can this be used to grind almonds ? GAN-Utility does this blade fit the? Spec-MLE (g) does it come with a blade ? Spec-MLE (s) does this blade work with the kitchenaid professional model ? Spec-GAN-Utility (g) will this blade work with the weston model ? Spec-GAN-Utility (s) does this work well for a full size ? like a fine blade ? Table 6.3: Example outputs from each of the systems for a single product descrip- tion. g indicates generic token whereas s indicates specific token. 123 Chapter 7: Conclusion 7.1 Summary In this dissertation we identify the importance of teaching machines to ask clarification questions i.e. questions that point at missing information in a given text. We propose to take a machine learning approach to clarification question generation where a model is trained using large amounts of (context, question) pairs. In order to do this learning effectively, we create datasets for two scenarios: technical support (StackExchange) and e-retail (Amazon). We present two approaches to the problem of clarification question generation. In the first approach, we develop a model which given a context, extracts a set of potential candidate questions from a pool of existing questions and then ranks them in the order of their usefulness to the given context. We model the usefulness of a question using the idea of expected value of perfect information: a good question is one whose expected answer will be useful. We find that “answer” helps in identifying good clarification questions. In the second approach, we develop a model which given a context, generates questions from scratch instead of ranking existing question. We train a sequence-to-sequence neural network model using the recent idea of Generative Adversarial Network (GAN) to maximize an answer-based reward function. 
We show that our adversarially trained 124 model generates questions that are more specific to the given context. We further explore the notion of controlling the specificity of generated question by explicitly training a question generation model which given a context and a level of specificity (generic or specific), generates a question at that level of specificity. To label the large number of questions in our training data with the level of specificity, we train a binary classifier which given a context and a question, predicts whether the question is specific (to the context) or generic. We include the level of specificity as an additional signal during the training of our question generation model and find that our specificity-controlled question generation model can generate questions at a desired level of specificity. 7.2 Future Directions In this section, we discuss some potential future directions of research in the area of clarification question generation. 7.2.1 Using Multi-modal Context The question generation models proposed in this work only make use of textual context. However, often contexts include other modals of information as well. For instance, textual descriptions of products on amazon.com are paired with the image of the product. We can make use of the image to ask more relevant questions. Consider the description of a cookware set in Figure 7.1. A question generation model that uses only the textual context might generate the question “Does the 125 Figure 7.1: An example of product description on amazon.com paired with the image of the product. set include a ladle?” since the description does not contain the details of the items included in the cookware set. However, if the model were to use the image of the product as well, then it could find that the ladle is already included and hence would not generate such a redundant question. Thus, a potential future direction would be to use both textual and image contexts to train a question generation model. 7.2.2 Using External Knowledge Sources The models described in this work learn to ask a clarification question by looking at previously asked questions in a similar context. More specifically, we rely on our data to include the kinds of questions that we would like to ask. The main purpose of generating the clarification questions is to identify the missing information in a given text. To understand what is missing, one needs to first know what should have been there. As humans, we rely on our prior knowledge 126 Figure 7.2: Example of a question generation model that uses a knowledge base containing attributes of an operating system (or attributes of a toaster) to ask a relevant clarification question. about the subject to decide what should have been there but is missing and then ask a clarification question pointing out the missing information. Therefore, one potential extension to our work would be to automatically extract information from existing knowledge sources and makes use of it to generate a clarification question. For instance, in Figure 7.2, given a post related to Ubuntu operating system, if the model had access to a knowledge base that contained the information that operating systems differ by versions and bits, then the model could use that information to generate a question. 
Similarly, in the context of Amazon, if the model had access to a knowledge base containing various attributes of a product, then it could use that to understand what information is missing from the given description and ask a useful question. 127 7.2.3 Interactive Search Queries With the emergence of internet, vast amounts of data is stored online. We frequently use search engines to extract relevant information from this abundance of online data. However, we might often find ourselves sifting through the search results when our original search query is not specific enough. In such a scenario, it might have been useful if the search engine would have asked us a follow-up question. For instance, if a user query is “How long does it take to get a PhD?”, the search engine could ask the user “In which field?” because the answer would differ based on the field of study. Likewise, if a user query is “Historical gas prices”, the search engine could ask “Which region?” or “Which year?” because the prices would differ by region and year. Thus, a potential future direction of our work would be to train a clarification question generation model which given a search query can generate follow-up question(s) that can help narrow down the original query. 7.2.4 Question Asking in Writing Assistance Figure 7.3: An example of a writing assistance tool which given a content, identifies the missing information and asks a question about it. 128 In our day-to-day lives, we frequently use computers for writing documents, emails, etc. With the advancements of technologies, many of the text processing tools these days help us write better by pointing our spelling errors or other minor grammatical errors. However, these tools still are not at par with humans when it comes to suggesting content level changes. For example, consider a scenario where a student is writing their statement of purpose. A human reviewing this document might suggest changes such as possible addition of description of a project, addition of a missing reference to a related work, etc. Given the vast amounts of available data online, writing assistance tools might soon be able to suggest such informational changes. A first step towards this direction might be an email assistance tools that can point out missing information in your email. For instance, consider you have drafted an email such as the one in Figure 7.4. Since you have forgotten to mention the location of the meeting, Kathy might send you a follow-up email asking for the location. Such a follow-up email exchange could have been avoided if the email application could have suggested to include the location in the first place. 7.2.5 Towards Intelligent Dialogue Agents Asking questions is one of the key components of a conversation. Humans often ask questions to miss information gaps during a conversation. Therefore, in order for automated agents to be successful at conversations, it is important that we teach these agents to ask intelligent questions. Consider a scenario where I have asked a robot to get me my coffee mug from the kitchen (Figure 7.4). If there are multiple 129 Figure 7.4: An example conversation with a robot where the robot asks questions to resolve its uncertainty. mugs in the kitchen, an intelligent robot would ask a question such as “What color is your coffee mug?” to resolve this ambiguity. 
Further, if I reply by saying that the color of my mug is black, and if the robot finds multiple black mugs in the kitchen, it could ask a follow-up question to further resolve the ambiguity. Teaching robots to ask such useful questions would enable them to be more intelligent. 7.2.6 Question Asking to Help Build Reasoning Asking intelligent questions can also be used as a tool for enabling automated building of reasoning. For example, consider a robot is reading the passage shown in Figure 7.5. As it is reading this passage, assume it is building an understanding of the world. Suppose the robot asks a question such as “Why was Jill upset?” as it is building this reasoning. And a human answers the robot by saying “Because she did not win the race.”. This will help the robot understand that reaching the 130 Figure 7.5: An example scenario where a robot is reading a passage and asking questions to a human to build an understanding of the world. finish line leads to winning the race and not winning a race would make someone upset. The robot could then go ahead and update its understanding of the world using these reasonings. 7.2.7 Generalization Beyond Large Datasets In this dissertation, we have described methods for generating clarification questions that rely heavily on learning from large datasets. In future, we would want to be able to generate questions without going through the same substantial dataset-building process. One method for this would be to bootstrap the process by using template based approach (or humans) to initially generate some small set of questions. Then train our model on this small set to generate more questions. 131 And finally use these generated questions to further retrain our model. Second method of generalization would be using the idea of domain adaptation where we could use large amounts of existing out-of-domain data to train a model and then use small amounts of in-domain data to tune the model. Lastly, we could modify existing reading comprehension datasets to create clarification questions dataset by removing the answer sentence from the passage and then using the question as the clarification question and the passage as the context. 132 Appendix A: Crowdsourcing Annotation Details A.1 Question Ranking Task Evaluation In this section we describe the details of the process of collecting human judg- ments for the evaluation of the outputs of our question ranking model described in Chapter 4. We use Upwork1 for collecting our expert human judgments. Upwork is a platform which allows us to post a job description and recruit people specifically for a task. We show the following instructions to the annotator: Your task is to ask the right question. You will be shown a post to StackExchange that is incomplete: that is, in order to provide a useful solution to this post, the original poster needs to provide some additional information. In order to elicit that additional information from the original poster, you want to ask a question. You will be provided a list of ten possible questions that you can ask. You must provide two pieces of information: 1) Which of these questions is the single best one? If you could only ask one question, 1https://upwork.com 133 which one would you ask? 2) Which other questions would be valid to ask, even if not best. The interface will force you to choose a single best question by marking it with a radio button, and other valid questions with check boxes. Some of these are hard. Try your best to answer them. 
It took us 5-6 minutes per example, so please don’t rush. After every question you’ll be asked for your confidence in your selection of the ‘best’ question. For some of them you may just have to take an educated guess, for others you will be quite sure. Note: ‘Best’ by definition is also ‘valid’: so whatever you select as ‘best’ you should also mark as ‘valid’. 134 We show the Upwork annotators the following interface for performing the task of annotating the one “best” question and one or more “valid” questions, given a post from StackExchange dataset. 135 A.2 Question Generation Task Evaluation In this section we describe the details of human based evaluation process for evaluating outputs of the question generation models described in Chapter 5. Figure A.2 is overview of the task shown to the annotators. Figure A.3 is the set of instructions shown to the annotators. Figure A.4 is the set of rules and tips shown to the annotators. Figure A.5 shows two example annotations shown to the annotators. Figure A.6 shows the interface shown to the annotators. A.3 Specificity Labeling Task In this section we describe the details of the annotation task for labeling ques- tions with their specificity levels presented in Chapter 6. Figure A.7 shows the instructions shown to the annotators. Figure A.8 shows the rules and tips shown to the annotators. 136 Figure A.9 shows an example annotation shown to the annotators to guide them to do the task. 137 Figure A.10 shows the interface shown to the annotators. 138 Figure A.1: Example of the interface shown to annotators on UpWork for annotating “best” and “valid” questions, given a post. 139 Figure A.2: Task overview shown to annotators on Figure-Eight for the task of evaluating model generated questions. 140 Figure A.3: Instructions shown to annotators on Figure-Eight for the task of eval- uating model generated questions. 141 Figure A.4: Rules and tips shown to annotators on Figure-Eight for the task of evaluating model generated questions. 142 Figure A.5: Example annotations shown to annotators on Figure-Eight for the task of evaluating model generated questions. 143 Figure A.6: Interface shown to the annotators on Figure-Eight for the task of eval- uating model generated questions. 144 Figure A.7: Instructions shown to the annotators for the task of comparing the specificity of two questions asked about a product on amazon.com . Figure A.8: Rules and Tips shown to the annotators for the task of comparing the specificity of two questions asked about a product on amazon.com . 145 Figure A.9: Example shown to the annotators for the task of comparing the speci- ficity of two questions asked about a product on amazon.com . 146 Figure A.10: Interface shown to the annotators for the task of comparing the speci- ficity of two questions asked about a product on amazon.com . 147 Bibliography Husam Ali, Yllias Chali, and Sadid A Hasan. 2010. Automation of question genera- tion from sentences. In Proceedings of QG2010: The Third Workshop on Question Generation. pages 58–67. Jens Allwood. 2000. An activity based approach to pragmatics. Abduction, belief and context in dialogue: Studies in computational pragmatics pages 47–80. Jacob Andreas and Dan Klein. 2016. Reasoning about pragmatics with neural lis- teners and speakers. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing . pages 1173–1182. Yoav Artzi and Luke Zettlemoyer. 2011. Bootstrapping semantic parsers from con- versations. 
In Proceedings of the conference on empirical methods in natural lan- guage processing . Association for Computational Linguistics, pages 421–432. Muhammad Asaduzzaman, Ahmed Shah Mashiyat, Chanchal K Roy, and Kevin A Schneider. 2013. Answering questions about unanswered questions of stack over- flow. In Proceedings of the 10th Working Conference on Mining Software Reposi- tories . IEEE Press, pages 97–100. Mordecai Avriel and AC Williams. 1970. The value of information and stochastic programming. Operations Research 18(5):947–954. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations . Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt eval- uation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization. pages 65–72. Daniel G Bobrow, Ronald M Kaplan, Martin Kay, Donald A Norman, Henry Thompson, and Terry Winograd. 1977. Gus, a frame-driven dialog system. Arti- ficial intelligence 8(2):155–173. 148 Antoine Bordes and Jason Weston. 2016. Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683 . Julian Brooke and Graeme Hirst. 2014. Supervised ranking of co-occurrence pro- files for acquisition of continuous lexical attributes. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers . pages 2172–2183. Julian Brooke, Tong Wang, and Graeme Hirst. 2010. Automatic acquisition of lexical formality. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters . Association for Computational Linguistics, pages 90–98. Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, and Wei Wang. 2017. Ask the right questions: Active question reformulation with reinforcement learning. arXiv preprint arXiv:1705.07830 . Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pages 740–750. Wei Chen. 2009. Aist, g., mostow, j.: Generating questions automatically from informational text. In Proceedings of the 2nd Workshop on Question Generation (AIED 2009). pages 17–24. Chenhui Chu, Raj Dabre, and Sadao Kurohashi. 2017. An empirical comparison of domain adaptation methods for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). volume 2, pages 385–391. Herbert H Clark. 1981. Definite reference and mutual knowledge. Elements of discourse understanding pages 10–63. Herbert H Clark. 1996. Using language. 1996. Cambridge University Press: Cam- bridge 952:274–296. Herbert H Clark, Susan E Brennan, et al. 1991. Grounding in communication. Perspectives on socially shared cognition 13(1991):127–149. Herbert H Clark and Thomas B Carlson. 1982. Hearers and speech acts. Language pages 332–373. Anni Coden, Daniel Gruhl, Neal Lewis, and Pablo N Mendes. 2015. Did you mean a or b? supporting clarification dialog for entity disambiguation. In SumPre- HSWI@ ESWC . Xinya Du, Junru Shao, and Claire Cardie. 2017. Learning to ask: Neural question generation for reading comprehension. 
In Proceedings of the 55th Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1342–1352. 149 Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. 2017. Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing . pages 866–874. Philip Edmonds and Graeme Hirst. 2002. Near-synonymy and lexical choice. Com- putational linguistics 28(2):105–144. Ute Essen and Volker Steinbiss. 1992. Cooccurrence smoothing for stochastic lan- guage modeling. In icassp. IEEE, pages 161–164. Song Feng, Ritwik Banerjee, and Yejin Choi. 2012. Syntactic stylometry for de- ception detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2 . Association for Compu- tational Linguistics, pages 171–175. Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. Proceedings of the Workshop on Stylistic Variation, Empir- ical Methods in Natural Language Processing 2017 . Alejandro Figueroa and Günter Neumann. 2013. Learning to rank effective para- phrases from query logs for community question answering. In Association for the Advancement of Artificial Intelligence. volume 13, pages 1099–1105. Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th international conference on machine learning (ICML-11). pages 513–520. David Goddeau, Helen Meng, Joseph Polifroni, Stephanie Seneff, and Senis Busayapongchai. 1996. A form-based dialogue manager for spoken language appli- cations. In Spoken Language, 1996. ICSLP 96. Proceedings., Fourth International Conference on. IEEE, volume 2, pages 701–704. Dave Golland, Percy Liang, and Dan Klein. 2010. A game-theoretic approach to generating spatial descriptions. In Proceedings of the 2010 conference on empirical methods in natural language processing . Association for Computational Linguis- tics, pages 410–419. Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. Deep learning , volume 1. MIT press Cambridge. Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems . pages 2672–2680. Art Graesser, Vasile Rus, and Zhiqiang Cai. 2008. Question classification schemes. In Proc. of the Workshop on Question Generation. Arthur C Graesser, Natalie Person, and John Huber. 1992. Mechanisms that gen- erate questions. Questions and information systems pages 167–187. 150 Arthur C Graesser and Natalie K Person. 1994. Question asking during tutoring. American educational research journal 31(1):104–137. H Paul Grice. 1975. Logic and conversation. 1975 pages 41–58. Barbara J Grosz and Candace L Sidner. 1986. Attention, intentions, and the struc- ture of discourse. Computational linguistics 12(3):175–204. Michael Heilman. 2011. Automatic factual question generation from text . Ph.D. thesis, Carnegie Mellon University. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780. John J Hopfield. 1982. Neural networks and physical systems with emergent col- lective computational abilities. Proceedings of the national academy of sciences 79(8):2554–2558. Eduard Hovy. 1987. Generating natural language under pragmatic constraints. 
Jour- nal of Pragmatics 11(6):689–719. Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In International Conference on Ma- chine Learning . pages 1587–1596. Diana Inkpen and Graeme Hirst. 2006. Building and using a lexical knowledge base of near-synonym differences. Computational linguistics 32(2):223–262. Harsh Jhamtani, Varun Gangal, Eduard Hovy, and Eric Nyberg. 2017. Shake- spearizing modern language using copy-enriched sequence to sequence models. In Proceedings of the Workshop on Stylistic Variation. pages 10–19. Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero- shot translation. Transactions of the Association of Computational Linguistics 5(1):339–351. Jad Kabbara and Jackie Chi Kit Cheung. 2016. Stylistic transfer in natural lan- guage generation systems using recurrent neural networks. In Proceedings of the Workshop on Uphill Battles in Language Processing: Scaling Early Achievements to Robust Methods . pages 43–47. Tomoyuki Kajiwara and Mamoru Komachi. 2016. Building a monolingual parallel corpus for text simplification using sentence similarity based on alignment be- tween word embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers . pages 1147–1158. 151 Saidalavi Kalady, Ajeesh Elikkottil, and Rajarshi Das. 2010. Natural language question generation using syntax and keywords. In Proceedings of QG2010: The Third Workshop on Question Generation. questiongeneration. org, volume 2. Jaap Kamps, Maarten Marx, Robert J Mokken, Maarten De Rijke, et al. 2004. Using wordnet to measure semantic orientations of adjectives. In Language Resources and Evaluation Conference. Citeseer, volume 4, pages 1115–1118. Urvashi Khandelwal, He He, Peng Qi, and Dan Jurafsky. 2018. Sharp nearby, fuzzy far away: How neural language models use context. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1. Catherine KOBUS, Josep Crego, and Jean Senellart. 2017. Domain control for neural machine translation. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017 . pages 372–378. Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing 17(4):401–412. Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational meth- ods in authorship attribution. Journal of the American Society for information Science and Technology 60(1):9–26. Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2011. Authorship attribu- tion in the wild. Language Resources and Evaluation 45(1):83–94. Igor Labutov, Sumit Basu, and Lucy Vanderwende. 2015. Deep questions without deep understanding. In Proceedings of the 53rd Annual Meeting of the Associa- tion for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). volume 1, pages 889–898. Shibamouli Lahiri, Prasenjit Mitra, and Xiaofei Lu. 2011. Informality judgment at sentence level and experiments with formality score. In International Conference on Intelligent Text Processing and Computational Linguistics . Springer, pages 446–457. 
Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. 1990. Handwritten digit recognition with a back-propagation network. In Advances in neural information processing systems . pages 396–404. Oliver Lemon, Kallirroi Georgila, James Henderson, and Matthew Stuttle. 2006. An isu dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the talk in-car system. In Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations . Association for Computational Linguistics, pages 119–122. 152 Will Lewis, Christian Federmann, and Ying Xin. 2015. Applying cross-entropy dif- ference for selecting parallel training data from publicly available sources for con- versational machine translation. In International Workshop on Spoken Language Translation. Jiwei Li, Michel Galley, Chris Brockett, Georgios Spithourakis, Jianfeng Gao, and Bill Dolan. 2016a. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers). volume 1, pages 994–1003. Jiwei Li, Will Monroe, Alan Ritter, Dan Jurafsky, Michel Galley, and Jianfeng Gao. 2016b. Deep reinforcement learning for dialogue generation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing . pages 1192–1202. Yikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, Xiaogang Wang, and Ming Zhou. 2018. Visual question generation as dual task of visual question an- swering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 6116–6124. Ming Liu, Rafael A Calvo, and Vasile Rus. 2010. Automatic question generation for literature review writing support. In International Conference on Intelligent Tutoring Systems . Springer, pages 45–54. Annie Louis and Ani Nenkova. 2011. Automatic identification of general and specific sentences by leveraging discourse annotations. In Proceedings of 5th International Joint Conference on Natural Language Processing . pages 605–613. Ryan Lowe, Nissan Pow, Iulian V Serban, and Joelle Pineau. 2015. The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. In Special Interest Group on Discourse and Dialogue. Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective ap- proaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing . David M Markowitz and Jeffrey T Hancock. 2016. Linguistic obfuscation in fraud- ulent science. Journal of Language and Social Psychology 35(4):435–445. Thomas A Mathew. 2009. Supervised categorization of habitual versus episodic sen- tences . Ph.D. thesis, Georgetown University. Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval . ACM, pages 43–52. 153 Julian McAuley and Alex Yang. 2016. Addressing complex and subjective product- related queries with customer reviews. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, pages 625–635. Paul Michel and Graham Neubig. 2018. Extreme adaptation for personalized neural machine translation. In Association for Computational Linguistics . 
Tomas Mikolov. 2010. Recurrent neural network based language model. In Inter- speech. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems . pages 3111–3119. George A Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Kather- ine J Miller. 1990. Introduction to wordnet: An on-line lexical database. Inter- national journal of lexicography 3(4):235–244. Hideki Mima, Osamu Furuse, and Hitoshi Iida. 1997. Improving performance of transfer-driven machine translation with extra-linguistic informatioon from con- text, situation and environment. In International Joint Conferences on Artificial Intelligence (2). pages 983–989. Alejandro Mosquera and Paloma Moreda. 2012. Smile: An informality classification tool for helping to assess quality and credibility in web 2.0 texts. In Proceedings of the International Association for the Advancement of Artificial Intelligence Conference on Web and Social Media workshop: Real-Time Analysis and Mining of Social Streams (RAMSS). Nasrin Mostafazadeh, Chris Brockett, Bill Dolan, Michel Galley, Jianfeng Gao, Georgios Spithourakis, and Lucy Vanderwende. 2017. Image-grounded conver- sations: Multimodal context for natural question and response generation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers). volume 1, pages 462–472. Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. 2016. Generating natural questions about an image. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 1802–1813. Preslav Nakov, Doris Hoogeveen, Llúıs Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, and Karin Verspoor. 2017. Semeval-2017 task 3: Community question answering. In Proceedings of the 11th International Work- shop on Semantic Evaluation (SemEval-2017). pages 27–48. Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns 154 and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning . pages 280–290. Titas Nandi, Chris Biemann, Seid Muhie Yimam, Deepak Gupta, Sarah Kohail, Asif Ekbal, and Pushpak Bhattacharyya. 2017. Iit-uhh at semeval-2017 task 3: Exploring multiple features for community question answering and implicit dialogue identification. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). pages 90–97. Masato Neishi, Jin Sakuma, Satoshi Tohda, Shonosuke Ishiwatari, Naoki Yoshinaga, and Masashi Toyoda. 2017. A bag of useful tricks for practical neural machine translation: Embedding layer initialization and large batch size. In Proceedings of the 4th Workshop on Asian Translation (WAT2017). pages 99–109. Xing Niu and Marine Carpuat. 2016. The UMD Machine Translation Systems at IWSLT 2016: English-to-French Translation of Speech Transcripts. In Interna- tional Workshop on Spoken Language Translation. Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing . pages 2814–2819. Xing Niu, Sudha Rao, and Marine Carpuat. 2018. 
Multi-task neural models for translating between styles within and across languages. In Proceedings of the 27th International Conference on Computational Linguistics . pages 1008–1021. Andrew M Olney, Arthur C Graesser, and Natalie K Person. 2012. Question gener- ation from concept maps. Dialogue & Discourse 3(2):75–99. Naho Orita, Eliana Vornov, Naomi Feldman, and Hal Daumé III. 2015. Why dis- course affects speakers’ choice of referring expressions. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th In- ternational Joint Conference on Natural Language Processing (Volume 1: Long Papers). volume 1, pages 1639–1649. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics . Association for Computational Linguistics, pages 311–318. Ellie Pavlick and Ani Nenkova. 2015. Inducing lexical style properties for para- phrase and genre differentiation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Hu- man Language Technologies . pages 218–224. Ellie Pavlick and Joel Tetreault. 2016. An empirical analysis of formality in online communication. Transactions of the Association for Computational Linguistics 4:61–74. 155 James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates 71(2001):2001. Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Empirical Methods on Natural Language Processing . Kelly Peterson, Matt Hohensee, and Fei Xia. 2011. Email formality in the workplace: A case study on the enron corpus. In Proceedings of the Workshop on Languages in Social Media. Association for Computational Linguistics, pages 86–95. Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). volume 2, pages 529–535. Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 . Sudha Rao and Hal Daumé III. 2018. Learning to ask good questions: Ranking clar- ification questions using neural expected value of perfect information. In Proceed- ings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1. Sudha Rao and Joel Tetreault. 2018. Dear sir or madam, may i introduce the gyafc dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceed- ings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). volume 1, pages 129–140. Ehud Reiter, Somayajulu Sripada, Jim Hunter, Jin Yu, and Ian Davy. 2005. Choos- ing words in computer-generated weather forecasts. Artificial Intelligence 167(1- 2):137–169. Nils Reiter and Anette Frank. 2010. Identifying generic noun phrases. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics . Association for Computational Linguistics, pages 40–49. 
Steven J Rennie, Etienne Marcheret, Youssef Mroueh, Jerret Ross, and Vaibhava Goel. 2017. Self-critical sequence training for image captioning. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, IEEE, pages 1179–1195.
Philip Resnik. 1995. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 1, Morgan Kaufmann Publishers Inc., pages 448–453.
Philip Stuart Resnik. 1993. Selection and information: A class-based approach to lexical relationships. IRCS Technical Reports Series, page 200.
Salvatore Romeo, Giovanni Da San Martino, Alberto Barrón-Cedeño, Alessandro Moschitti, Yonatan Belinkov, Wei-Ning Hsu, Yu Zhang, Mitra Mohtarami, and James Glass. 2016. Neural attention for learning to rank questions in community question answering. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1734–1745.
Barak Rosenshine, Carla Meister, and Saul Chapman. 1996. Teaching students to generate questions: A review of the intervention studies. Review of Educational Research 66(2):181–221.
Vasile Rus, Paul Piwek, Svetlana Stoyanchev, Brendan Wyse, Mihai Lintean, and Cristian Moldovan. 2011. Question generation shared task and evaluation challenge: Status report. In Proceedings of the 13th European Workshop on Natural Language Generation, Association for Computational Linguistics, pages 318–320.
Vasile Rus, Brendan Wyse, Paul Piwek, Mihai Lintean, Svetlana Stoyanchev, and Cristian Moldovan. 2010. The first question generation shared task evaluation challenge. In Proceedings of the 6th International Natural Language Generation Conference, Association for Computational Linguistics, pages 251–257.
Mrinmaya Sachan and Eric Xing. 2018. Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 629–640.
Cicero D Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40.
Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. 2016a. Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 588–598.
Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. 2016b. Building end-to-end dialogue systems using generative hierarchical neural network models. In Association for the Advancement of Artificial Intelligence, volume 16, pages 3776–3784.
Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C Courville, and Yoshua Bengio. 2017. A hierarchical latent variable encoder-decoder model for generating dialogues. In Association for the Advancement of Artificial Intelligence,
pages 3295–3301.
Fadi Abu Sheikha and Diana Inkpen. 2010. Automatic classification of documents by formality. In Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on, IEEE, pages 1–5.
Fadi Abu Sheikha and Diana Inkpen. 2011. Generation of formal and informal sentences. In Proceedings of the 13th European Workshop on Natural Language Generation, Association for Computational Linguistics, pages 187–193.
Nathaniel J Smith, Noah Goodman, and Michael Frank. 2013. Learning and using language via recursive pragmatic reasoning about other agents. In Advances in Neural Information Processing Systems, pages 3039–3047.
Noah A Smith and Jason Eisner. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, pages 354–362.
Svetlana Stoyanchev, Alex Liu, and Julia Hirschberg. 2014. Towards natural clarification questions in dialogue systems. In AISB Symposium on Questions, Discourse and Dialogue, volume 20.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
Duyu Tang, Nan Duan, Zhao Yan, Zhirui Zhang, Yibo Sun, Shujie Liu, Yuanhua Lv, and Ming Zhou. 2018. Learning to collaborate for question answering and asking. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1564–1574.
Sean Trott, Manfred Eppe, and Jerome Feldman. 2016. Recognizing intention from natural language: Clarification dialog and construction grammar. In Workshop on Communicating Intentions in Human–Robot Interaction.
Lucy Vanderwende. 2008. The importance of being important: Question generation. In Proceedings of the 1st Workshop on the Question Generation Shared Task Evaluation Challenge, Arlington, VA.
Tong Wang, Ping Chen, John Rochford, and Jipeng Qiang. 2016. Text simplification using neural machine translation. In Association for the Advancement of Artificial Intelligence, pages 4270–4271.
Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. 2013. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 404–413.
Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3-4):229–256.
Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1(2):270–280.
Shuly Wintner, Shachar Mirkin, Lucia Specia, Ella Rabinovich, and Raj Nath Patel. 2017. Personalized machine translation: Preserving original author traits. In European Chapter of the Association for Computational Linguistics, pages 1074–1084.
Sander Wubben, Antal Van Den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, Association for Computational Linguistics, pages 1015–1024.
Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics 4:401–415.
Wei Xu, Alan Ritter, Bill Dolan, Ralph Grishman, and Colin Cherry. 2012. Paraphrasing for style. In Proceedings of the International Conference on Computational Linguistics 2012, pages 2899–2914.
Hayahide Yamagishi, Shin Kanouchi, Takayuki Sato, and Mamoru Komachi. 2016. Controlling the voice of a sentence in Japanese-to-English neural machine translation. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 203–210.
Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016. Neural generative question answering. In North American Association of Computational Linguistics.
Steve Young, Milica Gašić, Blaise Thomson, and Jason D Williams. 2013. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE 101(5):1160–1179.
Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint.
Xingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Saizheng Zhang, Sandeep Subramanian, and Adam Trischler. 2017. Machine comprehension by text-to-text neural question generation. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 15–25.
Zhicheng Zheng, Xiance Si, Edward Chang, and Xiaoyan Zhu. 2011. K2Q: Generating natural language questions from keywords with user refinements. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 947–955.
Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, pages 1353–1361.