ABSTRACT Title of dissertation: STRUCTURED APPROACHES FOR EXPLORING INTERPERSONAL RELATIONSHIPS IN NATURAL LANGUAGE TEXT Snigdha Chaturvedi, Doctor of Philosophy, 2016 Dissertation directed by: Dr. Hal Daume´ III Department of Computer Science Human relationships have long been studied by scientists from domains like so- ciology, psychology, literature, etc. for understanding people’s desires, goals, actions and expected behaviors. In this dissertation we study inter-personal relationships as expressed in natural language text. Modeling inter-personal relationships from text finds application in general natural language understanding, as well as real-world domains such as social networks, discussion forums, intelligent virtual agents, etc. We propose that the study of relationships should incorporate not only linguis- tic cues in text, but also the contexts in which these cues appear. Our investigations, backed by empirical evaluation, support this thesis, and demonstrate that the task benefits from using structured models that incorporate both types of information. We present such structured models to address the task of modeling the nature of relationships between any two given characters from a narrative. To begin with, we assume that relationships are of two types: cooperative and non-cooperative. We first describe an approach to jointly infer relationships between all characters in the narrative, and demonstrate how the task of characterizing the relationship between two characters can benefit from including information about their relation- ships with other characters in the narrative. We next formulate the relationship- modeling problem as a sequence prediction task to acknowledge the evolving nature of human relationships, and demonstrate the need to model the history of a rela- tionship in predicting its evolution. Thereafter, we present a data-driven method to automatically discover various types of relationships such as familial, romantic, hostile, etc. Like before, we address the task of modeling evolving relationships but don’t restrict ourselves to two types of relationships. We also demonstrate the need to incorporate not only local historical but also global context while solving this problem. Lastly, we demonstrate a practical application of modeling inter-personal rela- tionships in the domain of online educational discussion forums. Such forums offer opportunities for its users to interact and form deeper relationships. With this view, we address the task of identifying initiation of such deeper relationships between a student and the instructor. Specifically, we analyze contents of the forums to au- tomatically suggest threads to the instructors that require their intervention. By highlighting scenarios that need direct instructor-student interactions, we alleviate the need for the instructor to manually peruse all threads of the forum and also as- sist students who have limited avenues for communicating with instructors. We do this by incorporating the discourse structure of the thread through latent variables that abstractly represent contents of individual posts and model the flow of infor- mation in the thread. Such latent structured models that incorporate the linguistic cues without losing their context can be helpful in other related natural language understanding tasks as well. We demonstrate this by using the model for a very different task: identifying if a stated desire has been fulfilled by the end of a story. STRUCTURED APPROACHES FOR EXPLORING INTERPERSONAL RELATIONSHIPS IN NATURAL LANGUAGE TEXT by Snigdha Chaturvedi Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2016 Advisory Committee: Professor Hal Daume´ III, Chair/Advisor Professor Hector Corrada Bravo Professor Marine Carpuat Professor Chris Dyer Professor Philip Resnik c© Copyright by Snigdha Chaturvedi 2016 Dedication To my wonderful mother ii Acknowledgments I would like to thank several people who have helped me in working on this dissertation. First and foremost, I would like to thank Hal, who has been an excep- tional advisor, mentor and teacher. I could not have wished for a better advisor. I would also like to thank my committee members, Philip, Chris, Marine and Hec- tor for their inputs and useful advice through the course of my dissertation work. Thanks also to Dan Goldwasser and Ben Shneiderman for their valuable mentor- ship and support. I am also grateful to all my friends and colleagues, who made my graduate school life enriching and enjoyable. Finally, I would like to thank my parents, my husband, and my family for their unwavering encouragement and support throughout this process. iii Table of Contents List of Tables vii List of Figures ix 1 Introduction 1 1.1 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Outline of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . 7 2 Background 11 2.1 Natural Language Understanding . . . . . . . . . . . . . . . . . . . . 11 2.1.1 Event-centric Natural Language Understanding . . . . . . . . 11 2.1.2 Character-centric Natural Language Understanding . . . . . . 13 2.2 Social Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 2.3 Relationships in Literature . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 Technical Background . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.1 Structured Prediction . . . . . . . . . . . . . . . . . . . . . . . 18 2.4.2 Latent Variable Models . . . . . . . . . . . . . . . . . . . . . . 20 2.4.3 Expectation Maximization . . . . . . . . . . . . . . . . . . . . 22 3 Joint Relationship Identification in Movie Summaries 24 3.1 Problem Motivation and Definition . . . . . . . . . . . . . . . . . . . 24 3.2 Role of Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.4 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.5 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.6.1 Feature ablation . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.6.2 Structured vs. unstructured models . . . . . . . . . . . . . . . 33 3.6.3 Qualitative Discussion . . . . . . . . . . . . . . . . . . . . . . 34 3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 iv 4 Modeling Evolving Relationships Between Literary Characters 38 4.1 Problem Motivation and Definition . . . . . . . . . . . . . . . . . . . 38 4.2 Role of Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3.1 Segmentation Model . . . . . . . . . . . . . . . . . . . . . . . 41 4.3.2 Semi-supervised Framework . . . . . . . . . . . . . . . . . . . 43 4.4 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.4.1 Content features . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4.2 Transition features . . . . . . . . . . . . . . . . . . . . . . . . 48 4.5 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 4.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.6.1 Evaluation on the SparkNotes dataset . . . . . . . . . . . . . 52 4.6.2 Evaluation on the AMT dataset . . . . . . . . . . . . . . . . . 53 4.6.3 Feature Ablation . . . . . . . . . . . . . . . . . . . . . . . . . 54 4.6.4 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5 Unsupervised Modeling of Evolving Inter-character Relationships 59 5.1 Problem Motivation and Definition . . . . . . . . . . . . . . . . . . . 59 5.2 Role of Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.3 Feature-vectors Extraction . . . . . . . . . . . . . . . . . . . . . . . . 62 5.4 Learning Relationship Sequences . . . . . . . . . . . . . . . . . . . . 65 5.4.1 GHMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 5.4.2 Penalized GHMM . . . . . . . . . . . . . . . . . . . . . . . . . 66 5.4.3 Globally Aware GHMM . . . . . . . . . . . . . . . . . . . . . 67 5.5 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 5.6.1 Supervised Evaluation . . . . . . . . . . . . . . . . . . . . . . 73 5.6.2 Relationship Analogy Task . . . . . . . . . . . . . . . . . . . . 76 5.6.3 Coherence of Relationship States . . . . . . . . . . . . . . . . 78 5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 6 Application: Identifying Existence of Instructor-student Relationship in MOOC Forums 81 6.1 Problem Motivation and Definition . . . . . . . . . . . . . . . . . . . 83 6.2 Role of Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.3.1 Logistic Regression (LR) . . . . . . . . . . . . . . . . . . . . . 86 6.3.2 Linear Chain Markov Model (LCMM) . . . . . . . . . . . . . 88 6.3.3 Global Chain Model (GCM) . . . . . . . . . . . . . . . . . . . 93 6.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.5.1 Quantitative Results . . . . . . . . . . . . . . . . . . . . . . . 97 6.5.2 Visual Exploration of Categories . . . . . . . . . . . . . . . . . 99 6.5.3 Choice of Number of Categories . . . . . . . . . . . . . . . . . 101 v 6.5.4 How important are linguistic features? . . . . . . . . . . . . . 104 6.6 Additional Application: Understanding Desire Fulfillment . . . . . . . 105 6.6.1 Problem Motivation and Definition . . . . . . . . . . . . . . . 105 6.6.2 Model: LCMM . . . . . . . . . . . . . . . . . . . . . . . . . . 106 6.6.3 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.6.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 7 Conclusion 115 7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.2.1 Domain-specific Models . . . . . . . . . . . . . . . . . . . . . 117 7.2.2 Enhancing Latent Variable Models . . . . . . . . . . . . . . . 117 7.2.3 Applications to other NLP problems . . . . . . . . . . . . . . 118 7.2.4 Widening the definition of relationships . . . . . . . . . . . . . 119 7.2.5 Relevance to Digital Humanities . . . . . . . . . . . . . . . . . 121 7.2.6 Relevance to Psychological and Sociological theories . . . . . . 122 A Frames Lexicon 124 B Conformity Lexicon 130 References 132 vi List of Tables 3.1 Test set results for relation classification. . . . . . . . . . . . . . . . 34 4.1 Frame samples used by ‘Semantic Parse’ features. . . . . . . . . . . . 47 4.2 Cross validation performances on SparkNotes data. The second order model based framework outperforms the one that uses a first order model and the unstructured baselines. . . . . . . . . . . . . . . . . . 51 4.3 Performance comparison on the AMT dataset. The second order model based framework outperforms the one that uses a first order model and the unstructured models. . . . . . . . . . . . . . . . . . . . 53 4.4 Sample character pairs (and book titles) from the clusters obtained from the summaries of the Harry Potter series. Pairs in clusters 1 and 3 had a cooperative and non-cooperative relationship throughout the novel (respectively). Cluster 2 contains pairs for which the rela- tionship became non-cooperative once in the novel but then finally became cooperative. . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 5.1 Representative words for relationship states learned by the Globally Aware GHMM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 6.1 Held-out test set performances of chain models, LCMM and GCM, are better than that of the unstructured models, LR and J48. . . . . 97 6.2 Representative posts from the four categories learnt by our model. Due to space and privacy concerns we omit some parts of the text, indicated by “. . . ”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 6.3 Feature definitions (Section 6.6.3). F1-F3 are extracted for each ex- ample while F4-F17 are extracted for each evidence. . . . . . . . . . . 108 6.4 Test set performances of various models reporting averaged Preci- sion, Recall and F measure of the two classes. Our Latent Chain Markov Model, LCMM, outperforms the unstructured models (Logis- tic Regression, LR and Decision Trees, DT) and the majority baseline (which always predicts the majority class). . . . . . . . . . . . . . . . 113 A.1 Lists of negative frames. . . . . . . . . . . . . . . . . . . . . . . . . . 126 vii A.2 Lists of positive frames. . . . . . . . . . . . . . . . . . . . . . . . . . 128 A.3 Lists of ambiguous frames. . . . . . . . . . . . . . . . . . . . . . . . . 129 A.4 Lists of relationship frames. . . . . . . . . . . . . . . . . . . . . . . . 129 B.1 Lists of conforming and dissenting phrases used by the Sustenance features. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 viii List of Figures 1.1 General form of our context aware relationship modeling approaches . 5 3.1 Figure depicting importance of textual as well as contextual evidence in determining relationship polarities. . . . . . . . . . . . . . . . . . . 25 3.2 Triadic structural features. ‘+’ signs indicate cooperative and ‘–’ indicate adversarial relationships . . . . . . . . . . . . . . . . . . . . 30 3.3 Sample annotation for a very short summary . . . . . . . . . . . . . . 31 3.4 Ablation study for text feature families . . . . . . . . . . . . . . . . . 32 3.5 Inferred relationships for movies ‘Titanic’ (1997) and ‘Ice Age’ (2002) 35 3.6 Two relevant subsets of the inferred relationships for movie: ‘Smokin’ Aces 2: Assassins’ Ball’ (2010). The model incorrectly learns a co- operative relationship between Lester and McTeague, and a non- cooperative relationship between Agents Baker and Redstone. . . . . 36 4.1 Sample sentences from a narrative depicting evolving relationship be- tween characters: Tom and Becky. The relationship changes from cooperative (+) to non-cooperative (-) and then back to cooperative (+). ‘. . . ’ represents text omitted for brevity. . . . . . . . . . . . . . 40 4.2 Ablation results on SparkNotes dataset. All represents performance with full feature-set and rest of the bars indicate performance with incrementally removing various feature-families. . . . . . . . . . . . . 55 5.1 Excerpts from the summary of the novel ‘Harry Potter and the Pris- oner of Azkaban’ depicting non-stationary nature of relationship be- tween Harry Potter and Sirius Black. Their relationship changes from hostility to cooperative alliance. . . . . . . . . . . . . . . . . . . . . 60 5.2 Various relationship feature-sets extracted for a sample sentence de- picting relationship between Jim and Maria. . . . . . . . . . . . . . . 64 ix 5.3 Diagram for the Globally Aware GHMM (for 2 sentences of a sequence). Feature-vector representations of individual sentences, ~ft, and that of the complete sequence (global), ~F , are observed. Relationship states, rt and global versus local choices, ct are hidden. γ is the model’s preference towards the local component. ~w represents the weights of the global com- ponent (with a logistic regression form), and φ and pi represent the local component (transition and start-state probabilities respectively). Penalty ρ is not shown for clarity. ~µ and Σ are the parameters of Normal distri- bution. L is the number of sequences in the dataset and R is the number of relationship states. . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 5.4 Performance comparison of various models. Previously, we obtained an F-measure of 76.76 on the same dataset using supervised models (Chapter 4). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.5 Global vs local preference learned by the Globally Aware GHMM with and without penalty. The model without penalty has weaker prefer- ence for a global categorization which can result in frequent shifts in relationship states within a sequence, resulting in lower accuracy. . . 76 5.6 Screenshot from human evaluation task . . . . . . . . . . . . . . . . . 77 5.7 Model Precision (MP) for the Word-intrusion detection task for re- lationship states learned by our model. MP is high for most states indicating semantic coherence. . . . . . . . . . . . . . . . . . . . . . . 79 6.1 Example posts demonstrating that in order to attract instructor’s attention students have to resolve to the inefficient method of getting their posts upvoted by their classmates. Therefore, there is a need to design models that could automatically identify threads on which instructors intervene. . . . . . . . . . . . . . . . . . . . . . . . . . . . 84 6.2 Diagrams of the Linear Chain Markov Model (LCMM) and the Global Chain Model (GCM). pi, r and φ(t) are observed and hi are the latent variables. pi and hi represent the posts of the thread and their la- tent categories respectively; r represents the instructor’s intervention and φ(t) represent the non-structural features used by the logistic regression model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89 6.3 Instructor’s intervention decision process for the Linear Chain Markov Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 6.4 Visualization of lexical contents of the categories learnt by our model from the GHC dataset. Each row is a category and each column rep- resents a feature vector. A light color represents high values while lower values are represented by darker shades. Dark monochromatic columns are used to better separate the five feature clusters, F1-F5, which represent words that are common in thanking, logistics-related, introductory, syllabus related and miscellaneous posts respectively. Categories 1,2,3 and 4 are dominated by F2, F4, F1 and F3 respec- tively indicating a semantic segregation of posts by our model’s cat- egories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 x 6.5 Cross validation performances of the two models with increasing num- ber of categories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.6 Cross validation performances of the various feature types for the two datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.7 Example consisting of: a Desire Expression, Evidence fragments (e1. . .e5) and a binary Desire Fulfillment Status. The Desire-subject is marked in bold fonts in the Desire-expression. . . . . . . . . . . . . . . . . . . 106 6.8 Artificial example indicating feature utility. The Desire-subject men- tions are marked in blue, actions in bold and emotions in italics. Discourse feature is underlined. . . . . . . . . . . . . . . . . . . . . . 109 7.1 Network Inferred from the Wikipedia article on 2003 Invasion of Iraq. The + and - signs indicate cooperative and non-cooperative relation- ships respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 xi Chapter 1: Introduction Relationships and social bonds are a fundamental aspect of human nature. They serve not only as manifestations of our social temperament, but are also con- sequences of evolutionary design which has led to our emergence as a successful species [3]. In particular, human relationships exhibit a range of phenomenon such as family, friendship, hostility, romantic love, formality, authority, etc. Understanding human interactions and relationships is hence necessary for understanding people’s actions, behavior and goals. Due to the insights offered by relationships into human conduct and behavior, scientists from diverse fields like sociology [4, 60], psychol- ogy [14] and literature [87] have been interested in studying human relationships. As early as 4th century BC, Plato suggested that love (representing human bonds) directs the worlds of Gods as well as humans [86]. In this thesis, we adopt a com- putational route to studying relationships, and model inter-personal relationships from digitally available natural language text. Such texts, especially narratives, are rich reflections of social relationships; and hence provide an attractive medium for their analysis. A narrative is an account of events or experiences, true or fictitious; and forms an indispensible part of our day-to-day lives. Ubiquitous examples of real or 1 fictional narratives include novels, movies, TV shows, news articles, blog articles, discussion forums, etc. Children learn and internalize societal norms and values through narratives in stories [30, 43]. As they grow older, they learn to increasingly share desires, experiences, and information through narratives [18]. In short, humans use narratives as a tool to express their (or others) interactions and relationships with others. In this sense, automatically reading and understanding narratives can serve as a platform for automatic systems to study inter-personal relationships and learn of human behavior and norms. The complexity of these narratives ranges from simple fables to abstract literary novels and news articles and interactive discussion forums, and interpreting and comprehending them can not only assist in acquiring background knowledge or common-sense but is essential for the general task of Natural Language Understanding (NLU). The field of computational linguistics has long been interested in algorithmi- cally understanding, representing and generating narratives [71]. In this context, the focus of much of early research has been on understanding text, including narratives, from the perspective of events mentioned in them [94, 65]. Prior research assumes that events are the core building blocks of representing text and interprets a text’s meaning as a sequence of events that are directly observed or inferred from the text. This style of research focuses more on what happened (plot) and less on who did it (characters). For example, in ‘John went to play basketball.’ the focus would be on the act of playing basketball with John acting as an ancillary participant in the event, rather than viewing John as the protagonist who intentionally performs an act. An alternative line of research flips this attention on studying characters in a 2 text and what they do [38]. Recent approaches have proposed studying text from the perspective of people involved [9, 40]. E.g., reading Wikipedia to gain infor- mation related to an individual’s life [10]; incorporating attributes of authors while understanding text [80, 93]; representing documents through their readers [39]; un- derstanding roles of users of an email system [75]; etc. Especially, while studying fictional narratives, there have been several character-centric approaches that ex- plain the set of observed actions in the narrative using characters’ personas or roles and their expected behavior in those roles [105, 11, 12]. However, people’s roles, attributes and actions are indispensably connected to and influence others around them. So, rather than analyzing personal attributes in isolation, this dissertation proposes studying interactions or relationships between people. Existing research on analysis of people’s relationships in text has been limited to simple schemes. These include constructing social networks of characters in a novel based on their co-occurrences in quoted conversations [42] and social events [2], and networks of participants of online-discussion forums [54] and email systems [75]. These methods primarily focus on identifying existence of relationships. Previous research has also focused on methods to explore certain specific aspects of relations such as power or authority in email conversations [17] and Supreme court dialogs and discussion among Wikipedia editors [32], address formality [64], and sentiment [55] in online discussion forums. Modeling relationships has utility for several real-world tasks. For example, predicting plausible relationships between people, using their posts or messages, can help social networking platforms and discussion forums in personalizing news 3 feeds, predicting virality, suggesting friends or topics of interest, etc. for a particular user. In doing so they could incorporate evolving nature of social relationships by focusing on activities of current friends [7, 63, 35, 6] and paying lesser attention to activities of connections who are no longer of much interest to the user. Similarly in email systems, understanding how users interact with each other can help in making smarter email systems [50, 16]. For example, determining the level of formality of the textual content of an email being composed, and also identifying plausible recipients who are in a formal relationship with the user can help in making better recipient recommendations. Similarly arranging emails according to the nature of relationship of the recipient with the sender can help in better managing the recipient’s inbox by clustering emails from friends in a different cluster than emails from office-colleagues. In general, identifying the nature of relationships between individuals can assist automatic understanding of text by explaining the actions of people mentioned in the text and building expectations of their behavior towards others [2, 42]. In particular, predictive models for inferring relationships from natural text can be especially use for researchers from the digital humanities domain. Though we make simplifying assumptions about definition of relationships to focus on how they can be inferred from text, applications based on our ideas could range from stylistic analysis and author preferences in depicting character relationships in lit- erary works to mining geopolitical or business relations from historical archives or correspondences. 4 Figure 1.1: General form of our context aware relationship modeling approaches 1.1 Thesis Statement In this dissertation, I propose that incorporating linguistic cues as well as the context in which they appear helps in modeling the nature and the existence of inter-personal relationships in text. Specifically, I address aspects of modeling relationships between people in narratives such as movie and novel summaries; and provide a real-world application of predicting existence of inter-personal relationships in discussion forums. To solve these problems, I provide structured models based on hand-crafted features that depend on the input text, candidate output and the context in the text (Figure 1.1). 1.2 Contributions The contributions of this work are summarized as: 1. We formulate the problem of modeling inter-personal relationships in vari- ous settings including fictional narratives as well as real-world domains like discussion forums. 5 2. We present methods to model nature of relationships between pairs of char- acters in a narrative summary, and demonstrate the utility of incorporating information about characters’ relations with other characters in the narrative (Chapter 3). 3. We then formulate the task of modeling the evolving nature of inter-personal relationships by addressing this as a supervised sequence prediction task which is able to incorporate historical context (Chapter 4). 4. We also allow a data-driven exploration of various types of relationships while modeling their evolving nature, and simultaneously demonstrate the impor- tance of including local as well as global context (Chapter 5). 5. We present a real-world application of our idea by addressing the problem of analyzing the contents of educational discussion forums to predict when the instructor would reply on a thread. While solving this problem we demonstrate the need for sequential models that can view individual posts in context of other posts of the thread. We additionally validate the utility of our model by applying it to another task of identifying if a desire expressed in a short paragraph was fulfilled (Chapter 6). 6. We annotate and release two datasets of inter-personal relationships. The first dataset consists of relationship annotations in about 150 movie summaries (Chapter 3), and the second dataset of novel summaries depicts evolving re- lationships between about 100 pairs of characters (Chapter 4). 6 7. We also release two other datasets of textual paragraphs containing expression of a desire and annotated with desire fulfillment (Section 6.6). 1.3 Outline of this Dissertation With this goal of inferring relationships from text in mind, we start with fictional narratives (like movies and novels) as they are relatively well-studied and are more self-contained as compared to real narratives like news articles. We then also demonstrate a real-world application while analyzing discussion forums. We start with a short description of previous work that has been done in the general area of understanding natural language text (Chapter 2). In this chapter we provide a brief introduction to the two major types of techniques explored in this field: event-centric and character-centric natural language understanding. We then begin with addressing the problem of characterizing the nature of relationships between pairs of characters in movie summaries in Chapter 3. While real-life relationships have multiple facets (power, influence, friendship, animosity, romance, etc.), we make a simplifying assumption of a binary relationship status (cooperative/non-cooperative) in this chapter; and instead focus on teasing apart the connection between text and relationships. In other words, this assumption makes the problem computationally tractable and allows us to focus on analyzing the different types and sources of information for inferring relationships from text. For example, while analyzing a typical romantic movie, we want to identify that the protagonist is in a non-cooperative relationship with the villain because they 7 try to engage in actions that harm each other, while he is in a cooperative rela- tionship with his lover because they assist each other. While solving this problem, we demonstrate the importance of understanding the relationships between the two characters in context of their relationships with other characters in the narrative. In other words, rather than looking at relationship between pairs of characters in isola- tion, we also incorporate structural cues from the social community in the narrative by incorporating transitive phenomena like ‘friend of a friend is a friend’, ‘common enemy’, etc. Specifically, if two people have a common friend or a common enemy, they are more likely to be in a cooperative relationship. However in real-world, relationships between people can change with time. In Chapter 4, we model the evolution of inter-personal relationships in novel sum- maries. Like the previous chapter we assume that relationships are only of two types: cooperative/non-cooperative. With this intuition, we present the relationship-modeling problem as a structured prediction task that can incorporate historical context while modeling the evolving nature of relationships between characters. For example, in the novel ‘Adventures of Tom Sawyer’, Tom is in love with Becky Thatcher. How- ever there is a period in the novel when their relationship sours (see Figure 4.1 in Chapter 4). In this work, we want to identify their relationship as not just a sin- gle (most-likely) cooperative state but a sequence of cooperative, non-cooperative and then back to cooperative states depicting their relationships at various points in the novel. We empirically show how this problem can benefit from structured models that apart from including linguistic cues about a relationship’s nature can also remember a history of the characters’ relationship in the novel. 8 The previous chapters make a limiting assumption about types of relationships– they are binary in nature (cooperative/non-cooperative). In the next chapter, Chap- ter 5, we present an unsupervised method that relaxes this constraint, and allows a data-driven inference of these inter-character relationship types. We assume that the textual excerpts, that serve as our evidences for modeling relationships, belong to discrete latent clusters. Each cluster represents a single relationship-type identi- fiable by the frequent words in the cluster. For example, a cluster containing words like father, mother, brother is likely to represent familial relationships. We focus on the problem of modeling evolving inter-personal relationships like before, and while addressing this problem we demonstrate the importance of incorporating a local as well as a global context of inter-character relationships. Thereafter in Chapter 6, we present real-world applications of our work on interaction modeling in the domain of MOOCs (Massive Open Online Courses) forums. These forums serve as platforms for their users to interact or develop rela- tionships. In this work, we focus on the student-instructor relationship, and viewing their textual exchanges on the forum as cues of development of their relationship we address the task of predicting which threads are likely to be in need of an instructor’s intervention. This implicitly models the intervention of an instructor in a discussion thread as the initiation of a stronger relationship with the student who started the thread. By doing this, we alleviate the need for MOOC’s instructional staff to man- ually inspect all threads in forums, which is prohibitively time consuming owing to typically large sizes of MOOC forums. We propose two models that include context in form of the linear chain structure of the threads and empirically demonstrate that 9 they outperform structure-unaware models. We also apply our model to a different problem of identifying if a desire expressed in a short narrative subtext is fulfilled. In this case, the model extracts cues about the likely desire fulfillment status from individual sentences of the paragraph while viewing them in context of each other. Lastly in Chapter 7, we summarize our contributions and present avenues for future work in this field. 10 Chapter 2: Background 2.1 Natural Language Understanding This section provides a brief overview of prominent event-centric and character- centric approaches to understanding narratives. 2.1.1 Event-centric Natural Language Understanding Understanding natural language text has attracted linguistics for a long time. Some of the earliest works have focused on understanding text by designing knowl- edge structures of the events described in the text. These methods have focused on representing text using structured sequences of prototypical events, their par- ticipants and causal relationships describing them called scripts [94] or Fillmorean frames. For example, these methods argue that understanding a text document about eating at a restaurant involves understanding the ‘restaurant script’ which de- scribes the usual events in the process (sitting, ordering, bringing food, eating, etc.), their relative orders (ordering happens before eating, etc.), participants (customer, waiters, etc.), preconditions and results. Frames [8] are also ‘script-like’ structures that are hierarchically arranged and represent mental concepts (e.g. commercial 11 transaction) and their participants (e.g. buyer, seller). Another one of the earliest representations of narratives include plot-units [65]. This representation attempted to summarize a narrative using several ‘plot-units’, each of which consists of a graphical structure involving affect states. An affect state occurs with respect to single character and indicates a positive reaction (+); or a negative reaction (-) to a situation; or a neutral affect ‘Mental’ state (M) indicating a combination of desires and intentions. The events in the narrative are described using causal links between affect states of the various characters involved. Lehnert uses four types of links describing causalities behind mental states (Motivation), intentions behind events (Actualization), displaced impact of an event terminat- ing another state (Termination), and equivalence behind multiple perspectives of a single state (Equivalence). Using this structure, she represents common plot-units describing success or failure in fulfilling an intention, an instance of honored promise, an act of kindness, etc. For example, a success (or failure) can be represented using an actualization link between an M state and a + (or -) state. The narrative is then described using these plot-units. However, the representational structures used in these methods are hand-coded and also domain-dependent in some cases. Later works have focused on automati- cally learning these structures. Mooney and DeJong [79] presented a non-statistical method of automatically inducing scripts from unstructured text. Chambers and Jurafsky [23] present a three-step statistical process to learn partially ordered set of events related to a protagonist, which they refer to as the ‘narrative chains’. The directed edges between the events in a chain represent the temporal relationships 12 between them. They use a novel application of pairwise mutual information and temporal relation learning to learn these chains. Later, they build upon this idea to represent events related to all characters in the narrative by building ‘narrative schemas’ [22]. The arguments of events in the schemas are filled by participants’ semantic roles. They also present methods to learn these schemas probabilistically albeit without the temporal ordering [21]. Another approach that focuses on learn- ing scripts as a bag of related events was proposed by Cheung et al. [29]. Regneri et al. [91] propose a graph learning approach based on Multiple Sequence Alignment methods popular in bioinformatics, and a semantic similarity function to cluster event sequences describing a scenario (e.g. eating at a restaurant) into a directed graph. Recent methods have also used Recurrent Neural Networks [84] and Hidden Markov Models [83] to learn scripts from unstructured text. Similarly, statistical methods have also been employed to produce plot-units representations of narrative text [53], and to learn plot-units from folktales using Bayesian model merging, a grammar induction technique [47]. 2.1.2 Character-centric Natural Language Understanding An alternate perspective attempts to understand text, especially narratives, from the viewpoint of characters. Most approaches in this field are based on Vladimir Propp’s Structuralist narrative theory [89] on character-roles in folktales. According to Propp’s theory, each role (e.g. a hero, a villain, etc.) has a sphere of actions, which defines the core actions of the character fulfilling that role. For example, irrespective 13 of what character fulfills the role of the villain in a story, his/her actions will be centered around malicious acts, violence, cheating, etc. Based on this theory, Valls- Vargas et al. [105] have studied actions of various characters of Russian folktales to assign roles to them. Similarly, Bamman et al. [11] and [12] present probabilistic methods to learn roles or personas of the various characters described in movie plots and novels (respectively). However, unlike previous methods, they do not pre-define the meanings of roles, but instead learn the definitions of personas in a data-driven manner. Wilenskys PAM (Plan Applier Mechanism) system [109] is also related to this perspective of understanding narratives. PAM understands a narrative by treating characters as intentional agents. It interprets every action of a character as instan- tiating some goal and the character’s plan to achieve that goal. Such methods that treat characters as intentional agents can be useful in explaining characters’ moti- vations and actions, and also help text generation systems that program interesting behaviors for agents [72]. This character-centric understanding of text motivates the problems and meth- ods described in this dissertation. However, instead of viewing characters as assum- ing specific roles, we study relationships between them. In the next two sections we discuss relationships in detail. 14 2.2 Social Relationships Max Weber [106] uses social actions to describe social relationships. He defines actions as human activities to which the actor attaches a subjective meaning. A social action is one whose subjective meaning takes account of the behavior of others and is oriented in its course. Social relationships are then defined on the basis of patterns of social ac- tions [106, 62]. They denote behavior of multiple actors such that the action of each takes account of that of the others and is oriented in these terms. A social relationship essentially assumes mutual orientation of the actors and the content of mutual orientation could be characterized by friendship, anger, sexual attraction, competition, conflict, or economic exchange. It is possible that the concerned par- ties attach different ‘meanings’ to their actions in which case the relationship will be asymmetrical from the points of view of the two parties. A social relationship can be of a fleeting character or permanent. In case of a permanent relationship, there is a probability of repetition of the behavior, which corresponds to its subjective mean- ing. This regularly repeated behavior eventually come to be expected, and forms patterns that we may call institutions which include formal institutions like school and workplace as well as voluntary ones like friendship, or peer-group. The content of these regularized relationships can become ‘maxims’ - forms of actions that are adhered to or are expected to be adhered to. When these maxims become more regularized they develop into customs, or even laws. Lastly, subjective meanings of relationships can change due to mutual agreement, thus a political relationship once 15 based on solidarity may develop into a conflict of interests. The work presented in this dissertation is guided by this definition of rela- tionships. We use character’s or people’s actions (apart from other indicators) as features that help us determine the nature of relationships. This is especially true of our data driven method presented in Chapter 5 which, in some sense, exploits repeated actions by individuals to discover norms or expected behavior that consti- tute the meaning of a relationship. In Chapters 4 and 5, we also allow relationships to be fleeting or change with the progress of the narrative. However, in this work we do not study asymmetric relationships and leave that as possible future work. 2.3 Relationships in Literature In this section we present some previous works that study social relationships in narrative texts. While this domain is large, we provide representative examples of various types of work to familiarize the reader with this domain. Polhemus [87] focused on erotic love and studied works by several authors in- cluding Jane Austen, Walter Scott, the Bronte¨s, Dickens, George Eliot, Trollope, Thomas Hardy, Joyce, D. H. Lawrence, Virginia Woolf, and Samuel Beckett. Pol- hemus draws parallels between religious faith and erotic love described by these authors which ranges from decorous to bold portrayal of love. Several works have also studied romantic relationships in works of specific authors. For example, White [108] studied evolution and history of romantic dramas that constitute early works of Shakespeare; previous works [100, 57] studied the love- 16 hate relationships as depicted in Jane Austen’s novels; etc. Apart from romantic relationships, literary work has also focused on other types of relationships. Abel [1] used object relations theory to study dynamics of female friendships in various novels. Frith [49] provided an in-depth study of rela- tionships as depicted in works by female novelists. They analyzed female friendship as a formative experience and its role in shaping gendered identity. Ford [48] ana- lyzed father-daughter relationships, especially in context of a suitor for the daughter, in works of several writers. Previous work have also studied the implications of such fictional father-daughter relations on the family unit as well as the extent to which an author’s work mirrors the social structure at that time [15]. There has also been work on closely studying and analyzing specific types of relationships as depicted by certain novelists or relationships between specific char- acters in certain novels. These include study of female friendship between characters of Matilde Serao [44]; the complicated relationship between Catherine and Heathcliff in Wuthering Heights [52]; etc. Our goal in this dissertation is not to provide such ‘close reading’ or in-depth analysis of specific novels or authors; but to provide methods that could lead to ‘dis- tant reading’ of several narratives simultaneously, and could infer relationship-types automatically. As pointed out before, 1 specific assumptions made by computational models in the domain of social sciences might limit their direct utility. However, such methods can be extended for use in digital humanities and social sciences. 1http://languagelog.ldc.upenn.edu/nll/?p=6968 17 2.4 Technical Background We now describe the prominent Machine Learning tools that motivate the methods described in this dissertation. 2.4.1 Structured Prediction Structured prediction refers to a class of machine learning methods in which the output of the algorithm is structured in nature rather than being a scalar bi- nary, categorical or real value. The input to the algorithm can be structured or unstructured. Structured prediction finds numerous applications in Natural Lan- guage Processing, Computer Vision, Computational Biology etc. Owing to the interaction between the various parts of the structured output, the number of its possible configurations is often very large, making the task of inference (or finding the best output) non-trivial. For this reason, most structured prediction models are often more complicated than their unstructured counterparts, employing dynamic programming or approximate methods for inference. These methods often assume joint features that depend on the input as well as the candi- date output, and most commonly maximize a linear scoring function based on the dot product of features and their weights. The approaches proposed in this disser- tation are based on two major structured prediction models: Structured Perceptron and Structured SVM. Structured Perceptron [31] learns the weights of the joint features using an itera- tive method based on the classic perceptron algorithm for learning linear boundaries. 18 In particular, given an input, x, and a candidate output, y, it assigns a ‘score’ to this candidate output as follows: score = w · φ(x, y) where φ(x,y) represents joint features that can depend on the input as well as the candidate output, and w are the candidate weights. Algorithm 1 shows the procedure for learning the feature weights. The inference procedure usually employs a dynamic programming based method (such as Viterbi) or an approximate method (such as belief propagation). Algorithm 1 Training algorithm for Structured Perceptron 1: Inputs: Training examples (xi,yi) 2: Output: Feature Weights w 3: Initialization: Set w = 0 4: for t : 1 to T , i : 1 to N do 5: yˆi = arg maxy∈Y [wt · φ(xi, y)] 6: If (yˆi 6= yi) wt+1 = wt + φ(xi, yi)− φ(xi, yˆi) 7: end for 8: return w The structured perceptron algorithm has convergence guarantees similar to the traditional perceptron method. Collins [31] showed that if the training set is separable with margin δ, then the above mentioned algorithm (Algorithm 1) makes at most R2/δ2 mistakes before convergence, where R is a constant such that ∀i, ∀y ∈ Y , ||φ(xi, yi)− φ(xi, y)|| ≤ R. 19 Structured SVM is a method that incorporates the idea of margin-based learning for structured prediction. Like before, the goal is to learn a linear prediction rule of the form: fw(x) = arg max y∈Y [w · φ(x, y)] Training a Structural SVMs involves solving the following optimization [104, 61]: min w 1 2 ||w||2 + C N∑ i=1 l(yi, yˆi(w)) where, yˆi(w) = arg maxy∈Y w · φ(xi, y), and l is a user supplied loss function. However, solving this problem is difficult due to the non-convex and discon- tinuous nature of typical forms for l. Tsochantaridis et al. [104] overcome these difficulties by replacing the loss function l with a piecewise linear convex upper bound (margin rescaling). Thus training this method now requires solving the fol- lowing convex optimization problem: min w 1 2 ||w||2 + C N∑ i=1 [ max yˆ∈Y [l(yi, yˆ) + w · φ(xi, yˆ)]− w · φ(xi, yi) ] 2.4.2 Latent Variable Models Latent Variable Models are a broad class of methods that assume that apart from the observed input and output variables, there exist another set of variables which are hidden or latent and so are not directly observed. Hence instead of directly generating/predicting the output from the input, these methods additionally assume one or more layers of latent variables that are directly related to the input and the output. Assuming their existence may not only help in predicting better outputs, but also in explaining the data. 20 There have been several approaches that couple latent variables with struc- tured predictions. Examples of such methods include Structured Perceptron [102], and Latent Structural SVMs [113]. Both these methods assume a set of (structured) latent variables, h. The features for the model, φ(x, y, h), are extracted not only from the input, x, and a candidate output, y but also on this latent variables h. Methods like these help in capturing latent dependencies and hidden sub-structure [103]. The formulation for Latent Structured SVM is similar to that of Structured SVM. The goal is to now learn linear prediction rule of the form fw(x) = arg max (y,h)∈Y×H [w · φ(x, y, h)] The objective function is now written as: min w [1 2 ||w||2+C N∑ i=1 max (yˆ,hˆ)∈Y×H [l(yi, yˆ, hˆ)+w ·φ(xi, yˆ, hˆ)] ]−C N∑ i=1 [ max h∈H w ·φ(xi, yi, h) ] Hidden Markov Model (HMM) [13] is another latent structured prediction method typically used for sequential or temporal input (chain of observations), x = 〈x1, x2, x3, . . . xT 〉, and output (a chain of states), y = 〈y1, y2, y3, . . . yT 〉. An exam- ple of such a problem includes Part-of-speech (POS) tagging where the input is a sentence (viewed as a chain of words) and the goal is to label each word with a POS tag. Thus the structured output is a chain of POS tags. The model makes Markov assumption which means that at any point in the process, the current output state depends on previous few states. In a simple Markov Model, the states are observed while in a Hidden Markov Model the output states are hidden or latent. Assuming a first order Markov assumption for simplicity, HMM defines the joint distribution 21 over input and output sequences as: p(x,y) = T∏ t=1 p(yt|yt−1)p(xt|yt) where, p(yt|yt−1) and p(xt|yt) are the state transition and observation emission prob- abilities respectively. The parameters of an HMM, the emission and the transition probabilities, and are usually learned using the Baum-Welch algorithm. 2.4.3 Expectation Maximization Expectation-Maximization (EM) [36] is an optimization algorithm used to learn parameters of a model which depends on latent variables. It is applicable when the objective function resembles a likelihood in presence of the missing data. It is an iterative process consisting of two steps: the E step in which the current parameters are used to obtain an expectation of the sufficient statistics of the missing variables; and the M step in which these expectations are used to update parameters. This method is guaranteed to converge to a local optimum of the likelihood. A variant of the above mentioned method is hard-EM. In this method, instead of computing expectations, the hidden variables are assigned their ‘most likely’ value. Another variation of EM, Generalized (Incomplete) EM, is used in cases when finding the maximum likelihood parameters in the M-step might be difficult even with the given expectations (or completed data). In such a case, M-step can proceed by improving the likelihood slightly (for example by using a gradient step). In this dissertation we use structured latent variables to help us model the underlying structure of the textual data while inferring relationships. We jointly 22 learn/assign these latent variables along with the rest of the data. However, the learning method depends on the nature of the latent variable and the problem setting and will be discussed the corresponding chapters. 23 Chapter 3: Joint Relationship Identification in Movie Summaries Work described in this chapter was published in AAAI 2016 [97] In this chapter we start with the problem of identifying relationships between characters in a narrative. Our domain is movie summaries and we make a sim- plifying assumption that relationships can be of only two types: cooperative vs non-cooperative. We propose to include structural cues from the social commu- nity in the narrative to characterize the relationships between individual character pairs in context of their relationships with other characters. We demonstrate that inference of relationship polarities can be significantly improved by considering their relationships with other characters. 3.1 Problem Motivation and Definition The character-centric approaches for understanding text believe that compre- hending a narrative requires the ability to interpret character roles and their ex- pected behavior towards other characters in those roles. While existing approaches can identify characters types or personas [11, 12, 105], they do not model rela- tionships between characters in a narrative. We present methods to model inter- character relationships instead of characters’ roles. 24 Young drifter Axel Nordmann goes to work in a gang of stevedores headed by Charlie Malik, a vicious bully, and is befriended by Tommy Tyler, who also supervises a stevedore gang. Malik resents blacks in positions of authority, and is antagonized when Axel goes to work for Tommy. Axel moves into Tommy’s neighborhood and becomes friends with Tommy’s wife Lucy. Axel is hiding something, and it emerges that he is a deserter from the United States Army. Malik is aware of that, and is extorting money from him. Malik frequently tries to provoke Tommy and Axel into fights, with Tommy coming to Axel’s aid ... (a) Sample summary extract for the 1957 movie ‘Edge of the City’. (b) Inferred relationship polarities with supporting evidences. Figure 3.1: Figure depicting importance of textual as well as contextual evidence in determining relationship polarities. Specifically, given two characters from a movie summary, we address the prob- lem of automatically identifying binary (cooperative and non-cooperative) relation- ships between them based on an analysis of the content of the narrative summary. 25 3.2 Role of Context Following previous works, we incorporate several linguistic cues for solving this problem (Section 3.4). However, apart from linguistic knowledge, there is a lot of contextual information to be exploited in this problem. For example, let us consider the plot summary in Figure 3.1a (condensed here for brevity). In this passage, the relations between the principal characters are explicated through a combination of cues, as seen in Figure 3.1b. For instance, one can infer that Alex (A) and Tommy (T) have a cooperative relationship through a combination of the following observations (among others): (1) T initially ‘befriends’ A, (2) A works for T, and its connotation that A is likely to cooperate with T , (3) T aids A in fights, (4) A is a friend of T’s wife , (5) A and T have a common adversary. In particular, we note that cues (4) and (5) cannot be extracted from looking at the relation between A and T in isolation, but depend on their relations with others. In this work, we show that such indirect structural cues can be very significant for inference of inter-character relationships. In other words, instead of independently identifying the nature of relationship between two characters, it is important to view their relationship in context of their relations with other characters in the narrative. 3.3 Model The example presented above suggests that a joint inference model that incor- porates both context and text would perform better than one that considers pairwise 26 relations in isolation. Therefore, we formulate the problem of relation classification as a structured prediction task, where we attempt to jointly infer the collective assignment of relation-labels for all pairs of characters in a narrative (movie sum- mary). Let x denote a narrative document for which we want to infer relationship structure y. We could think of x as a graph with characters as nodes, and rela- tionships between characters corresponding to edge-labels. We assume a supervised setting where we have labeled training set and we consider linear structured classifier of the form: hw = arg max y∈Y(x) wTφ(x,y) (3.1) Here, φ(x,y) is a feature representation for the narrative and a relation-assignment pairing and w is a vector of feature-weights which can be learnt using a voted struc- tured perceptron training algorithm [31]. Finding the best assignment corresponds to the decoding problem, i.e. finding the highest scoring assignment under a given model. Structured Perceptrons have been conventionally used for simple structured inputs, which are amenable to efficient dynamic programming inference algorithms. This is because updates require exact inference over an exponentially large space (solving the decoding problem in Equation 3.1), and updates from inexact search can violate convergence properties. In a problem like ours where the output is a network, finding the best scoring assignment (exact inference) can be computationally pro- hibitive if the network is large. However, Huang et al. [59] show that exact inference 27 is not necessary for convergence as long as we can guarantee perceptron updates only on ‘violations’, i.e. when the score of an assignment from the inference procedure is higher than the score for the ground truth labeling. This essentially means that for larger networks, it is not necessary to perform exact inference. Instead, any assign- ment whose score is greater that that of the ground-truth assignment can be used to update the model-parameters. Secondly, our structural features are based on triads of relationships but for most narratives, the character graphs are sparse (not fully connected) because summaries do not report the relationships/interactions between all pairs of characters. This means that for the vast majority of narratives in the dataset, the inference problem can be exactly solved. While training our structured perceptron model we utilize these two observations. 3.4 Features Our feature set, φ(x,y), consists of two types of features: linguistic and struc- tural. Our linguistic features identified the actions by the characters in the nar- rative and the narrator’s bias while describing those actions. We also obtained connotation [46], sentiment [69] and prior-polarity [110] of these words (identifying the features) when needed. The feature-extraction process in described in detail in Section 4.4. On the other hand, our structural features focused on configurations of rela- tionships among triads of characters. We briefly characterize the triadic features with our informal appellations for them in Figure 3.2. These features based on the 28 Structural Balance Theory [56, 20]. This theory, originally proposed in Social psy- chology in 1940s, analyzes signed triads of individuals and posits that certain triads are more likely and hence more prevalent in real networks than others. Specifically, triads with three mutual friends (‘Clique’ in Figure 3.2) or with two friends and a common enemy (‘Common Enemy’ in Figure 3.2) are more prevalent than the other two triads (‘Love triangle’ and ‘Truel’ in Figure 3.2). Davis [34] later proposed a variant of this theory, the Weak Structural Balance theory, which posits that only the triad with exactly two friends (‘Love triangle’ in Figure 3.2) is unlikely in a real network. The other three triads are plausible. These triads have been previously used to empirically analyze signed triads in social networks [66, 107]. In some domains, observed relations between entities can directly imply un- known relations among others dues to natural orderings. For example, temporal relations among events yield natural transitive constraints. For the current task, such constraints do not apply. While structural regularities like ‘a friend of a friend is a friend’ might be prevalent, these configurations are not logically entailed; and affinities for such structural regularities must be directly learnt from the observed data. Values of our structural features (Figure 3.2) consist of the number of such configurations in any assignment. The empirical affinities for such configurations, as reflected in corresponding weights can then be learnt from the data. 29 Figure 3.2: Triadic structural features. ‘+’ signs indicate cooperative and ‘–’ indicate adversarial relationships 3.5 Data We processed the CMU Movie Summary corpus, a collection of movie plot summaries from Wikipedia, along with aligned meta-data [11]; and set up an online annotation task using BRAT [98]. We use Stanford Core NLP pipeline to identify named entities (in particular person names) and coreferent mentions of those entities. Annotators could choose pairs of characters in a summary (screenshot of a short sample from the BRAT annotation interface shown in Fig 3.3), and charac- terize a relationship between them on an ordinal five-point Likert scale labeled as 1.‘Hostile’, 2.‘Adversarial’, 3.‘Neutral’, 4.‘Cooperative’ or 5.‘Friendly’. Annotators were asked to base their judgment of relationship characterization on the presented movie summary only. This resulted in a dataset of 153 movie summaries, consisting of 1044 character relationship annotations. For evaluation, we aggregated ‘hostile’ and ‘adversarial’ edge-labels, and ‘friendly’ 30 Figure 3.3: Sample annotation for a very short summary and ‘cooperative’ edge-labels to have two classes (neutral annotations were sparse, and ignored). Of these, 58% of the relations were classified as cooperative or friendly, while 42% were hostile or adversarial. The estimated annotator agreement (Cohen’s Kappa) for the collapsed classes on a small subset of the data was 0.89. 3.6 Evaluation In this section, we discuss quantitative and qualitative evaluation of our meth- ods. First, we make an ablation study to assess the relative importance of families of text-based features. We then make a comparative evaluation of our methods in recovering gold-standard annotations on a held-out test set of movie summaries. Finally, we qualitatively analyze the performance of the model, and briefly discuss common sources of errors. 31 Figure 3.4: Ablation study for text feature families 3.6.1 Feature ablation Figure 3.4 shows the 5-fold cross-validation performance of major feature fam- ilies of text features on the training set. We note that Frame-semantic features and adverbial connotations of character actions do not add significantly to model performance. This is perhaps because both these families of features were sparse. Additionally, frame-semantic parses were observed to have frequent errors in frame evocation, and frame element assignment. On the other hand, we observe that joint participation in actions (as agent or patient) is a strong indicator of cooperative relationships. In particular, incorporating these (‘Are Team’ ) features was seen to improve both precision and recall for the cooperative class; while not degrading recall for the non-cooperative class. Further, while ignoring sentiment and con- notation features for surrogate action features results in marginal degradation in 32 performance; the most significant features are seen to be sentiment and connotation features for actions where both characters occur in agent/patient roles (‘Acts To- gether’ features); and overall sentiment characterizations for words and phrases in spans of text between character mentions (span based ‘Lexical sentiment’ features). 3.6.2 Structured vs. unstructured models We now analyze the performance of our proposed models; and evaluate the significance of adding structural features to our models. In our experiments, we found the structured models to consistently outperform text-based models. We tune values of hyper-parameters, i.e. number of training epochs for the structured perceptron (10), through cross validation on training data. Table 3.1 compares the performance of the models on our held-out test set of 27 movie summaries (comprising about 20% of the all annotated character relations) using accuracy and averaged precision, recall and F-measure over the two classes. For the structured models, reported results are averages over 10 initializations. We observe that the structured perceptron model (Structured Model) outper- forms the text-only model trained with logistic regression (Unstructured Model). These results are consistent with our cross-validation findings, and demonstrate that structural features can substantially improve inference of character relations. Let us consider the affinities for structural features learnt by the model. Over 10 runs of the structured model, the average weights were: wclique = −2.79, wlovetriangle = −0.84, wcommonenemy = 10.26 and wtruel = −5.49. From the perspective of structural 33 Model Avgd P Avgd R Avgd F Acc Unstructured (Logistic Regression) 70.2 69.7 69.9 70.1 Structured 79.4 79.3 79.3 79.2 Table 3.1: Test set results for relation classification. balance theory (described in Sec. 3.4), the social configurations lovetriangle and truel are inherently unstable, whereas the others are stable. Hence one might have expected weights to follow this intuition. However, the learned weights make a significant departure from this conjecture. Specifically, the learnt affinity for the configuration lovetriangle seems higher than expected. This is unsurprising, how- ever, if we consider the domain of the data (movies), where it might be a common plot element. We also note that the ‘friend of a friend is a friend’ maxim is not supported by the feature weights (even though it is a stable configuration), and hence a model based on this as a hard transitive constraint could be expected to perform poorly. 3.6.3 Qualitative Discussion We observe that relation characterizations for character pairs are reasonable for most texts in the test set. Figure 3.5 shows labels inferred by the model for two well-known movies in the test set. We can see that the model correctly identifies all inter-character relationships. Error analysis revealed that mismatched coreference labelings are the most 34 Figure 3.5: Inferred relationships for movies ‘Titanic’ (1997) and ‘Ice Age’ (2002) common source of errors for the model. Secondly, in some cases, the text-based fea- tures mistakenly identify negative sentiments due to our coarse model of sentiment used in the feature-extraction process. For example, Figure 3.6 shows some examples of the relationships inferred for the movie ‘Smokin’ Aces 2: Assassins’ Ball’. The movie contained several characters, which yielded a large network. In the figure, we show the two relevant subgraphs of the inferred network for clarity. The model iden- tifies a non-cooperative relationship between Agents Baker and Redstone, which is incorrect. This happens because in the following sentence, the linguistic features pay attention to the negative connotation of ‘drag’ without understanding that Baker drags Redstone to keep him safe: ‘Baker drags the wounded Redstone to the “spi- der trap” ... used to safeguard people’. In the same figure, the model incorrectly identifies a cooperative relationship between Lester and McTeague. This happens because in the following sentence from the movie’s summary, the lexical features include a few words with strong positive connotations like father, children and fam- ily biasing the model towards identifying a cooperative relationship between them: ‘These include Ariella Martinez, ... Finbar McTeague who is also known as ‘The 35 Figure 3.6: Two relevant subsets of the inferred relationships for movie: ‘Smokin’ Aces 2: Assassins’ Ball’ (2010). The model incorrectly learns a cooperative rela- tionship between Lester and McTeague, and a non-cooperative relationship between Agents Baker and Redstone. Surgeon’, who brutally tortures his victims; the Southern Tremor family consisting of father Fritz and children; Lester...’. 3.7 Conclusion In this chapter we addressed the problem of identifying inter-character relation- ships in movie summaries. While making a simplifying assumption that relationships can be of only two types, we demonstrated that the task of relationship modeling can benefit from including linguistic cues as well as structural cues. Specifically, we empirically demonstrated that this task benefits from viewing inter-character rela- tions in context of their relationships with other characters in the narrative. Our experiments on a dataset of movie summaries from Wikipedia demonstrate how a structured model that incorporates this context performs better than a text based 36 model. While our test-bed was movie summaries, in the future, the framework could potentially apply to other domains of texts with social narratives, such as news stories. The framework could also be extended to finer grained relation cate- gorizations beyond coarse sentiment connotations, as well as comparing narratives based on their interpersonal relationships. Conceptually, a natural extension would be to use predictions about character relations to infer subtle character attributes such as agenda, intentions and goals. 37 Chapter 4: Modeling Evolving Relationships Between Literary Char- acters Work described in this chapter was published in AAAI 2016 [28] Like the previous chapter, in this chapter we model polarities of inter-character relationships, i.e. we again assume that relationships are of two types: cooperative and non-cooperative. However, we now assume that inter-personal relationships evolve with the progress of the narrative, and present the problem of modeling relationships as a structured prediction task. We demonstrate that for this task, it is important to employ structured models that are capable of modeling relationship between characters at any point in the narrative in context of the history of their relationship. 4.1 Problem Motivation and Definition Existing character-centric approaches to understanding texts [105, 11, 12], model each character as assuming a static narrative role, and these roles define the relationships between characters and also govern their actions through out the narrative. While such a simplified assumption provides a good general overview of the narrative, it is not sufficient to explain all events in the narrative. We 38 believe that in most narratives, relationships between characters are not static but evolve as the novel progresses. For example, consider the relationship between Tom and Becky depicted in Figure 4.1 which shows an excerpt from the summary of The Adventures of Tom Sawyer by Mark Twain 1. For most of the narrative, the characters are participants in a romantic relationship, which explains most, but not all, of their mutual behavior. We can observe that their relationship changed during the narrative. They presumably start as lovers (sentence S1 in the Figure), which is indicated by (and explains) becoming engaged. The relationship sours when Tom reveals his previous love interest (S2 and S3). However, later in the narrative they reconcile (S4 and S5). A model that assumes a fixed romantic relationship between characters would fail to explain their behaviors during the phase when their relationship was under stress. Therefore, in this chapter we assume that the relationship between characters evolves with the progress of the novel and address the task of learning relationship sequences. For instance in Figure 4.1, the relationship between Tom and Becky can be represented by the sequence 〈cooperative, non-cooperative, cooperative〉. 1SparkNotes Editors. SparkNote on The Adventures of Tom Sawyer. SparkNotes LLC. 2003. http://www.sparknotes.com/lit/tomsawyer/ 39 Figure 4.1: Sample sentences from a narrative depicting evolving relationship be- tween characters: Tom and Becky. The relationship changes from cooperative (+) to non-cooperative (-) and then back to cooperative (+). ‘. . . ’ represents text omitted for brevity. 4.2 Role of Context To address this problem, we model relationship as a structured variable – a sequence of variables each denoting relation state at a point in the narrative. The narrative fragment of interest for us is represented by the set of sentences in which the two characters of interest appeared together, arranged in the order of occurrence in the narrative. More formally, given the narrative text in form of a sequence of sentences, x = 〈x1, x2, . . . , xl〉, we address the problem of segmenting it into non-overlapping and semantically meaningful segments that represent continuities in relationship status. Each segment is labeled with a single relationship status rj ∈ {−1,+1} hence yielding a relationship sequence r = 〈r1, r2, . . . , rk〉k ≤ l. Like Chapter 3, we take a coarse-grained view, and model relation states as binary variables (indicating cooperative/non-cooperative relation). 40 Our task requires us to assign a relationship state to each sentence in the narrative. However, instead of viewing these sentences independently, it is natural to view them as a sequence and solve the problem using a structured model. We use a Markov model, which is capable of viewing the historical context while deciding the relationship state for a sentence. 4.3 Model Our approach utilizes a second order Markovian latent variable model for segmentation that is embedded in semi-supervised framework to utilize varying levels of labeling in the data. We now describe our segmentation model and the semi- supervised framework in detail. 4.3.1 Segmentation Model This model forms the core of our framework. It assumes that each sentence in the sequence is associated with a state that represents its relationship status. While making this assignment, it analyzes the content of individual sentences using a rich feature set and simultaneously models the flow of information between the states by treating the prediction task as a structured problem. Specifically, our approach employs a second order Markovian segmentation model, which assumes that each sentence in the sequence is associated with a state that represents its relationship status. 41 It collectively maximizes the following linear scores for individual sequences: score = ∑ i wφ(x, yi, yi−1, yi−2)] (4.1) where x is the input sequence and yi denotes the latent state assignment of its i th sentence to a relationship segment. Individual yis collectively yield the relationship sequence, r (by collapsing consecutive occurrences of identical states). φ represents features at the ith sentence that depend on the current state, yi, and the previous two states, yi−1 and yi−2, and w represents their weights. The second order Markov assumption of our features ensures continuity and coherence of behavior of the two characters within individual relationship segments. The Markovian segmentation model proposed above is trained using an av- eraged structured perceptron [31]. For inference, it uses a Viterbi based dynamic programming algorithm. The extension of Viterbi to incorporate second order con- straints is straightforward. We replace the reference to a state (in the state space |Y |) by a reference to a state pair (in the two fold product space |Y | × |Y |). Note that this precludes certain transitions while computing the Viterbi matrix, viz.: if the state pair at any point in narrative, t, is of the form (si, sj), then the set of state pair candidates at t + 1 only consists of pairs of the form (sj, sk). Incorporating these constraints, we compute the Viterbi matrix and obtain the highest scoring state sequence by backtracking as usual [96]. 42 4.3.2 Semi-supervised Framework Given the nature of the task, we acknowledge that obtaining a huge dataset of labeled sequence can be expensive. However, it might be more convenient to obtain partially labeled data especially in cases in which only a subset of the sentences of a sequence have an obvious relationship state membership. We, therefore, propose to embed the above mentioned segmentation model in a semi-supervised framework, which assumes that the training dataset consists of two types of labeled sequences: fully labeled, in which the complete state sequence is observed yi∀i ∈ {1 . . . l} and partially labeled, in which some of the sentences of the sequence are annotated with yi such that i ⊂ {1 . . . l}. Algorithm 2 Training the semi-supervised framework 1: Input: Fully F and partially P labeled sequences; and T : number of iterations 2: Output: Weights w 3: Initialization: Initialize w randomly 4: for t : 1 to T do 5: yˆj = arg maxyj [wt · φ(x,y)j] ∀j ∈ P such that yˆj agrees with the partial annotated states (ground truth). 6: wt+1 = StructuredPerceptron({(x, yˆ)j} ∈ {P, F}) 7: end for This framework uses a two step algorithm (Algorithm 2) to iteratively refine feature weights, w, of the segmentation model. In the first step, it uses existing weights, wt, to assign state sequences to the partially labeled instances. For state 43 assignment we use a constrained version of the Viterbi algorithm that obtains the best possible state sequence that agrees with the partial ground truth. In other words, for the annotated sentences of a partially annotated sequence, it precludes all state assignments except the given ground truth, but segments the rest of the sequence optimally under these constraints. In the second step, we train the aver- aged structured perceptron based segmentation model, using the ground truth and the state assignments obtained in the previous step, to obtain the refined weights, wt+1. 4.4 Features This section describes the features used by our model. Pre-processing We first pre-processed the text of various novel summaries to obtain part-of- speech tags and dependency parses, identify major characters and perform char- acter names clustering (assemble ‘Tom’, ‘Tom Sawyer’, etc.) using the Book-nlp pipeline [12]. However, the pipeline, designed for long text documents involving multiple characters, was slightly conservative while resolving co-references. We aug- mented its output using coreferences obtained from the Stanford Core NLP sys- tem [73]. We then obtained a frame-semantic parse of the text using Semafor [33]. We also obtained connotation [46], sentiment [69] and prior-polarity [110] of words when needed during feature extraction. Finally, given two characters and a sequence of pre-processed sentences in 44 which the two appeared together, we extracted the following features for individual sentences. 4.4.1 Content features These features help the model in characterizing the textual content of the sentences. They are based on the following general template which depends on the sentence, xj, and its state, yj: φ(xj, yj) = α if the current state is yj ; 0 otherwise where, α ∈ F1 to F33, and F1 to F33 are defined below. Figure 5.2 shows examples of these feature categories for a sample sentence. 1. Actions based: These features are motivated by Vladimir Propp’s Structuralist narrative theory [89] based insight that characters have a ‘sphere of actions’. We model the actions affecting the two characters by identifying all verbs in the sentence, their agents (using ‘nsubj’ and ‘agent’ dependency relations) and their patients (using ‘dobj’ and ‘nsubjpass’ relations). This information was extended using verbs conjunct to each other using ‘conj’. We also used the ‘neg’ relation to determine the negation status of each verb. We then extracted the following features: • Are Team [F1]: This binary feature models whether the two characters acted as a team by indicating if the two characters were agents (or patients) of a verb together. • Acts Together [F2-F7]: These features explicitly model the behavior of the two characters towards each other using verbs for which one of the characters was the agent and the other was patient. These six numeric features look at 45 positive/negative connotation, sentiment and prior-polarity of the verbs while considering their negation statuses (see Pre-processing for details). • Surrogate Acts Together [F8-F13]: These are high-recall features that analyze actions for which a character was an implicit/subtle agent or patient. For example, Tom is not the direct patient of shunned in S3 in Figure 4.1. We define a set of six surrogate features that, like before, consider connotations, sentiments and prior-polarities of verbs (considering negation). However, only those verbs are considered which have one of the characters as either the agent or the patient, and occur in sentences that did not contain any other character apart from the two of interest. 2. Adverb based: These features model narrator’s bias in describing characters’ actions by analyzing the adverbs modifying the verbs identified in ‘Action based’ features (using ‘advmod’ dependency relations). For example, in S4 in Figure 4.1 the fact that Tom nobly accepts the blame provides an evidence of a positive rela- tionship. • Adverbs Together [F14-F19] and Surrogate Adverbs Together [F20-F25]: Six numeric features measuring polarity of adverbs modifying the verbs considered in ‘Acts Together’ and ‘Surrogate Acts Together’ respectively. 3. Lexical [F26-27]: These bag-of-words style features analyze the connotations of all words (excluding stop-words) occurring between pairs of mentions of the two characters in the sentence. E.g. in S5 in Figure 4.1 the words occurring between the 46 Type Frame[Frame-elements] Negative killing [killer, victim] attack [assailant, victim] Positive forgiveness [judge, evaluee] supporting [supporter, supported ] Ambiguous cause bodily experience [agent, experiencer ] friendly or hostile [side 1, side 2, sides ] Relationship kinship [alter, ego, relatives ] subordinates superiors [superior,subordinate] Table 4.1: Frame samples used by ‘Semantic Parse’ features. pair of mentions the characters, Tom and Becky, are “goes on a picnic to McDougal’s cave with” (stopwords included for readability). 4. Semantic Parse based: These features incorporate information from a Framenet- style semantic parse of the sentence. To design these features, we manually compiled lists of positive (or negative) frames (and relevant frame-elements) depending on whether they are indicative of positive (or negative) relationship between partici- pants (identified in the corresponding frame-elements). We also compiled a list of ambiguous frames like ‘cause bodily experience’ for which the connotation was de- termined on-the-fly depending on the connotation of lexical unit at which that frame 47 was evoked. Lastly, we had a list of ‘Relationship’ frames that indicated familial or professional relationships. Table 4.1 shows examples of various frame-types and relevant frame-elements. The complete lists are shown in Appendix A 2. Based on these lists, we extracted the following two types of features: • Frames Evoked [F28-F30]: Three numeric features counting number of pos- itive, negative and ‘relationship’ frames evoked such that at least one of the characters belonged to the relevant frame-element. • Surrogate Frames Evoked [F31-F33]: Three features counting number of pos- itive, negative and ‘relationship’ frames evoked. 4.4.2 Transition features While content features assist the model in analyzing the text of individual sen- tences, these features enable it to remember relationship histories, thus discouraging it from changing relationship states too frequently within a sequence. • φ(yj, yj−1, yj−2) = 1 if current state is yj and the previous two states were yj−1, yj−2; 0 otherwise • φ(yj, yj−1) = 1 if current state is yj and the previous state was yj−1; 0 otherwise • φ(y0) = 1 if state of the first sentence in the sequence is y0; 0 otherwise 2We do not claim these lists to be exhaustive 48 4.5 Data We have used the following two datasets in our experiments both of which are based on novel summaries. SparkNotes: This dataset consists of a collection of summaries (‘Plot Overviews’) of 300 English novels extracted from the ‘Literature Study Guides’ section of Spar- kNotes. 3 We pre-processed these summaries as described in Section 4.4. Thereafter, we considered all pairs of characters that appeared together in at least five sentences in the respective summaries and arranged these sentences in order of appearance in the original summary. We refer to these sequences of sentences as simply a sequence. We considered the threshold of 5 sentences to harvest sequences long enough to manifest ‘evolving relationships, but also sufficiently many to allow learning. This yielded a collection of 634 sequences consisting of a total of 5542 sentences. For our experiments, we obtained annotations for a set of 100 sequences. An- notators were asked to read the complete summary of a novel and then annotate character-pair sequences associated with it. Sentences in a character-pair sequence were labeled as cooperative (when the two characters supported each others ac- tions/intentions or liked each other) or non-cooperative (otherwise). Annotators had access to the complete summary throughout the annotation process. They were required to fully annotate the sequences whenever possible and partially annotate them when they couldn’t decide a relationship status for some of the sentences in the sequence. It was permissible to annotate a sequence with all cooperative or all 3http://www.sparknotes.com/lit/ 49 non-cooperative states. In fact, that happened in 70% of the sequences. The dataset thus obtained contained 50 fully annotated sequences (402 sentences) and 50 par- tially annotated sequences (containing 390 sentences, of which 201 were annotated). Of all annotated sentences, 472 were labeled with a cooperative state. AMT: We considered another dataset only for evaluating our model. This dataset [74] was collected independently by another set of authors using Amazon Mechanical Turk. 4 The annotators were shown novel-summaries and a list of char- acters appearing in the novel. Given a pair of characters, they annotated if the relationship between them changed during the novel (binary annotations). They were also asked other questions, such as the overall nature of their relationship, etc., which were not relevant for our problem. There was some overlap between the novel summaries used by the two datasets described here, due to which, 62 pairs of char- acters from this dataset could be found in the SparkNotes dataset. This dataset of 62 pairs can be viewed as providing additional binary ground truth information and was used for evaluation only after training on SparkNotes data. The relationship was annotated as ‘changed’ (positive class) for 20% of these pairs. 4.6 Evaluation We now describe our baselines and evaluation results. Baselines and Evaluation Measures Our primary baseline is an unstructured model that trains flat classifiers using 4https://www.mturk.com/ 50 Model Type Model Name P R F ED Unstructured Decision Tree 67.89 75.60 69.86 0.98 Logistic Regression 72.33 77.51 72.94 1.06 Structured Order 1 Model 74.32 73.16 73.70 0.9 Order 2 Model 76.58 76.97 76.76 0.66 Table 4.2: Cross validation performances on SparkNotes data. The second order model based framework outperforms the one that uses a first order model and the unstructured baselines. the same content features as used by our framework but treats individual sentences of the sequences independently. We compare our model with this baseline to validate that relationship sequence prediction is a structured problem, which benefits from remembering intra-novel history of relationship between characters. We also compare our framework, which employs a second order Markovian segmentation model, with an identical framework, with a first order Markovian segmentation model. This baseline is included to understand the importance of remembering a longer history of relationship between characters. Also, since a higher order model can look further back, it will discourage frequent changes in relationship status within the sequence more strongly. For evaluation, we use two different measures comparing model-performances at both sentence and sequence levels. Our first measure accesses the goodness of 51 the binary relationship state assignments for every sentence in the sequence using averaged Precisions (P), Recalls (R) and F1-measures (F) of the two states. The second evaluation measure, mimics a more practical scenario by evaluating from the perspective of the predicted relationship sequence, r, instead of looking at individual sentences of the sequence. It compares the ‘proximity’ of the predicted sequence to the ground truth sequence using Edit Distance and reports its mean value (ED) over all test sequences. A better model will be expected to have a smaller value for this measure. 4.6.1 Evaluation on the SparkNotes dataset Table 4.2 compares 10-fold cross validation performances of our second order Semi-supervised Framework (Order 2 Model) with its first order counterpart (Order 1 Model) and two unstructured baselines: Decision Tree and Logistic Regression. Since the performance of the semi-supervised frameworks depends on random ini- tialization of the weights, we report are mean values over 50 random restarts in the table. The number of relationship states, |Y |, was set to be 2 to correspond to the gold standard annotations. The table shows that the framework with the first order Markov model yields slightly better performance (higher averaged F-measure and lower mean Edit Distance) than the unstructured models. This hints at a need for modeling the information flow between sentences of the sequences. The further performance improvement with the second order model emphasizes this hypothesis and also demonstrates the benefit of remembering longer history of characters while 52 Model Type Model Name P R F Unstructured Decision Tree 68.18 43.55 48.54 Logistic Regression 71.93 46.77 51.48 Structured Order 1 Model 72.36 50.64 52.52 Order 2 Model 71.62 56.45 60.76 Table 4.3: Performance comparison on the AMT dataset. The second order model based framework outperforms the one that uses a first order model and the unstruc- tured models. making relationship judgments. 4.6.2 Evaluation on the AMT dataset Table 4.3 compares performances of the various models on the AMT dataset using averaged Precision, Recall and F measures on the binary classification task of change prediction. The problem setting, input sequences format and the training procedure for these models are same as above. However, the models produce struc- tured output (relationship sequences) that need to be converted to the binary output of change prediction task. We do this simply by predicting the positive class (change occurred) if the outputted relationship sequence contained at least one change. We can see that while the performance of the framework using the first order model is similar to that of the baseline Logistic Regression, the second order model shows a 53 considerable improvement in performance. A closer look at the F measures of the two classes revealed that while the performance on the positive class was similar for all the models (except Decision Tree which was considerably lower), the perfor- mance on the negative class (no change) was much higher for the structured models (56.0 for Logistic Regression and 57.4 and 67.8 for the First and Second order mod- els respectively). This might have happened because the unstructured model looks at independent sentences and cannot incorporate historical evidence so it is least conservative in predicting a change, which might have resulted in low recall. The structured models, on the other hand, look at previous states and can better learn to make coherent state predictions. 4.6.3 Feature Ablation Figure 4.2 plots 10-fold cross validation F-measure to study the predictive importance of various feature-families using Logistic Regression on the SparkNotes data. The black bar (labeled ‘All’) represents the performance using the complete feature set and the rest of the bars represent the scenario when various features- families are incrementally omitted. We can note that the ‘Are Team’ and ‘Acts Together’ features seem to be very informative as removing them degrades the per- formance remarkably. On the other hand, the ‘Adverbs Together’ feature seems to be least informative, possibly because it was sparsely populated in our dataset. Nevertheless we can conclude that, in general, removing any feature-family degrades model’s performance indicating their predictive utility. 54 Figure 4.2: Ablation results on SparkNotes dataset. All represents performance with full feature-set and rest of the bars indicate performance with incrementally removing various feature-families. 4.6.4 Case Study We now use our framework to gain additional insights into our data. To this end, we use the framework to make predictions about various character pairs from summaries of the seven Harry Potter novels by J. K. Rowling. As before, only those pairs were considered for which the two characters appeared together in at least five sentences and none of these pairs were manually annotated. We then clustered the various pairs according to the Edit-distance based similarity between their relation- ship sequences. Table 4.4 shows sample pairs for three such clusters. Note that some of the character pairs appear more than once because several characters are shared across the seven books. While performing this clustering no information other than 55 Cluster 1 Cluster 2 Harry, Dobby [Chamber of Secrets ] Ron, Hermione [Deathly Hallows ] Harry, Dumbledore [Half-Blood Prince] Harry, Ron [Deathly Hallows ] Hagrid, Harry [Prisoner of Azkaban] Sirius, Ron [Prisoner of Azkaban] Ron, Harry [Order of the Phoenix ] Sirius, Harry [Prisoner of Azkaban] Harry, Hermione [ Deathly Hallows ] Hermione, Sirius [ Prisoner of Azkaban] Cluster 3 Harry, Snape [Prisoner of Azkaban] Draco, Harry [Half-Blood Prince] Voldemort, Dumbledore [Half-Blood Prince] Voldemort, Dumbledore [Deathly Hallows ] Harry, Voldemort [Half-Blood Prince] Table 4.4: Sample character pairs (and book titles) from the clusters obtained from the summaries of the Harry Potter series. Pairs in clusters 1 and 3 had a cooperative and non-cooperative relationship throughout the novel (respectively). Cluster 2 contains pairs for which the relationship became non-cooperative once in the novel but then finally became cooperative. 56 the relationship sequence itself (such as character identities) was used. Also, the pairs are unordered. Cluster 1 consists of pairs whose relationship remained mostly cooperative throughout the novel. This includes relationships of Harry with his friends Ron and Hermione, benefactors Dumbledore, Hagrid and Dobby. Cluster 2 consists of pairs like 〈 Ron, Hermione〉 and 〈 Harry, Ron〉 from ‘Harry Potter and the Deathly Hallows’. Their relationships were similar in the sense that the three characters started out as friends and go on a quest. However, because of their initial failures Ron abandons the other two, but reunites with them towards the later part of the novel. Hence Ron’s relationship with each of the other two was both cooperative and non-cooperative during the course of the novel. Similarly, Cluster 3 consists of character pairs, which had a non-cooperative relationship for most of the novel. However, a more careful examination revealed that in some of these pairs the model assigned a cooperative status to a few sentences in the beginning. For example, for Voldemort and Dumbledore in ‘Harry Potter and the Half-Blood Price’, when Dumbledore (along with Harry) tries to learn more about Voldemort. A human reader, who has access to the complete text of the summary as well as context from previous novels, can understand that they learn about him to fight him better and hence the reader can infer that the relationship is non-cooperative. However, in our setting, we ignore all sentences except those in which the two characters appear together, hence depriving the model of the valuable information present in between the input sentences. We believe that a more sophisticated approach that understands the complete narrative text (instead of 57 sporadic sentences about the two characters of interest) will make better inferences. 4.7 Conclusion In this chapter we modeled the dynamic or evolving nature of inter-personal relationships. We analyzed summaries of novels to extract relationship trajecto- ries that describe how the relationship evolved. We addressed this as a sequence prediction task, and presented a semi-supervised framework that uses a structured segmentation model. Our model makes second-order Markov assumption to remem- ber the ‘history’ of characters and analyzes textual contents of summaries using rich semantic features that incorporate world knowledge. We demonstrated the utility of including the long-range historical context in addition to textual cues by comparing our model with an unstructured model that treats individual sentences independently and also with a lower order model that remembers shorter history. Future work could experiment with higher order and semi-Markov models. Also, this work treats different character pairs from the same novel independently and does not attempt to understand the complete text of the narrative. Future work could explore a more sophisticated model that exploits intra-novel dynamics while predicting relationships. An important contribution this work is identifying the evolutionary nature of relationships. Further work in this direction could be used to answer questions like “What kind of novels have happy endings?”, “Are there general narrative templates of relationship evolution between the protagonist and his/her lover?”, etc. 58 Chapter 5: Unsupervised Modeling of Evolving Inter-character Re- lationships Work described in this chapter is submitted for review. The previous two chapters make a limiting assumption on types of relation- ships. They assume that relationships are only of two types: cooperative or non- cooperative. In this chapter we relax this condition and assume that relationships can be of multiple types. We introduce data-driven models that can learn these types automatically. Like Chapter 4, we model evolving inter-character relation- ships with a focus on novel summaries. While modeling the dynamic nature of these relationships we incorporate historical context by solving this as a structured prediction problem. At the same time, while taking a decision about nature of rela- tionship between the characters of interest at various points within a narrative, we incorporate global context about the overall nature of relationships between them. 5.1 Problem Motivation and Definition Most existing methods to analyze inter-personal relationships in narratives, in- cluding those presented in the previous chapters in this dissertation, are inadequate in one of more of the following three ways: (i) Firstly, they assume a static relation- 59 Figure 5.1: Excerpts from the summary of the novel ‘Harry Potter and the Prisoner of Azkaban’ depicting non-stationary nature of relationship between Harry Potter and Sirius Black. Their relationship changes from hostility to cooperative alliance. ship between characters within a narrative, represented by a single variable (ii) Sec- ondly, they characterize relationships by coarse sentiment polarities, e.g., friendly vs. adversarial, which conflates distinct semantic categories and ignores other aspects such as power (iii) Thirdly, they require expensive and resource-intensive manual annotation. The work presented in this chapter attempts to address these limita- tions. The shortcomings of such methods in understanding a narrative are apparent. Figure 5.1 shows excerpts from the summary of the novel ‘Harry Potter and the Prisoner of Azkaban’ focusing on the relationship between ‘Harry Potter’ and ‘Sirius Black’. We can see that in the text labeled P1 in the figure, Harry believes that Sirius is responsible for his parents’ deaths and so tries to kill him. Later (P2) he learns that Sirius is innocent and they both protect each other, and hence their relationship changes from that of enmity on Harry’s part to cooperative alliance. Clearly, describing their relationship by a single category will not explain all of their 60 actions. For instance, if we assume that they are allies then we cannot explain why Harry tried to kill Sirius initially. Thus, like Chapter 4, there is a need to model inter-character relationships as dynamic variables that evolve with the progress of the narrative. Also, real-world relationships are nuanced with facets such as family, romance, formality/informality, supervision/subordination, etc. and an ideal model would allow flexibility to express these. In this chapter we present a framework for unsupervised modeling of inter- character relationships from unstructured text with a focus on novel summaries. The input to our models is the text of a narrative summary and a pair of characters appearing in the narrative. We address this problem by extracting a vector repre- sentation for each of the sentences in which the two characters of interest appear together. The vector representation of the individual sentences attempts to capture cues about the relationship between the two characters in that sentence. These sen- tences (and their vector representations) are arranged in their order of appearance in the narrative. Finally, each vector is associated with a latent variable, representing the relationship state between the two characters at that point in the narrative. 5.2 Role of Context In this chapter we present three models, which address this problem as an unsupervised structured prediction task. Given a narrative (presented as a sequence of sentences) and a pair of characters appearing in the narrative, our models learn a sequence of latent variables representing their (dynamic) relationship over the course 61 of the narrative. Our models make Markovian assumption to capture the ‘flow of information’ or the historical context between individual sentences. While deciding the relationship state at any point in the narrative, the Markovian assumption enables our models to look at not just the contents of the current sentence, but also the previous few relationship states. However, it may be argued that inferring relationship state requires a longer context, or a global perspective. In other words, judging the relationship state at any juncture needs consideration of not only the current sentence, but also the overall nature of the relationship between the characters. E.g., in the novel titled ‘Harry Potter and the Half Blood Price’ by J. K. Rowling, Harry, the protagonist, investigates into the history of the villain, Voldemort, and ‘learns more’ about him. While ‘learning more’ by itself does not give us much insight into their relationship, knowing that they are enemies in general, tells us that Harry is learning more about Voldemort to fight him better and that the relationship at this point is that of animosity. Therefore, we introduce a third model, Globally Aware GHMM, which incorporates short historical context via Markovian assumptions as well as longer global context using two distinct local and global components. 5.3 Feature-vectors Extraction Given a narrative text and two characters appearing in it, our approach aims to represent their relationship as a sequence of latent variables. The sentences in the narrative that are of interest to us are those in which the two characters appear 62 together. These sentences have a natural order of their appearance in the narrative, yielding a sequence, s = 〈s1, s2 . . . sT 〉. We represent this sequence of sentences as a sequence of D-dimensional feature-vectors f = 〈~f1, ~f2 . . . ~fT 〉 such that ~ft ∈ RD. We provide this sequence of feature-vectors representing the interactions between the two characters as input to our models. Our models assign each feature-vector, ~ft, to a discrete latent relationship state, rt ∈ {1, 2, . . . R}, thus outputting a sequence of latent relationship states, r = 〈r1, r2 . . . rT 〉. This section describes the process of extracting a sequence of feature-vectors f = 〈~f1, ~f2 . . . ~fT 〉 from a sequence of sentences s = 〈s1, s2 . . . sT 〉. We pre-processed the narratives using the BookNLP pipeline [12], Stanford’s CoreNLP system [73], and frame-semantic parser [33] as described in Section 4.4. We then extracted the following sets of words from each sentence. These sets are motivated by the feature-extraction process discussed in Section 4.4 albeit without the various connotation lexicons. However, we briefly summarize these sets here again for the sake of clarity and encourage the reader to refer to the previous chapter for more details. Actions: We represent the relationship between characters using their actions, especially to each other. We capture the notion of actions by extracting the set of verbs, which had one character as an agent, and the other as a patient. For example in the sample sentence in Figure 5.2, the action word is ‘asked’. Surrogate Actions: The above set of verbs can be affected by limitations of the NLP pipeline. For example, in the sentence in Figure 5.2, ‘Jim’ is the implicit agent of ‘confronting’ ‘Maria’. To include such cases, we extract another set of verbs, which 63 Sentence: After confronting Maria, Jim furiously asked her to end the friendship. Actions: asked Surrogate Actions: confronting Lexical: furiously asked Semantic: friendship, ‘personal relationships’ Figure 5.2: Various relationship feature-sets extracted for a sample sentence depict- ing relationship between Jim and Maria. had either of the two characters as the agent or patient, provided the sentence did not contain a mention of another character. Lexical: This bag-of-words set consists of all words (except stopwords) that ap- peared between pair of mentions of the two characters. Example in Figure 5.2. Frame-semantic: This set makes use of the frame-sematic parse of the sentence and the frame-polarity lexicon, which contains a list of frames indicative of rela- tionship status of the frame-elements (see Section 4.4 and Appendix A). This set includes all frames (and the tokens at which they were fired), included in the above mentioned lexicon, fired for at least one of the characters as a frame-element. For example in Figure 5.2, a ‘personal relationships’ frame is evoked at the token ‘friend- ship’. Hence, this set will include the words ‘friendship’ and ‘personal relationships’. After extracting these sets of words from individual sentences, we obtain a vector representation, ~ft ∈ RD, for each sentence, st, by averaging the vector space embeddings [76] of the individual words in the union of these sets (motivated by the additive model of vector compositionality [78]). 64 5.4 Learning Relationship Sequences Given the feature vectors as input, f = 〈~f1, ~f2 . . . ~fT 〉, we now describe our models which learn the relationship sequence, r = 〈r1, r2 . . . rT 〉. Our first approach, GHMM – a Gaussian Hidden Markov Model, uses a non-Bayesian Hidden Markov Model with Gaussian Emissions. The hidden states comprise of relationship states and vector representation of sentences form the observations. Our second approach, Penalized HMM, extends GHMM by smoothening the relationship sequences and discouraging frequent changes in relationship states within a sequence. Finally, our last approach, Globally Aware GHMM, attempts to simulate the intuition of a global belief about the nature of relationship between the characters, while analyzing the individual sentences of the sequence. 5.4.1 GHMM Our first approach consists of a Hidden Markov Model with Gaussian Emis- sions, which generates the feature-vector sequence as: For every vector, ~ft∀t ∈ {1, 2, 3 . . . T}: 1. If t = 1, choose r1 ∼ Categorical(pi) 2. If t > 1, choose rt ∼ Categorical(φrt−1) 3. Emit vector ~ft ∼ N (µrt ,Σrt) 65 where, pi is the start state probabilities – an R-dimensional probability distri- bution indicating the probability of starting in a given state. Also, φrt−1 represents the transition probabilities, such that φij is the probability of transitioning from state i to state j, and ∑R j=1 φij = 1. Finally, it is assumed that the vectors belong- ing to a state r are normally distributed with mean µr and covariance, Σr. This model thus defines the joint distribution over a sequence of feature-vectors as: p(f , r) = T∏ t=1 p(rt|rt−1)p(~ft|rt) (5.1) where, p(rt|rt−1) and p(~ft|rt) are obtained from φrt−1rt and the Normal distribution with parameters µrt and Σrt . We use the Baum-Welch algorithm to fit the various parameters of this model. 5.4.2 Penalized GHMM In practice GHMM resulted in highly fluctuating relationship sequences. While this might be a good feature for traditional sequence modeling tasks like POS tag- ging, real-world relationship sequences are smooth and tend to remain consistent over long parts of a narrative. We, therefore, introduce a more domain-specific model, Penalized GHMM, which is similar to GHMM, except that in the generative process, every time the model makes a transition from state i to state j, it incurs a penalty, ρij. ρij =  1, if i = j , otherwise (5.2) 66 This model defines the joint distribution over a sequence of feature-vectors as: p(f , r) = T∏ t=1 p(rt|rt−1)ρrt−1,rtp(~ft|rt) (5.3) The parameter estimation process for this model is similar to that of GHMM. 5.4.3 Globally Aware GHMM The above models are local in nature, in the sense that at any point in the sequence, t, the relationship state, rt, depends only on the previous state, rt−1, and the emitted features, ~ft. However, as discussed in Section 5.2, inferring relationship state requires a more global perspective, which is aware of the overall nature of the relationship between the characters. To incorporate this behavior, we introduce a third model, Globally Aware GHMM. This model makes a decision about the current relationship state, rt, after weighing in information from a local and a global component using a choice variable, ct ∈ {0, 1}. The local component uses the Penalized GHMM style transitions to de- termine the current relationship state. Whereas, the global component (represented by θ) uses a logistic regression model based on a global feature-set, ~F , extracted from the whole sequence, s = 〈s1, s2 . . . sT 〉. This model defines the distribution over a sequence as: p(f , r) = T∏ t=1 [γ · p(rt|rt−1) · ρrt−1,rt + (1− γ) · θ(rt|~F )] · p(~ft|rt) (5.4) Here γ is the model’s preference towards the local model (γ = p(c = 1)) and 67 the global component, θ, is modeled as: θ(r|~F ) = exp( ~wr · ~F )∑R r′=1 exp( ~wr′ · ~F ) (5.5) where, ~F is the global feature-set and ~wr are the weights corresponding to the relationship state r that are learned during training. rt rt+1 ct ct+1 ~ft ~ft+1 w ~F γ φ pi ~µ Σ R R×R R R t t+1 L Figure 5.3: Diagram for the Globally Aware GHMM (for 2 sentences of a sequence). Feature-vector representations of individual sentences, ~ft, and that of the complete se- quence (global), ~F , are observed. Relationship states, rt and global versus local choices, ct are hidden. γ is the model’s preference towards the local component. ~w represents the weights of the global component (with a logistic regression form), and φ and pi represent the local component (transition and start-state probabilities respectively). Penalty ρ is not shown for clarity. ~µ and Σ are the parameters of Normal distribution. L is the number of sequences in the dataset and R is the number of relationship states. 68 Figure 5.3 pictorially describes our model and its generative story can be de- scribed as follows: For every vector, ~ft∀t ∈ {1, 2, 3 . . . T}: 1. Toss a choice variable, ct ∼ Bernoulli(γ). 2. If ct = 0, choose rt ∼ θ(r|~F ) 3. If ct = 1 & t = 1, choose r1 ∼ Categorical(pi) 4. If ct = 1 & t > 1, choose rt ∼ Categorical(φrt−1) · ρrt−1rt 5. Emit vector ~ft ∼ N (µrt ,Σrt) Training: Globally Aware GHMM is defined using its parameters, λ = (pi,φ,µ,Σ,w,γ). We train the model using EM. In each step of the process, let λ represent the current model and λ′ represent a candidate model. Our goal is to make pλ′(f) > pλ(f). It can be shown that this is equivalent to maximizing the following auxiliary function: Q(λ, λ′) = ∑ l ∑ r ∑ c pλ(r l, cl|f l) log pλ′(rl, cl, f l) (5.6) where, l is the index over various sequences in the dataset (L in number) and xl represents the variable, x, for the lth sequence. In the above equation, p(r, c, f) for a sequence is modeled as: p(r, c, f) = T∏ t=1 {γ · p(rt|rt−1) · ρrt−1,rt}δ1(ct) {(1− γ) · θ(rt|~F )}δ0(ct) · p(~ft|rt) (5.7) where, δa(x) is the Kronecker delta function which takes the value of 1 whenever 69 x = a and 0 otherwise. In the E-step, we use the scaled Forward-backward algorithm to compute scaled forward (αrc(t)) and backward (βr(t)) probabilities for a sequence, l, as: αlrc(t) = p(r l t = r, c l t = c|~f l1, ~f l2 . . . ~f lt , λ) (5.8) βlr(t) = p( ~f lt+1, ~f lt+2 . . . ~f l T l |rlt = r, λ) (5.9) In the M-step we update the parameters as: µnewrn = ∑L l ∑T l t β l r(t) · αlr.(t) · f ltn∑L l ∑T t β l r(t) · αlr.(t) (5.10) Σnewr = ∑L l ∑T l t β l r(t) · αlr.(t)(~f lt − ~µr)(~f lt − ~µr)T∑L l ∑T t β l r(t) · αlr.(t) (5.11) pinewr = ∑L l β l r(1) · αlr1(1)∑L l ∑R i β l i(1) · αli1(1) (5.12) γnew = ∑L l ∑T l t ∑R r β l r(t) · αlr1(t) L (5.13) φnewij = ∑L l ∑T l−1 t ξ l ij(t)∑L l ∑T l−1 t ∑R j ξ l ij(t) (5.14) using αlr.(t) to represent ∑ c={0,1} α l rc(t), and η l(t) to represent the scaling factor used in scaling the forward probabilities at the tth point of the lth sequence, we define ξlij(t) as: ξlij(t) = βlj(t+ 1) · φij · γ · p( ~f lt+1|rlt+1 = j)αlr.(t) ηl(t) (5.15) 70 In this step, we also learn the weights, ~wr, of the global component (Eqn. 5.5) by maximizing the following objective function using a gradient-descent based method: L∑ l R∑ r exp( ~wr · ~F l)∑R r′=1 exp( ~wr′ · ~F l) T l∑ t=2 βlr(t) · αli0(t) (5.16) 5.5 Data For our experiments we have used the dataset described in Chapter 4. It consists of plot overviews of around 300 English novels from SparkNotes. 1 These summaries were pre-processed to identify major characters, and pairs of characters that appeared together in more that 5 sentences were considered for analysis. This threshold was used in order to obtain character-pairs that interacted long enough to demonstrate the dynamic nature of their relationships but at the same time resulted in a sizeable dataset. These sentences were arranged linearly and form a sequence. The final dataset contained 634 such sequences, with an average length of 8.2 sentences per sequence. 5.6 Evaluation In this section we evaluate our models to answer the following questions: • How does the performance of our unsupervised models compare with each other and with their supervised counterpart? • Does incorporating global context help in improving performance? 1http://www.sparknotes.com/sparknotes/ 71 • How does the model performance compare with human judgments in charac- terizing inter-character relationships? • Are the latent relationship states learned by our model semantically coherent? Implementation Details: Section 5.3 described the process of conversion of sen- tences of the narrative to feature-vectors. To obtain word embeddings in this pro- cess, we used the skip-gram model (using the Word2Vec tool [76]). This model was trained with D = 200 on a collection of the unstructured text of a subset of the nov- els from Project Gutenberg. 2 The novels were pre-processed to remove punctuation and capitalization before training the tool. Also, Globally Aware GHMM needs a global-features set, F , extracted for the complete sequence of sentences. Our model doesn’t make any assumption about the nature of this feature-set. For our experiments, we use the average of the feature- vectors of all sentences in the sequence (i.e ~F = mean(~f1, ~f2 . . . ~fT )). We used an existing implementation of the GHMM model [81]. Also, we use the average of the feature-vectors of all sentences in a sequence as its global feature vector for our Globally Aware GHMM (i.e ~F = mean(~f1, ~f2 . . . ~fT )). We used  = 0.8 (selected using cross-validation). Estimating the covariance matrix Σ degraded performance, which might be due to overfitting [95]. Hence, we only show results for estimating ~µr, and we use a fixed diagonal matrix as Σ (with each diagonal entry being 0.01), following previous approaches [68]. 2https://www.gutenberg.org/ 72 5.6.1 Supervised Evaluation We begin with indirectly evaluating our models on a supervised task by heuris- tically aligning learnt latent states against label categories. For this purpose, we use the manually annotated sequences of the data described in Chapter 4. It consists of about 50 sequences in which each sentence is labeled with a binary relationship state, cooperative or non-cooperative, which we refer to as the gold-classes. Around 30% of the sequences were labeled with more than one class. However, our unsuper- vised models assign each sentence to a relationship state/cluster but do not provide a label to the states. For this evaluation we heuristically assign each of the learned states a cooperative/non-cooperative label using: label of state j = arg max i∈{coop,non−coop} {m j i Ni } (5.17) where, mji is the number of sentences belonging to the learned state j with gold- class=i, and Ni is the total number of sentences in the gold-class i. We adopted this method of labeling states instead of simply assigning each state to the gold-class that was most frequent in the state because of class skew in numbers of manual annotations in the data. Like Chapter 4, we report averaged precision, recall and F-measures of the two gold-classes. Since our models require random initialization of global weights and means of the relationship states, we report average values for 50 runs for each of the models. Figure 5.4 compares the performance of our models for various values of R –the number of relationship states, which is a user-provided input to our models. We can 73 Figure 5.4: Performance comparison of various models. Previously, we obtained an F-measure of 76.76 on the same dataset using supervised models (Chapter 4). see that, all our models significantly outperform the baseline, which always predicts the majority (cooperative) class. Also the supervised model proposed in Chapter 4 achieved an F-measure of 76.76 on the same dataset but with expensive manu- ally annotated data required for training. The figure shows that Globally Aware GHMM (black), in general, performs better than the Penalized GHMM (green) and the GHMM (blue). It also outperforms the Global Model (orange), which is an unstructured model that clusters the sentences independently (corresponds to the global only component of the Globally Aware GHMM). This indicates that for this task, it is important to have a global as well as local perspective of characters’ interactions. The performance of Penalized GHMM (green) is comparable to that of GHMM (blue), which hints that the penalty term might not be contributing significantly to the model’s performance. To investigate this further, we modified Globally Aware 74 State Most Frequent words Familial kinship, father, relationship, personal, love, mother, son, brother, family, wife, daughter, tell, friend, marry, child, sister Desire relationship, become, make, decide, reveal, personal, find, kinship, desiring, forming relationships, learn, marriage, marry Casual go, meet, come, leave, take, find, return, together, arrive, get, tell, run, bring, decide, home, enter, begin, make, try, ask Verbal tell, ask, say, want, see, know, tells, find, realize, desiring, love, leave, marry, think, learn, believe, asks, try, kill, have Hostile kill, killing, attack, protect, cause harm, encounter, die, tell, fight, hostile, try, kinship, take, back, find, make, protecting Table 5.1: Representative words for relationship states learned by the Globally Aware GHMM. GHMM to exclude the penalty term. We call this model ‘Globally Aware GHMM w/o penalty’ (purple). We can see that its performance is much worse than that of Globally Aware GHMM (black). This suggests that while the penalty term is not very useful for a local model like GHMM, it is indeed valuable for models like Globally Aware GHMM, which have an ability to switch between local and global components. This is because frequent switching between the two compo- nents can result in frequent shifting between relationship states within a sequence. This frequent shifting is unnatural for the given task because the states represent inter-personal relationships, which are usually stable and evolve smoothly with the narrative. Therefore, such a model benefits from a penalty term, which smoothens the relationship trajectories and makes them more consistent with human judgment. This can also be observed in Figure 5.5. The Globally Aware GHMM (black) has much stronger preference for one of the components (global) than Globally Aware 75 Figure 5.5: Global vs local preference learned by the Globally Aware GHMM with and without penalty. The model without penalty has weaker preference for a global categorization which can result in frequent shifts in relationship states within a sequence, resulting in lower accuracy. GHMM w/o penalty (purple). The preference learned for the latter model is closer to 0.5 which would result in frequent shifts between the two components. 5.6.2 Relationship Analogy Task We now evaluate our model against human judgment without restrictive as- sumptions on the types of relationships. Manually designing a taxonomy of rela- tionship types is challenging. Further, judging the quality of a learned relationship sequence using a given taxonomy is subjective, and can vary considerably with choices in design. Hence, we evaluate the models on an objective task involving human subjects to answer questions based on semantics of inter-character relationships. Specifically, given a pair (P ) of characters in a text, we asked subjects to pick a pair that reflects 76 Figure 5.6: Screenshot from human evaluation task a similar relationship from two choices of pairs of characters, O1 and O2. Figure 5.6 shows an example question. The two options were selected from pool of pairs similar (or dissimilar) to the given pair, P , as determined by our model. We use an edit- distance based measure to compute similarity between (normalized) relationships sequences of any two given pairs. Subjects were required to read the summaries of these books before they an- swered questions, but also had access to these during the task. An illustrative sample question and answer was initially provided to explain the task. Subjects were also reminded to consider that personal relationships can have multiple facets and can change over time. They were asked to consider not just the nature but also the trajectories of relationships while judging similarity. For each question, the subjects could choose one of the two candidates character-pairs, or a don’t know option, and could report their confidence in doing so (high/low). To reduce annotator fatigue, each annotator was asked questions about characters mentioned in only three books 77 and could not answer more than 10 questions per session. Overall, we collected an- swers for about 100 such questions and each question was answered by at least three annotators. The inter-annotator agreement on this task was 0.73 (chance adjusted Fleiss’ κ = 0.46). We then evaluated the agreement between the model’s judgment and the hu- man provided answers. The annotators agreed with our model’s similarity judgment in 66.0% of the questions. This is a considerably hard task (since the inter-annotator agreement is not high) and a random baseline would agree only half the time. Hence we can conclude that the model learns sequences that correlate significantly with human judgment (p < 10−3), and can be used to address semantically complicated questions. 5.6.3 Coherence of Relationship States Table 5.1 shows the most frequent words from the relationship states learned by the Globally Aware GHMM (R = 5). For the purpose of this visualization, for each state, we report the most frequent words from the union of the feature-sets extracted (Sec. 5.3) for all the sentences assigned to that state. We can see that the first state corresponds to familial relationships as it consists of words like father, family, etc. The second state corresponds to a desire to initiate relationships as it contains words like desiring, forming relationships, marriage, etc. The third state consists of sentences in which the characters participate in physical action like go, meet, arrive, etc. The last two states depict casual verbal communication and hostile 78 Figure 5.7: Model Precision (MP) for the Word-intrusion detection task for relation- ship states learned by our model. MP is high for most states indicating semantic coherence. relationship respectively. Word-intrusion Detection Task: To further investigate the semantic coherence of the states, we performed the ‘Word-intrusion detection’ task [24]. In this task, a human subject is presented with six randomly ordered words. Five of these words are high frequency words from one of our learned relationship states, and one of the words, the ‘intruder’, belongs to a different (randomly chosen) relationship state. The subjects are then asked to identify the intruder word. The subjects in our experiments were graduate students from varying disciplines and were comfortable with English. Each subject was shown at-least five sets of words (one corresponding to each state), and no subject was shown more than ten sets. We then calculate the ‘Model Precision’ for each topic which is the fraction of times a subject accurately identified the intruder. Figure 5.7 shows the results of our experiment. We can see 79 that the subjects successfully identified the intruder with high precision in all cases (p < 0.001) except the ‘desire’ state (p ∼ 0.3). We can thus conclude that the states learned by our model are semantically coherent. 5.7 Conclusion This chapter relaxed the limiting assumptions about the number of possible relationship types made in the previous chapters. In this chapter, we presented methods to discover various types of relationships in an unsupervised manner. In particular, we addressed the problem of unsupervised learning of evolving relation- ship between characters from novel summaries. Treating this as a structured pre- diction problem we present three different models that are based on linguistic in- formation as well as real-world knowledge. Our experiments reveal that for solving this problem, it is not sufficient to simply look at local historical cues about the relationships between the two characters of interest from a small part of text at any juncture. Instead, it is important to incorporate a global context of the overall nature of relationship between them. We also empirically demonstrate the semantic coherence of the relationship states learned by our model. 80 Chapter 6: Application: Identifying Existence of Instructor-student Relationship in MOOC Forums Work described in this chapter was published in ACL 2014 and AAAI 2016 [26, 27]. In this chapter, we present an application of our relationship modeling ap- proach to a real-world task in the domain of educational discussion forums. We work with MOOC (Massive Open Online Course) forums, which provide a platform for students enrolled in MOOCs to interact with the instructional staff. In this domain, we address the problem of predicting if an instructor would intervene on a thread, assuming that posts of threads are chronologically structured in a chain-like fashion. As mentioned earlier, the underlying analogy here models the intervention of an instructor in a discussion as the initiation of a deeper relationship, in context of the historical content of the thread. We model the problem as identification of relationship existence because in some sense, discussion forums provide a platform for students to form deeper rela- tionships with instructors. In fact in MOOCs, forums are the only way for students to interact personally with the instructors. When a student posts on a thread seek- ing instructor’s attention, (s)he is attempting to initiate such a relationship with 81 the instructor. And when the instructor responds to the student’s query, (s)he con- firms/forms the relationship. Thereafter, depending on the content of their posts, this relationship takes concrete form/nature. For example, if the instructor sup- ports/endorses a student’s answer in a thread, their relationship takes a ‘positive’ form. Hence, in future, the instructor is likely to have more trust on (positive bias towards) this particular student. In this work, we limit ourselves to predicting if the relationship would be formed, and don’t delve deeper into the nature of relation- ship (whether the instructor supports/disproves/trusts the claim/query/comment posted by the student). We considered two different MOOCs and found their content to be not very different from narratives. The forums follow a coherent well-organized development, with each thread attributed to a title, which summarizes the query posted by a stu- dent. Posts in a thread contain comments/developments relevant to the title, and follow a natural temporal order, similar to most narratives. In fact, as discussed in this chapter, we propose two models that utilize this narrative-schema-like lin- eal arrangement of posts and our experiments demonstrate that doing so helps in making better judgments about the instructor’s decision to post. We additionally demonstrate the utility of our model by addressing another task of identifying if a desire expressed in a short paragraph is fulfilled. 82 6.1 Problem Motivation and Definition Ubiquitous computing and easy access to high bandwidth internet have re- shaped the modus operandi in distance education towards Massive Open Online Courses (MOOCs). MOOCs impart inexpensive and high-quality education from field-experts to thousands of learners across geographic and cultural barriers. Even as the MOOC model shows exciting possibilities, it presents a multitude of chal- lenges that must be addressed [111, 112, 85, 90, 5, 58, 82]. MOOCs platforms have been especially criticized on grounds of lacking a personalized educational experience [37]. Unlike traditional classrooms, the predominant mode of interac- tion between students and instructors in MOOCs is via online discussion forums. Ideally, forum discussions can help make up for the lack of direct interaction, by enabling students to ask questions and clarify doubts. However, due to huge class sizes, MOOCs witness a very large number of threads on these forums. Owing to extremely skewed ratios of students to instructional staff, it can be prohibitively time-consuming for the instructional staff to manually follow all threads of a fo- rum. Hence there is a pressing need for automatically curating the discussions for the instructors. Analyzing forum-postings contents and bringing the most pertinent content to the instructor’s attention would help instructors receive timely feedback and design interventions as needed. From the students’ perspective, the problem is evident from an examination of existing forum content, indicating that if stu- dents want instructor’s input on some issues, the only way for them to get his/her attention is by ‘up-voting’ their votes. Figure 6.1 provides some examples of this 83 “The problem summary: Anyone else having problems viewing the video lec- ture...very choppy. If you are also experiencing this issue; please upvote this post.” “I read that by up-voting threads and posts you can get the instructors’ atten- tion faster.” “Its is very bad to me that I achieved 10 marks in my 1st assignment and now 9 marks in my 2nd assignment, now I won’t get certificate, please Course staff it is my appeal to change the passing scheme or please be lenient. Please upvote my post so that staff take this problem under consideration.” Figure 6.1: Example posts demonstrating that in order to attract instructor’s at- tention students have to resolve to the inefficient method of getting their posts upvoted by their classmates. Therefore, there is a need to design models that could automatically identify threads on which instructors intervene. behavior. This is clearly an inefficient solution. Thus, we focus on identifying situations in which instructor (interchangeable with “instructional staff”) intervention is warranted. In our description it is assumed that a forum consists of multiple threads. Each thread (t) has a title and consists of multiple posts (pi). Individual posts do not have a title and the number of posts varies dramatically from one thread to another. We address the problem of predicting if the course instructor would intervene on a thread, t. The instructor’s decision to intervene, r, equals 0 when the instructor doesn’t reply to the thread and 1 otherwise. Individual posts are not assumed to be labeled with any category and the only supervision given to the model during training is in form of intervention decision. 84 6.2 Role of Context While the textual content of the discussion forum can be expected to provide significant cues about instructor’s intervention decision, as indicated before, it is equally important to view an individual post in context of the other posts in the thread. Therefore, we propose to leverage the structure of the forum thread in order to maximally utilize its textual contents. We assume that posts of a thread structure it in form of a story or a ‘chain of events’. For example, an opening post of a thread might pose a question and the following posts can then answer or comment on the question. Our models tap this linear ‘chain of events’ behavior of a thread by assum- ing that individual posts belong to latent categories which represent their textual content at an abstract level and that an instructor’s decision to reply to a post is based on this chain of events (represented by the latent categories). We present two different ways of utilizing this ‘chain of events’ behavior for predicting instructor’s intervention which can be either simply modeled as the ‘next step’ is this chain of events (Linear Chain Markov Model -LCMM) or as a decision globally depending on the entire chain (Global Chain Model -GCM). 6.3 Models In this section, we explain our models in detail. In our description it is assumed that a discussion board is organized into multiple forums (representing topics such 85 as “Assignment”, “Study Group”, etc.). A forum consists of multiple threads. Each thread (t) has a title and consists of multiple posts (pi). Individual posts do not have a title and the number of posts varies dramatically from one thread to another. We address the problem of predicting if the course instructor would intervene on a thread, t. The instructor’s decision to intervene, r, equals 0 when the instructor doesn’t reply to the thread and 1 otherwise. The individual posts are not assumed to be labeled with any category and the only supervision given to the model during training is in form of intervention decision. 6.3.1 Logistic Regression (LR) Our first attempt at solving this problem involved training a logistic regression for the binary prediction task which models P (r|t). Feature Engineering Our logistic regression model uses the following two types of features: Thread only features and Aggregated post features. ‘Thread only features’ capture informa- tion about the thread such as when, where, by who was the thread posted and lexical features based on the title of the thread. While these features provide a high-level information about the thread, it is also important to analyze the contents of the posts of the thread. In order to maintain a manageable feature space, we compress the features from posts and represent them using our ‘Aggregated post features’. Thread only features: 1. a binary feature indicating if the thread was started by an anonymous user 86 2. three binary features indicating whether the thread was marked as approved, unresolved or deleted (respectively) 3. forum id in which the thread was posted 4. time when the thread was started 5. time of last posting on the thread 6. total number of posts in the thread 7. a binary feature indicating if the thread title contains the words lecture or lectures 8. a binary feature indicating if the thread title contains the words assignment, quiz, grade, project, exam (and their plural forms) Aggregated post features: 9. sum of number of votes received by the individual posts 10. mean and variance of the posting times of individual posts in the thread 11. mean of time difference between the posting times of individual posts and the closest course landmark. A course landmark is the deadline of an assignment, exam or project. 12. sum of count of occurrences of assessment related words e.g. grade, exam, assignment, quiz, reading, project, etc. in the posts 87 13. sum of count of occurrences of words indicating technical problems e.g. prob- lem, error 14. sum of count of occurrences of thread conclusive words like thank you and thank 15. sum of count of occurrences of request, submit, suggest We had also considered and dropped (because of no performance gain) other features about identity of the user who started the thread, number of distinct par- ticipants in the thread (an important feature used for a similar task previously [7]), binary feature indicating if the first and the last posts were by the same user, aver- age number of words in the thread’s posts, lexical features capturing references to the instructors in the posts, etc. 6.3.2 Linear Chain Markov Model (LCMM) The logistic regression model is good at exploiting the thread level features but not the content of individual posts. The ‘Aggregated post features’ attempt to capture this information but since the number of posts in a thread is variable, these features relied on aggregated values. We believe that considering aggregate values is not sufficient for the task in hand. As noted before, posts of a thread are not independent of each other. Instead, they are arranged chronologically such that a post is published in reply to the preceding posts and this might effect an 88 h1 h2 hn r φ(t) p1 p2 pn T (a) Linear Chain Markov Model (LCMM) h1 h2 hn r φ(t) p1 p2 pn T (b) Global Chain Model (GCM) Figure 6.2: Diagrams of the Linear Chain Markov Model (LCMM) and the Global Chain Model (GCM). pi, r and φ(t) are observed and hi are the latent variables. pi and hi represent the posts of the thread and their latent categories respectively; r represents the instructor’s intervention and φ(t) represent the non-structural fea- tures used by the logistic regression model. instructor’s decision to reply. For example, consider a thread that starts with a question. The following posts will be students’ attempt to answer the question or raise further concerns or comment on previous posts. The instructor’s post, though a future event, will be a part of this process. We, therefore, introduce a new structured model, LCMM, to incorporate this complete process (shown in Figure 6.2a). The model abstractly represents the in- formation from individual posts (pi) using latent categories (hi). The intervention 89 For every thread, t, in the dataset: 1. Choose a start state, h1, and emit the first post, p1. 2. For every subsequent post, pi ∀ i ∈ {2 . . . n} : (a) Transition from hi−1 to hi. (b) Emit post pi. 3. Generate the instructor’s intervention decision, r, using the last state hn and non-structural features, φ(t). Figure 6.3: Instructor’s intervention decision process for the Linear Chain Markov Model. decision, r, is the last step in the chain and thus incorporates information from the individual posts. It also depends on the thread level features: ‘Thread only features’ and the ‘Aggregated post features’ jointly represented by φ(t) (also referred to as the non-structural features). This process is explained in Figure 6.3. We use hand-crafted features to model the dynamics of the generative process. Whenever a latent state emits a post or transits to another latent state (or to the final intervention decision state), emission and transition features get fired which are then multiplied by respective weights to compute a thread’s ‘score’: fw(t, p) = max h [w · φ(p, r, h, t)] (6.1) Note that the non-structural features, φ(t), also contribute to the final score. 90 Learning and Inference During training we maximize the combined scores of all threads in the dataset using a generic EM style algorithm. The supervision in this model is provided only in form of the observed intervention decision, r and the post categories, hi are hidden. The model uses the pseudocode shown in Algorithm 3 to iteratively refine the weight vectors. In each iteration, the model first uses the Viterbi algorithm to decode thread sequences with the current weights wt to find optimal highest scoring latent state sequences that agree with the observed intervention state (r = r′). In the next step, given the latent state assignments from the previous step, a structured perceptron algorithm [31] is used to update the weights wt+1 using weights from the previous step, wt, initialization. Algorithm 3 Training algorithm for LCMM 1: Input: Labeled data D = {(t, p, r)i} 2: Output: Weights w 3: Initialization: Set wj randomly, ∀j 4: for t : 1 to N do 5: hˆi = arg maxh[wt · φ(p, r, h, t)] such that r = ri∀i 6: wt+1 = StructuredPerceptron(t, p, hˆ, r) 7: end for 8: return w While testing, we use the learned weights and Viterbi decoding to compute the intervention state and the best scoring latent category sequence. 91 Feature Engineering In addition to the ‘Thread Only Features’ and the ‘Aggregated post features’, φ(t) (Section 6.3.1), this model uses the following emission and transition features: Post Emission Features: 1. φ(pi, hi) = count of occurrences of question words or question marks in pi if the state is hi; 0 otherwise. 2. φ(pi, hi) = count of occurrences of thank words (thank you or thanks) in pi if the state is hi; 0 otherwise. 3. φ(pi, hi) = count of occurrences of greeting words (e.g. hi, hello, good morning, welcome, etc. ) in pi if the state is hi; 0 otherwise. 4. φ(pi, hi) = count of occurrences of assessment related words (e.g. grade, exam, assignment, quiz, reading, project, etc.) in pi if the state is hi; 0 otherwise. 5. φ(pi, hi) = count of occurrences of request, submit or suggest in pi if the state is hi; 0 otherwise. 6. φ(pi, hi) = log(course duration/t(pi)) if the state is hi; 0 otherwise. Here t(pi) is the difference between the posting time of pi and the closest course landmark (assignment or project deadline or exam). 7. φ(pi, pi−1, hi) = difference between posting times of pi and pi−1 normalized by course duration if the state is hi; 0 otherwise. Transition Features: 92 1. φ(hi−1, hi) = 1 if previous state is hi−1 and current state is hi; 0 otherwise. 2. φ(hi−1, hi, pi, pi−1) = cosine similarity between pi−1 and pi if previous state is hi−1 and current state is hi; 0 otherwise. 3. φ(hi−1, hi, pi, pi−1) = length of pi if previous state is hi−1, pi−1 has non-zero question words and current state is hi; 0 otherwise. 4. φ(hn, r) = 1 if last post’s state is hn and intervention decision is r; 0 otherwise. 5. φ(hn, r, pn) = 1 if last post’s state is hn, pn has non-zero question words and intervention decision is r; 0 otherwise. 6. φ(hn, r, pn) = log(course duration/t(pn)) if last post’s state is hn and interven- tion decision is r; 0 otherwise. Here t(pn) is the difference between the posting time of pn and the closest course landmark (assignment or project deadline or exam). 6.3.3 Global Chain Model (GCM) In this model we propose another way of incorporating the chain structure of a thread. Like the previous model, this model also assumes that posts belong to latent categories. It, however, doesn’t model the instructor’s intervention decision as a step in the thread generation process. Instead, it assumes that instructor’s decision to intervene is dependent on all the posts in the threads, modeled using the latent post categories. This model is shown in Figure 6.2b. Assuming that p represents posts of thread t, h represents the latent category assignments, r represents the intervention 93 decision; feature vector, φ(p, r, h, t), is extracted for each thread and using the weight vector, w, this model defines a decision function, similar to what is shown in Equation 6.1. Learning and Inference Similar to the traditional maximum margin based Support Vector Machine (SVM) formulation, our model’s objective function is defined as: min w λ 2 ||w||2 + T∑ j l(−rjfw(tj, pj)) (6.2) where λ is the regularization coefficient, tj is the j th thread with intervention decision rj and pj are the posts of this thread. w is the weight vector, l(·) is the squared hinge loss function and fw(tj, pj) is defined in Equation 6.1. Replacing the term fw(tj, pj) with the contents of Equation 6.1 in the mini- mization objective above, reveals the key difference from the traditional SVM formu- lation - the objective function has a maximum term inside the global minimization problem making it non-convex. We, therefore, employ the optimization algorithm presented in [25] to solve this problem. Exploiting the semi-convexity property [45], the algorithm works in two steps, each executed iteratively. In the first step, it determines the latent variable assignments for positive examples. The algorithm then performs two steps iteratively - first it determines the structural assignments for the negative examples, and then optimizes the fixed objective function using a cutting plane algorithm. Once this process converges for negative examples, the algorithm reassigns values to the latent variables for positive examples, and proceeds to the second step. The 94 algorithm stops once a local minimum is reached. A somewhat similar approach, which uses the Convex-Concave Procedure (CCCP) is presented by [113]. At test time, given a thread, t, and it posts, p, we use the learned weights to compute fw(t, p) and classify it as belonging to the positive class (instructor intervenes) if fw(t, p) ≥ 0. Feature Engineering The feature set used by this model is very similar to the features used by the previous model. In addition to the non-structural features used by the logistic regression model (Section 6.3.1), it uses all the Post Emission features and the three transition features represented by φ(hi−1, hi) and φ(hi−1, hi, pi, pi−1) as described in Section 6.3.2. 6.4 Data For evaluating our models, we have used the forum content of two MOOCs from different domains (science and humanities), offered by Coursera, a leading education technology company.1 Both courses were taught by professors from the University of Maryland, College Park. Genes and the Human Condition (From Behavior to Biotechnology) (GHC) dataset: 2 This course was attended by 30,000 students and the instructional staff comprised of 2 instructors, 3 Teaching Assistants and 56 technical support staff. The discussion forum of this course consisted of 980 threads composed of about 1https://www.coursera.org/ 2https://www.coursera.org/course/genes 95 3,800 posts. Women and the Civil Rights Movement (WCR) dataset: 3 The course consisted of a classroom of about 14,600 students, 1 instructor, 6 Teaching Assistants and 49 support staff. Its discussion forum consisted of 800 threads and 3,900 posts. We evaluate our models on held-out test sets. For the GHC dataset, the test set consisted of 186 threads out of which the instructor intervened on 24 while, for the WCR dataset, the instructor intervened on 21 out of 155 threads. Also, it was commonly observed that after an instructor intervenes on a thread, its posting and/or viewing behavior increases. We, therefore, only consider the student posts until the instructor’s first intervention. Care was also taken to not use features that increased/decreased disproportionately because of the instructor’s intervention such as number of views or votes of a thread. In our evaluation, we approximate instructor’s ‘should reply’ instances with those where the instructor indeed replied. Unlike general forum users, we believe that the correlation between the two scenarios is quite high for instructors. It is their responsibility to reply, and by choosing to a MOOC, they have ‘bought in’ to the idea of forum participation. The relatively smaller class sizes of these two MOOCs also ensured that most threads were manually reviewed, thus reducing instances of ‘missed’ threads while retaining the posting behavior and content of a typical MOOC. 3https://www.coursera.org/course/womencivilrights 96 Model Genes and the Human Condition (GHC) Women and the Civil Rights (WCR) P R F P R F LR 44.44 16.67 24.24 66.67 15.38 25.00 J48 45.50 20.80 28.55 25.00 23.10 24.01 LCMM 33.33 29.17 31.11 42.86 23.08 30.00 GCM 60.00 25.00 35.29 50.00 18.52 27.03 Table 6.1: Held-out test set performances of chain models, LCMM and GCM, are better than that of the unstructured models, LR and J48. 6.5 Evaluation 6.5.1 Quantitative Results Since the purpose of solving this problem is to identify the threads which should be brought to the notice of the instructors, we measure the performance of our models using F-measure of the positive class. The values of various parameters were selected using 10-fold Cross Validation on the training set. Table 6.1 presents the performances of the proposed models on the held-out test sets. We also report performance of a decision tree (J48) on the test sets for sake of comparison. We can see that the chain based models, Linear Chain Markov Model (LCMM) and Global Chain Model (GCM), outperform the unstructured models, namely Lo- gistic regression (LR) and Decision Trees (J48). This validates our hypothesis that 97 Figure 6.4: Visualization of lexical contents of the categories learnt by our model from the GHC dataset. Each row is a category and each column represents a fea- ture vector. A light color represents high values while lower values are represented by darker shades. Dark monochromatic columns are used to better separate the five feature clusters, F1-F5, which represent words that are common in thanking, logistics-related, introductory, syllabus related and miscellaneous posts respectively. Categories 1,2,3 and 4 are dominated by F2, F4, F1 and F3 respectively indicating a semantic segregation of posts by our model’s categories. using the post structure results in better modeling of instructor’s intervention. The table also reveals that GCM yields high precision and low recall values, which is possibly due to the model being more conservative owing to information from all posts of the thread. 98 6.5.2 Visual Exploration of Categories Our chain based models assume that posts belong to different (latent) cate- gories and use these categories to make intervention predictions. Since this process of discovering categories is data driven, it would be interesting to examine the contents of these categories. Figure 6.4 presents a heat map of lexical content of categories identified by LCMM from the GHC dataset. The value of H (number of categories) was set to be 4 and was pre-determined during the model selection procedure. Each row of the heat map represents a category and the columns represent values of indi- vidual features, f(w, c), defined as: f(w, c) = C(w,c) where, C(w, c) is total count of occurrences of a word, w, in all posts assigned to category, c and < C(w, c) > represents its expected count based on its frequency in the dataset. While the actual size of vocabulary is huge, we use only a small subset of words in our feature vector for this visualization. These feature values, after normalization, are represented in the heat map using colors ranging from bright cream (high value) to dark black (low value). The darker the shade of a cell, the lower is the value represented by it. For visual convenience, the features are manually clustered into five groups (F1 to F5) each separated by a dark beige colored column in the heat map. The first column of the heat map represents the F1 group which consists of words like thank you, thanks, etc. These words are characteristic of posts that mark either the conclusion of a resolved thread or are posted towards the end of the course. Rows corresponding to the category 3 in Table 6.2 show two examples of such 99 posts. Similarly, F2 represents the features related to logistics of the course and F3 captures introductory posts by new students. Finally, F4 contains words that are closely related to the subfield of gene and human conditions and would appear in posts that discuss specific aspects or chapters of the course contents, while F5 contains general buzz words that would appear frequently in any biology course. Analyzing individual rows of the heat map, we can see that out of F1 to F4, Categories 1, 2, 3 and 4 are dominated by logistics (F2), course content related (F4), thank you (F1) and introductory posts (F3) respectively, represented by bright colors in their respective rows. We also observe similar correlations while examining the columns of the heat map. Also, F5, which contains words common to the gene and human health domain, is scattered across multiple categories. For example, dna/rna and breeding are sufficiently frequent in category 1 as well as 2. Table 6.2 gives examples of representative posts from the four clusters. We show only the relevant part of the complete post. We can see that these examples agree with our observations from the heat map. Furthermore, we compare the semantics of clusters learnt by our models with those proposed by Stump at al. [101] even though the two categorizations are not directly comparable. Nevertheless, generally speaking, our category 1 corresponds to Stump et al. [101]’s Course structure/policies and category 2 corresponds to Content. Interestingly, categories 3 and 4, which represent valedictory and introductory posts, correspond to a single Social/affective from the previous work. We can, therefore, conclude that the model, indeed splits the posts into cate- gories that look semantically coherent to the human eyes. 100 Category Example posts 1 ‘I’m having some issues with video playback. I have downloaded the videos to my laptop...’ ‘There was no mention of the nuclear envelope in the Week One lecture, yet it was in the quiz. Is this a mistake?’ 2 ‘DNA methylation is a crucial part of normal development of or- ganisms and cell differentiation in higher organisms...’ ‘In the lecture, she said there are...I don’t see how tumor-suppressor genes are a cancer group mutation.’ 3 ‘Thank you very much for a most enjoyable and informative course.’ ‘Great glossary! Thank you!’ 4 ‘Hello everyone, I’m ... from the Netherlands. I’m a life science student.’ ‘Hi, my name is ... this is my third class with coursera’ Table 6.2: Representative posts from the four categories learnt by our model. Due to space and privacy concerns we omit some parts of the text, indicated by “. . . ”. 6.5.3 Choice of Number of Categories Our chain based models, assigning forum posts to latent categories, are param- eterized with H, the number of categories. We therefore, study the sensitivity of our models to this parameter. Figure 6.5, plots the 10-fold cross validation performance of the models with increasing values of H for the two datasets. Interestingly, the sensitivity of the two models to the value of H is very different. The LCMM model’s performance fluctuates as the value of H increases. The initial performance improvement might be due to an increase in the expressive power of the model. Performance peaks at H = 4 and then decreases, perhaps owing to over-fitting of the data. 101 In contrast, GCM performance remains steady for various values of H which might be attributed to the explicit regularization coefficient which helps combat over-fitting, by encouraging zero weights for unnecessary categories. (a) Genes and the Human Condition dataset (b) Women and the Civil Rights Movement dataset Figure 6.5: Cross validation performances of the two models with increasing number of categories. 102 (a) Genes and the Human Condition dataset (b) Women and the Civil Rights Movement dataset Figure 6.6: Cross validation performances of the various feature types for the two datasets. 103 6.5.4 How important are linguistic features? We now focus on the structure independent features and experiment with their predictive value, according to types. We divide the features used by the LR into the following categories:4 • Full: set of all features (feature no. 1 to 15) • lexical: based on content of thread titles and posts (feature no. 7 to 8 and 12 to 13) • landmark: based on course landmarks (e.g, exams, quizzes) information (fea- ture no. 11) • MOOCs-specific: features specific to the MOOCs domain (lexical + landmark features) • post: based only on aggregated posts information (feature no. 9 to 15) • temporal: based on posting time patterns (feature no. 4, 5 and 10) Figure 6.6 shows 10-fold cross validation F-measure of the positive class for LR when different types of features are excluded from the full set. The figure reveals that the MOOCs-specific features (purple bar) are important for both the datasets indicating a need for designing specialized models for forums analysis in this domain. 4Please refer to Section 6.3.1 for description of the feature ids. 104 Also, lexical features (red bar) and post features (blue bar) have pretty dra- matic effects in GHC and WCR data respectively. Interestingly, removing the landmark feature set (green bar) causes a consid- erable drop in predictive performance, even though it consists of only one feature. Other temporal features (orange bar) also turn out to be important for the predic- tion. From a deeper analysis of feature values, we observed that instructors tend to get more active as the course progresses and their activity level also increases around quizzes/exams deadlines. We can, therefore, conclude that all feature types are important and that lexical as well as MOOC specific analysis is necessary for modeling instructor’s intervention. 6.6 Additional Application: Understanding Desire Fulfillment We additionally apply the model proposed above to a different problem of identifying if a desire expressed by a subject in a given short piece of narrative text was fulfilled. 6.6.1 Problem Motivation and Definition Understanding expressions of desire is a fundamental aspect of understanding intentional human-behavior. Desire expressions can be used to provide rationale for character behaviors when analyzing narrative text [53, 41], extract information about human wishes [51], explain positive and negative sentiment in reviews, and 105 Desire Expression: Before Lenin died, he said he wished to be buried beside his mother. e1: When he died, Stalin let the people in Russia look at his body. e2: Because people kept coming they decided not to bury him, and preserved his body instead. e3: A building was built in Red Square, Moscow over the body so that people could see it. e4: It is called the Lenin Mausoleum. e5: Many Russians and tourists still go there to see his body today. Desire Fulfillment status: Not fulfilled Figure 6.7: Example consisting of: a Desire Expression, Evidence fragments (e1. . .e5) and a binary Desire Fulfillment Status. The Desire-subject is marked in bold fonts in the Desire-expression. support automatic curation of community forums by identifying unresolved issues raised by users. Given text, denoted as Desire-expression (e.g., “Before Lenin died, he said he wished to be buried beside his mother.”) containing a desire (“be buried beside his mother”) by the Desire-subject (“he”), and the subsequent text (denoted Evi- dence fragments or simply Evidences) appearing after the Desire-expression in the paragraph, we predict if the Desire-subject was successful in fulfilling their desire. Figure 6.7 illustrates our setting. 6.6.2 Model: LCMM To solve this problem we track the events and emotional states associated with the narrative’s central character (the Desire-subject) while modeling the narrative flow offered by the Evidence fragments. To model the narrative flow we associate 106 a latent state to each Evidence fragment indicating the model’s belief about the desire fulfillment status. We model the transitions between these states using the LCMM model presented above (Figure 6.2a), where the final latent state indicates the binary desire fulfillment status (as opposed to representing the instructor’s reply decision in the other task mentioned in this chapter). In other words, such an approach analyzes individual evidence fragments to extract information about desire fulfillment. However, while doing so it views each evidence fragment in context of the other evidences. The model is trained in a similar fashion. 6.6.3 Features We now describe our features and how they are used by the model. Table 6.3 defines our features. They capture different semantic aspects of the desire-expression and evidences, such as entities, their actions and connotations, and their emotive states using lexical resources like Connotation Lexicon [46], WordNet and our lexicon of conforming and dissenting phrases. Before extracting features, we pre-processed the text to obtain POS tags, dependency parses, and resolved co-references using Stanford CoreNLP [73], and extracted all adjectives and verbs (with their negation statuses and connotations) associated with the Desire-subject. 1. Entailment (F1): This feature simply incorporates the output of BIUTEE [99, 70] – a Textual Entailment (TE) model. Since, TE systems often rely on aligning the entities appearing in the text fragments, we reduce the desire fulfillment task into several TE instances consisting of text-hypothesis pairs, by pairing the Desire- 107 Feature Type Id Definition Entailment F1 TEPrediction: Binary prediction of the Textual Entailment model [99]. Discourse F2, F3 ButPresent, SoPresent : Binary features indicating if a ‘but’ or ‘so’ (respectively) followed the Desire-verb (‘wanted to’, ‘wished to’ etc.) in the Desire-expression. Focal Word F4, F5, F6 focal count, focal syn and focal ant count: Count of occur- rences of the focal word(s), their WordNet [77] synonyms and antonyms (respectively) in the Evidence. Occurrences of syn- onyms or antonyms were identified only when they had the same POS tag as the focal word(s). F7 focal+syn count: Sum of F4 and F5 F8 focal lemm count: Count of occurrences of lemmatized forms of the focal word(s) in the Evidence. Desire- subject mention F9 sub count: Count of all mentions (direct and co-referent) of the Desire-subject in the Evidence. Emotional State F10, F11 +adj, -adj count: Counts of occurrences of ‘positive’ and ‘neg- ative’ adjectives (respectively) modifying the direct and co- referent mentions of the Desire-subject in the Evidence. Action F12, F13 +Agent, -Agent count: Number of times the connotation of verbs appearing in the Evidence agreed with and disagreed with (respectively) that of the intended action. F14, F15 +Patient, -Patient count: Count of occurrences of ‘positive’ and ‘negative’ verbs (respectively) in the Evidence which had the Desire-subject as the patient. Sustenance F16, F17 isConforming, isDissenting: Binary features indicating if the Evidence starts with a conforming or dissenting phrase (respec- tively) (see Appendix B). Table 6.3: Feature definitions (Section 6.6.3). F1-F3 are extracted for each example while F4-F17 are extracted for each evidence. 108 Figure 6.8: Artificial example indicating feature utility. The Desire-subject mentions are marked in blue, actions in bold and emotions in italics. Discourse feature is underlined. expression (hypothesis) with each of the Evidence fragments (text) in that example. However, we “normalized” the Desire-expression, so that it would be directly appli- cable for the TE task and resemble the input over which such systems are trained. For example, the Desire-expression, “One day Jerry said he wanted to paint his barn.”, gets converted to “Jerry painted his barn.”. This process followed several steps: • If the Desire-subject is pronominal, replace it with the appropriate named entity when possible (we used the Stanford CoreNLP coreference resolution system) [73]. • Ignore the content of the Desire-expression appearing before the Desire-subject. • Remove the clause containing the Desire-verb (‘wanted to’, ‘wished to’ etc.), and convert the succeeding verb to its past tense. The desire was considered ‘fulfilled’ if the TE model predicted entailment for at least one of the text-hypothesis pairs of the example. 2. Discourse (F2-F3): These features aim to identify indications of obstacles 109 or progress of desire fulfillment in the Desire-expression itself, based on discourse connectives. E.g. ‘so’ (underlined) in the Desire-expression in Figure. 6.8 indicates progress of desire fulfillment. 3. Focal words (F4-F8): These features identify the word(s) most closely related to the desire, and look for their presence in the Evidences. We define a focal word as the clausal complement of the Desire-verb (‘wanted to’, ‘hoped to’, ‘wished to’). If the clausal complement is a verb, the focal word is its past tense form. e.g., the focal word in the Desire expression in Figure. 6.8 is ‘helped’. A focal word is not simply the verb following the Desire-verb: e.g. in the Desire-expression in Figure. 6.7, the causal complement of ‘wished’ is ‘buried’. We then define features counting occurrences of the identified focal words and their WordNet synonyms and antonyms in each of the Evidences. 4. Desire-subject mentions (F9): This feature looks for mentions of Desire- subject in the Evidences assuming that a lack of mentions of the Subject might indicate absence of instances of their taking actions needed to fulfill the desire. 5. Emotional State (F10-F11): Signals about the fulfillment status could also emanate from the emotional state of the Subject. A happy or content Desire-subject can be indicative of a fulfilled desire (e.g. in Evidence e3 in Figure. 6.8), and vice versa. We quantify the emotional state of the Subject(s) using connotations of the adjectives modifying their mentions. 6. Action features (F12-F15): These features analyze the intended action and the actions taken by various entities. We first identify the intended action - the verb immediately following the Desire-verb in the Desire Expression. e.g., in Figure 6.8 110 the intended action is to ‘help’. Thereafter, we design features that capture the connotative agreement between the intended action and the actions taken by the Desire-subject(s) in the Evidences. We also include features that describe connota- tions of actions (verbs) affecting the Desire-subject(s). E.g. in e1 of Figure 6.8, the action by the Desire-subject (marked in blue), ‘offered’, is in connotative agreement with the intended action, ‘help’ (both have positive connotations according to Feng et al. [46]). Also, the actions affecting the subject (‘thanked’, ‘gifted’) have positive connotations indicating desire fulfillment. 7. Sustenance Features (F16-F17): LSNM uses a chain of latent states to abstractly represent the content of the Evidences with respect to Desire fulfillment Status. At any point in the chain, the model has an expectation of the fulfillment status. The sustenance features indicate if the expectation should intensify, remain the same or be reversed by the incoming Evidence fragment. This is achieved by designing features indicating if the Evidence fragment starts with a ‘conforming’ or a ‘dissenting’ phrase. E.g. e3 in Figure 6.8 starts with a conforming phrase, ‘Overall’, indicating that the fulfillment status expectation (positive in e2) should not change. These phrases were chosen using various discourse senses mentioned in Prasad et al. [88]. The complete list is available in Appendix B. 6.6.4 Evaluation We evaluated our model on two datasets consisting of excerpts from (i) stories in the Machine Comprehension Test dataset (MCTest) [92] and (ii) the Simple En- 111 glish Wikipedia (SimpleWiki) 5. We compared this model with unstructured models that do not take the narrative flow into account. We also include the majority base- line, which always predicts the majority class. The majority classes for the MCTest and SimpleWiki datasets were the positive and the negative classes respectively. Table 6.4 shows our results. For our model, LCMM, we report median performance values over 100 random restarts, since its performance depends on the initializa- tion of the weights. Also, the number of latent states was set to be 2 and 15 for the MCTest and SimpleWiki datasets respectively using cross-validation. The difference in optimal values for the number of latent states (and F scores) for the two datasets could be attributed to the difference in complexity of the language and concepts used in them. MCTest consists of children stories, focusing on simple concepts and goals (e.g., ‘wanting to go skating’) and their fulfillment is indicated explicitly, in simple and focused language (e.g., ‘They went to the skating rink together.’). On the other hand, SimpleWiki describes real-life desires (e.g., ‘wanting to conquer a country’), which require sophisticated planning over multiple steps, which may pro- vide only indirect indication of the desire’s fulfillment. This resulted in a harder classification problem, and increased the complexity of inference over several latent states. The table shows that LCMM outperforms the unstructured models indicating the benefit of incorporating the narrative structure of the storyline offered by the text by viewing individual evidences in context of other evidences to better understand desire fulfillment. 5http://simple.wikipedia.org/wiki/Main_Page 112 Data Model Type Model Name P R F MCTest Unstructured Majority 27.9 50.0 35.8 LR 64.7 64.9 64.6 DT 63.2 63.9 61.7 Structured LCMM 69.4 68.8 68.0 SimpleWiki Unstructured Majority 36.0 50.0 41.8 LR 61.6 52.7 49.2 DT 57.7 51.3 46.3 Structured LCMM 57.1 55.1 54.2 Table 6.4: Test set performances of various models reporting averaged Precision, Recall and F measure of the two classes. Our Latent Chain Markov Model, LCMM, outperforms the unstructured models (Logistic Regression, LR and Decision Trees, DT) and the majority baseline (which always predicts the majority class). 113 6.7 Conclusion In this chapter we explored an application of the relationship identification task in the domain of MOOC discussion forums. A student’s question on the forum is viewed as a request to initiate a deeper relationship with the instructor, and the instructor’s reply is viewed as the initiation of the relationship. With this view, we address the task of initiation or existence of such instructor-student relationships. Specifically, we address the task of predicting instructor intervention in MOOC dis- cussion forums. For this problem, we analyzed the text of the thread while viewing the individual posts in context of each other. We achieved this by presenting models that incorporate the structure of the thread using latent variables. Our experiments on forum data from two different Coursera MOOCs showed that utilizing thread structure is important for predicting instructors behavior. We further applied one of our models to another task of identifying if a desire expressed in a short piece of narrative text got fulfilled. We extracted information from individual sentences in the text while viewing them in context of each other and empirically demonstrated the importance of incorporating this context and the utility of our model. 114 Chapter 7: Conclusion 7.1 Summary In this dissertation we presented methods to model inter-personal relation- ships from text. We began with the problem of jointly inferring cooperative or non-cooperative relationships between characters of a narrative, and showed how the task of characterizing the relationship between two characters can benefit from incorporating information about their relationships with others. We then addressed modeling evolving relationships as a sequence prediction task. For a given pair of characters, we identified a sequence of variables depicting the evolving relationship between the two characters. Each variable in the sequence was binary in nature and indicated a cooperative or a non-cooperative relationship status at a juncture in the narrative. We empirically demonstrated how this task could benefit from including historical dependencies. In the following chapter, instead of specifying and restricting relationship types, we enabled our models to automatically discover various types of relation- ships in a data-driven manner. Some examples of relationship types discovered by our model include familial, adversarial, romantic, etc. Like the previous chapter, we modeled evolution of inter-personal relationships. We also demonstrated that 115 apart from incorporating historical information about the relationship of the two characters of interest, it is beneficial to maintain a global belief about the overall nature of their relationships. Finally, we presented a practical application of this task in the domain of MOOC (Massive Open Online Courses) discussion forums. While viewing forums as platforms where their users (including students and instructors) interact and form relationships with each other, we focused on predicting the initiation of an involved relationship between students and instructors. In other words, we addressed the problem of analyzing forum threads to identify when an instructor would intervene on the thread. We showed that apart from incorporating various linguistic and domain-specific cues, it is important to model the discourse structure of discussion threads. We additionally demonstrated that such a model could also be used to address other problems, such as reading a piece of text to identify if the desire expressed in it was fulfilled. An underlying theme that subsumes all of the above tasks is viewing these problems as structured predictions tasks that require incorporating linguistic cues as well as their contexts of appearance. 7.2 Future Work The rest of this section discusses plausible directions for future work. 116 7.2.1 Domain-specific Models The models presented in Chapters 3 to 5 could benefit from including domain- specific knowledge about narrative genres, author styles and preferences, etc. For example, while modeling inter-character relationships in Jane Austen’s novels, it could help to include the information that her novels are mostly romantic. Similarly, while inferring inter-character relationships in narratives that are a part of a series (e.g. Harry Potter novels, Sherlock Holmes stories etc.), it might help to include a longer historical context or background knowledge about the relation- ships between the characters of interest in the previous narratives or the complete series. 7.2.2 Enhancing Latent Variable Models In Chapter 5 we presented an unsupervised method that discovers various types of relationships in a data-driven manner. It assumes that sentences depicting relationships between two given characters belong to latent clusters, and each of these clusters represents a type of relationship. Such a model could benefit from a weak supervision, which can guide the cluster to discover the types of relationships we expect them to learn in a given dataset. One way to achieve this is by optionally ‘seeding’ the clusters with representative words. For example, while exploring a dataset of novels, including crime and romantic novels, one cluster could be seeded with words like ‘love’ and ‘marriage’ while another could be seeded with ‘kill’ and ‘attack’. While such a model will be more inclined to learn these two specific clus- 117 ters, it would still have the ability to discover unseeded clusters thus leveraging the advantages of unsupervised data-driven methods. A similar idea could be applied to the models presented in Chapter 6. These models also assume that posts of a discussion forum thread belong to latent cate- gories and learn these categories in a data-driven manner. These categories can also be seeded with domain-specific words or cues, enhancing their semantic coherence. Another idea to improve the meaningfulness of the learned clusters is by en- forcing the learned clusters to look diverse. This could be achieved by including an additional regularizer in the objective function of the models that aims at minimiz- ing the inter-cluster similarity while maximizing the intra-cluster similarity. Such methods could be expected to improve the semantics of the learned clusters. 7.2.3 Applications to other NLP problems This dissertation primarily focused on modeling inter-personal relationships. Apart from assisting in understanding people’s actions and goals in text, future work could study applications of the ideas presented in this dissertation to various domains. For example, we showed one such application in the domain of MOOC dis- cussion forums. Another task, which can benefit from studying relationships, is that of automatically recommending recipient(s) for an email based on its text. Includ- ing information about the style or content of the email coupled with the (evolving) relationship of the composer with other his/her contacts (such as formal/informal, supervisor/subordinate, colleagues/family etc.), can help in obtaining novel per- 118 spectives and solutions. For example, an email discussing planning the user’s son’s birthday party is more likely to be meant for family members than office colleagues. Another use-case in this domain would be to curate the incoming emails/messages according to the recipients relationship with the sender. For example, a user might want all emails from family members to automatically get redirected to a dedi- cated folder. Similarly the user might want all emails from his/her supervisor to be automatically marked as important. It is also possible to explore similar applications in related domains like instant messaging systems, social networking sites, etc. In the purview of recommendation systems, this research could apply to investigate the role of relationships in modeling individual preferences, and vice versa. For example, people in close relationships might be expected to reflect similar preferences in movies, books, music, etc. A more grounded use-case could be support or therapy groups. One could model the evolution of the relationship between the therapist and the patient from the content of their exchanges over social media, emails or text messages. This could assist therapists and moderators or administrators in analyzing if the trajectory of the relationship is constructive, or as intended. 7.2.4 Widening the definition of relationships The work presented in this dissertation made several assumptions about re- lationships that can be relaxed in future work. For example, in the current work, we treat all relationships as symmetric. Future work could focus on studying asym- 119 metric relationships, such as unrequited love. Potential models operating with this definition could model relationships to be mutual in most cases but allow for asym- metric relationships in some cases. The current work also doesn’t allow simultaneous existence of multiple relationship- types. It assumes that at any point of time, there can be only one type of relationship between the participants. Future work could relax this criterion to allow for more nuanced modeling of relationships while treating this as a multi-label classification problem. For example, siblings could also be rivals, or a person’s supervisor could also be his/her friend, etc. An interesting aspect of this work would be limit co- existence of certain relationship pairs. For example it is very unlikely for friends to also be mutually hostile. It would be interesting to learn such compatibilities or incompatibilities between labels using data-driven methods. A related future direction could be to model relationship between characters as a mixture of several relationship-types. Such methods could use probabilistic models or mixture models to infer statements like a given pair of characters can be described as 90% friends and 10% family-members. Such models could be used to discover subtle relationships that are not explicitly stated in the text. For example, in a story two people might have a romantic attraction for each other but not demonstrate it obviously or might even exhibit a feeling of initial distaste towards each other. A human reader might have an inkling that the dislike might eventually lead to a more intimate relationship later. It would be interesting to see if such mixture models or probabilistic models could discover such hidden relationships by assigning a non-trivial weight/probability to the romantic relationship even when 120 the explicitly demonstrated relationship is of dislike. 7.2.5 Relevance to Digital Humanities Research in digital humanities focuses on studying large collections of text. Our relationship identification models could be used to analyze news corpora and Wikipedia articles to study relationships and their evolution between political lead- ers and organizations (like political parties), notable individuals (like the pope, vi- sionary scientists, etc.), and geo-political or business entities (like USA, UN, NATO), etc. Figure 7.1 shows example of a social network of political entities extracted from a Wikipedia article using the model presented in Chapter 3. In this network, each node represents a political entity and the edges labeled with + or - signs represent cooperative and non-cooperative relationships respectively. This example indicates a possible use-case of this work as a tool for analyzing digital archives and political data. Our ideas could also be extended to analyze and discover patterns from large collections of literary works. For example our data-driven method in Chapter 5 dis- covered relationship states with subtle differences which can be difficult to discover manually (such as Casual and Verbal in Table 5.1), and can be of particular inter- est to literary scholars. The Casual state, depicting physical activity, might be a stronger indicator of beginning of a more intimate relationship than the Verbal state. Future work based on this research could also conduct large-scale studies aimed at answering literary questions like “Do certain authors or novels portray relationships 121 Figure 7.1: Network Inferred from the Wikipedia article on 2003 Invasion of Iraq. The + and - signs indicate cooperative and non-cooperative relationships respec- tively. of desire more than others?” [87]; “Do Jane Austen’s female and male protago- nists have a pattern in their evolving relationship (e.g. mutual disdain followed by romantic love)?” [19, 100, 57] etc. 7.2.6 Relevance to Psychological and Sociological theories The ideas presented in this dissertation could also be extended to analyze theories related to inter-personal relationships. For example, Levinger [67] stud- ied development of relationships, and suggested that relationships have a timeline or lifespan and several life-stages. For instance, romantic relationships have stages like Acquaintance, Buildup, Continuation, Deterioration, and Termination. In fu- ture, our models could be customized to study the various stages of certain types 122 of relationships. For example, while focusing only on cues related to romantic rela- tionships, our models could analyze if fictional or real accounts of love stories indeed follow such timelines. Alternatively, they could attempt at exploring whether there is a connection between success/popularity of such narratives and their proximity to such theoretical timelines. Thus, the current work suggests several interesting avenues for improving and extending existing models, as well as using them to solve other interesting problems from domains ranging from NLP to Psychology. Improved methods for solving these problems could also focus on enriched models of character attributes, goals and intentions, as well grounding character personae in particular narratives to entities in a knowledge base. Such joint models could benefit from treating text as a reflection of social phenomenon, and incorporating world knowledge from multiple sources to assist in its comprehension. 123 Appendix A: Frames Lexicon Lists of frames used by our Semantic-parse based features. ‘Frames’ and ‘Frame-elements’ used in the following table refer to those fired during frame- semantic parsing of the text using Semafor [33]. Type Frame Frame-elements Negative abusing victim, abuser attack assailant, victim avoiding agent, undesirable situation besieging assailant, victim cause emotion agent, experiencer cause harm agent, victim competition participant 1, participant 2, participants defending assailant, victim destroying destroyer, undergoer endangering agent, valued entity experience bodily harm experiencer, body part 124 Type Frame Frame-elements Negative fall for deception, victim fear stimulus, experiencer firing employer, employee giving in compeller, capitulator going back on a commitment protagonist, affected party hindering protagonist, action hit target agent, target hostile encounter side 1, side 2, sides immobilization agent, patient inhibit movement agent, theme intentional deception deceiver, victim kidnapping perpetrator, victim, co-participant killing killer, victim manipulate into doing manipulator, victim offenses perpetrator, victim piracy perpetrator, victim prevent from having agent, protagonist protecting danger, asset quarreling arguer1, arguer2, arguers 125 Type Frame Frame-elements Negative rape perpetrator, victim revenge avenger, offender, injured party taking captive agent, captive thwarting protagonist, preventing cause trap deceiver, victim violence aggressor, aggressors, victim want suspect suspect Table A.1: Lists of negative frames. Type Frame Frame-elements Positive alliance member 1, member 2, alliance be in agreement on assessment cognizer 1, cognizer 1, cognizers collaboration partner 1, partner 2, partners chatting interlocutor 1, interlocutor 2 choosing cognizer, chosen come together configuration, individuals, party 1, party 2 commitment speaker, addressee commonality entity 1, entity 2, entities defending defender, victim 126 Type Frame Frame-elements Positive desiring experiencer, focal participant emotion active experiencer, topic personal relationship partner 1, partner 2, partners forgiveness judge, evaluee forming relationships partner 1, partner 2, partners grooming agent, patient hospitality host, guest make agreement on action party 1, party 2, parties offering offerer, potential recipient participation participant 1, participant 2, participants protecting protection, asset reassuring speaker, experiencer releasing captor, theme reparation wrongdoer, wrongdoer rescuing agent, asset, patient reveal secret speaker, addressee sharing protagonist 1, protagonist 2, protagonists sign agreement signatory, co-participant social connection individual 1, individual 2, individuals 127 Type Frame Frame-elements Positive social event attendee social event collective attendees social event individuals party 1, party 2 suasion speaker, addressee subjective influence agent, cognizer supply supplier, recipient supporting supporter, supported visiting agent, entity warning speaker, addressee Table A.2: Lists of positive frames. 128 Type Frame Frame-elements Ambiguous cause bodily experience agent, experiencer cause to experience agent, experiencer cotheme theme, cotheme experiencer focus experiencer friendly or hostile side 1, side 2, sides manipulation agent, entity respond to proposal interlocutor, speaker Table A.3: Lists of ambiguous frames. Type Frame Frame-elements Relationship kinship alter, ego, relatives forming relationships partner 1, partner 2 personal relationship partner 1, partner 2 subordinates and superiors superior, subordinate Table A.4: Lists of relationship frames. 129 Appendix B: Conformity Lexicon Our sustenance features described in Section 6.6.3 use a list of conforming and dissenting phrases. These phrases were chosen manually using various discourse senses mentioned in Prasad et al. [88]. For conforming phrase list we considered the explicit connectives in the fol- lowing discourse senses: ‘Contra-expectation’, ‘Contrast’, ‘Contrast/Precedence’, ‘Contrast/Synchrony’, ‘Contrast/Temporal’, ‘Opposition’ but discarded very fre- quent phrases like ‘and’, ‘if’, ‘or’, ‘then’, ‘when’, ‘while’, ‘as if’, ‘in the end’. Similarly, for the dissenting phrase list, we considered the following senses: ‘Equivalence’, ‘Instantiation’, ‘Precedence/Result’, ‘Reason’, ‘Reason/Restatement’, ‘Reason/Synchrony’, ‘Result’, ‘Specification’, ‘Specification/Succession’, ‘Specifica- tion/Synchrony’ and discarded ‘and’, ‘as’, ‘or’, ‘for’, ‘then’, ‘when’, ‘rather’, ‘in turn’, ‘but’, ‘if only’. The complete list of these phrases is shown in Table B.1. 130 Type Phrases Conforming ‘in other words’, ‘indeed’, ‘for example’, ‘for instance’, ‘in fact’, ‘in particular’, ‘finally’, ‘ultimately’, ‘apparently because’, ‘at least partly because’, ‘because’, ‘especially as’, ‘especially because’, ‘es- pecially since’, ‘in large part because’, ‘in part because’, ‘insofar as’, ‘just because’, ‘largely because’, ‘mainly because’, ‘merely be- cause’, ‘not because’, ‘not only because’, ‘now that’, ‘only because’, ‘particularly as’, ‘particularly because’, ‘particularly since’, ‘partly because’, ‘perhaps because’, ‘presumably because’, ‘primarily be- cause’, ‘simply because’, ‘since’, ‘so’, ‘accordingly’, ‘as a result’, ‘consequently’, ‘hence’, ‘in the end’, ‘in turn ’, ‘largely as a result’, ‘so that’, ‘thereby’, ‘therefore’, ‘thus’, ‘also’, ‘as though’, ‘much as’, ‘overall’, ‘specifically’, ‘especially after’, ‘especially when’ Dissenting ‘although’, ‘but’, ‘by comparison’, ‘by contrast’, ‘conversely’, ‘even though’, ‘however’, ‘in contrast’, ‘in fact’, ‘instead’, ‘rather’, ‘never- theless’, ‘nonetheless’, ‘nor’, ‘on the contrary’, ‘on the other hand’, ‘meanwhile’, ‘still’, ‘though’, ‘whereas’, ‘yet’, ‘even as’, ‘even if’, ’even still’, ‘even then’, ‘regardless’, ‘neither’ Table B.1: Lists of conforming and dissenting phrases used by the Sustenance fea- tures. 131 Bibliography [1] Elizabeth Abel. (E)Merging Identities: The Dynamics of Female Friendship in Contemporary Fiction by Women. Signs, 6(3):413–435, 1981. ISSN 00979740, 15456943. URL http://www.jstor.org/stable/3173754. [2] Apoorv Agarwal, Anup Kotalwar, Jiehan Zheng, and Owen Rambow. SIN- NET: Social Interaction Network Extractor from Text. In Sixth International Joint Conference on Natural Language Processing, IJCNLP 2013, Nagoya, Japan, October 14-18, 2013, pages 33–36, 2013. URL http://aclweb.org/ anthology/I/I13/I13-2009.pdf. [3] Richard D Alexander. The Evolution of Social Behavior. Annual review of ecology and systematics, pages 325–383, 1974. [4] Irwin Altman and Dalmas A Taylor. Social Penetration: The Development of Interpersonal Relationships. Holt, Rinehart & Winston, 1973. [5] Ashton Anderson, Daniel P. Huttenlocher, Jon M. Kleinberg, and Jure Leskovec. Engaging with Massive Online Courses. In 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic of Korea, April 7-11, 2014, pages 687–698, 2014. doi: 10.1145/2566486.2568042. URL http://doi.acm.org/10.1145/2566486.2568042. [6] Yoav Artzi, Patrick Pantel, and Michael Gamon. Predicting Responses to Microblog Posts. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT ’12, pages 602–606, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics. ISBN 978-1-937284-20-6. URL http://dl.acm.org/citation.cfm?id=2382029.2382126. [7] Lars Backstrom, Jon Kleinberg, Lillian Lee, and Cristian Danescu-Niculescu- Mizil. Characterizing and Curating Conversation Threads: Expansion, Focus, Volume, Re-entry. In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, WSDM ’13, pages 13–22, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-1869-3. doi: 10.1145/2433396.2433401. URL http://doi.acm.org/10.1145/2433396.2433401. 132 [8] Collin F. Baker, Charles J. Fillmore, and John B. Lowe. The Berkeley FrameNet Project. In 36th Annual Meeting of the Association for Com- putational Linguistics and 17th International Conference on Computational Linguistics, COLING-ACL’98, August 10-14, 1998, Universite´ de Montre´al, Montre´al, Quebec, Canada. Proceedings of the Conference., pages 86–90, 1998. URL http://aclweb.org/anthology/P/P98/P98-1013.pdf. [9] David Bamman. People-Centric Natural Language Processing. PhD thesis, Language Technologies Institute, School of Computer Science, Carnegie Mel- lon University, 2015. [10] David Bamman and Noah Smith. Unsupervised Discovery of Biographical Structure from Text. Transactions of the Association for Computational Lin- guistics, 2:363–376, 2014. ISSN 2307-387X. URL https://tacl2013.cs. columbia.edu/ojs/index.php/tacl/article/view/371. [11] David Bamman, Brendan O’Connor, and Noah A. Smith. Learning Latent Personas of Film Characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, pages 352–361, 2013. URL http: //aclweb.org/anthology/P/P13/P13-1035.pdf. [12] David Bamman, Ted Underwood, and Noah A. Smith. A Bayesian Mixed Effects Model of Literary Character. In Proceedings of the 52nd Annual Meet- ing of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 370–379, 2014. URL http://aclweb.org/anthology/P/P14/P14-1035.pdf. [13] Leonard E. Baum and Ted Petrie. Statistical Inference for Probabilistic Func- tions of Finite State Markov Chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966. [14] Eric Berne. Games People Play: The Psychology of Human Relationships, volume 2791. Penguin UK, 1967. [15] Lynda E. Boose. The Father and the Bride in Shakespeare. PMLA, 97(3):325– 347, 1982. ISSN 00308129. URL http://www.jstor.org/stable/462226. [16] P. Oscar Boykin and Vwani P. Roychowdhury. Personal Email Networks: An Effective Anti-Spam Tool. CoRR, cond-mat/0402143, 2004. URL http: //arxiv.org/abs/cond-mat/0402143. [17] Philip Bramsen, Martha Escobar-Molano, Ami Patel, and Rafael Alonso. Ex- tracting Social Power Relationships from Natural Language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguis- tics: Human Language Technologies-Volume 1, pages 773–782. Association for Computational Linguistics, 2011. 133 [18] Jerome Bruner. Actual Minds. Possible Worlds. Harvard University Press, Cambridge, 1986. [19] Marilyn Butler. Jane Austen and the War of Ideas. Clarendon Press, 1975. ISBN 9780198120681. URL https://books.google.com/books?id= lCvuAAAAMAAJ. [20] Dorwin Cartwright and Frank Harary. Structure Balance: A Generalization of Heiders Theory. Psychological review, 63(5), 1956. [21] Nathanael Chambers. Event Schema Induction with a Probabilistic Entity- Driven Model. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1797–1807, 2013. URL http://aclweb. org/anthology/D/D13/D13-1185.pdf. [22] Nathanael Chambers and Dan Jurafsky. Unsupervised Learning of Narra- tive Schemas and their Participants. In ACL 2009, Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 2-7 August 2009, Singapore, pages 602–610, 2009. URL http: //www.aclweb.org/anthology/P09-1068. [23] Nathanael Chambers and Daniel Jurafsky. Unsupervised Learning of Narra- tive Event Chains. In ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA, pages 789–797, 2008. URL http://www.aclweb.org/anthology/ P08-1090. [24] Jonathan Chang, Jordan Boyd-Graber, Chong Wang, Sean Gerrish, and David M. Blei. Reading Tea Leaves: How Humans Interpret Topic Models. In Neural Information Processing Systems, 2009. [25] Ming-Wei Chang, Dan Goldwasser, Dan Roth, and Vivek Srikumar. Dis- criminative Learning over Constrained Latent Representations. In Human Language Technologies: The 2010 Annual Conference of the North Ameri- can Chapter of the Association for Computational Linguistics, HLT ’10, pages 429–437, Stroudsburg, PA, USA, 2010. Association for Computational Lin- guistics. ISBN 1-932432-65-5. URL http://dl.acm.org/citation.cfm?id= 1857999.1858065. [26] Snigdha Chaturvedi, Dan Goldwasser, and Hal Daume´ III. Predicting In- structor’s Intervention in MOOC forums. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1501–1511. Association for Computational Linguistics, 2014. URL http://aclweb.org/anthology/P14-1141. 134 [27] Snigdha Chaturvedi, Dan Goldwasser, and Hal Daume´ III. Ask, and shall you receive? : Understanding Desire Fulfillment in Natural Language Text. In Proceedings of the Thirtieth Conference on Artificial Intelligence, AAAI’16, 2016. [28] Snigdha Chaturvedi, Shashank Srivastava, Hal Daume´ III, and Chris Dyer. Modeling Evolving Relationships between Characters in Literary Novels. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, 2016. [29] Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Vanderwende. Probabilis- tic Frame Induction. In Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, June 9-14, 2013, Westin Peachtree Plaza Hotel, Atlanta, Geor- gia, USA, pages 837–846, 2013. URL http://aclweb.org/anthology/N/ N13/N13-1104.pdf. [30] Robert Coles. The Call of Stories: Teaching and the Moral Imagination. Houghton Mifflin, Boston, MA, 1989. [31] Michael Collins. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10, EMNLP ’02, pages 1–8, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics. doi: 10.3115/1118693.1118694. URL http: //dx.doi.org/10.3115/1118693.1118694. [32] Cristian Danescu-Niculescu-Mizil, Lillian Lee, Bo Pang, and Jon M. Kleinberg. Echoes of Power: Language Effects and Power Differences in Social Interac- tion. In Proceedings of the 21st World Wide Web Conference 2012, WWW 2012, Lyon, France, April 16-20, 2012, pages 699–708, 2012. doi: 10.1145/ 2187836.2187931. URL http://doi.acm.org/10.1145/2187836.2187931. [33] Dipanjan Das, Desai Chen, Andre´ F. T. Martins, Nathan Schneider, and Noah A. Smith. Frame-Semantic Parsing. Computational Linguistics, 40(1): 9–56, 2014. doi: 10.1162/COLI a 00163. URL http://dx.doi.org/10.1162/ COLI_a_00163. [34] James A. Davis. Clustering and Structural Balance in Graphs. Clustering and structural balance in graphs, 20(2):181–187, 1967. [35] Munmun De Choudhury, Hari Sundaram, Ajita John, and Dore´e Duncan Seligmann. What Makes Conversations Interesting? Themes, Participants and Consequences of Conversations in Online Social Media. In 18th International World Wide Web Conference (WWW), pages 331–331, April 2009. URL http: //www2009.eprints.org/34/. 135 [36] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum Like- lihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39(1):1–38, 1977. [37] Mark Edmundson. The Trouble With Online Education, July 19 2012. http://www.nytimes.com/2012/07/20/opinion/ the-trouble-with-online-education.html. [38] Lajos Egri. The Art of Dramatic Writing. Simon and Schuster, New York, NY, USA, 1946. [39] Khalid El-Arini, Min Xu, Emily B. Fox, and Carlos Guestrin. Represent- ing Documents Through Their Readers. In Inderjit S. Dhillon, Yehuda Ko- ren, Rayid Ghani, Ted E. Senator, Paul Bradley, Rajesh Parekh, Jingrui He, Robert L. Grossman, and Ramasamy Uthurusamy, editors, KDD, pages 14– 22. ACM, 2013. ISBN 978-1-4503-2174-7. URL http://dblp.uni-trier.de/ db/conf/kdd/kdd2013.html#El-AriniXFG13. [40] David K. Elson. Modeling Narrative Discourse. PhD thesis, Columbia Uni- versity, 2012. [41] David K Elson. Detecting Story Analogies from Annotations of Time, Action and Agency. In Proceedings of the LREC 2012 Workshop on Computational Models of Narrative, Istanbul, Turkey, 2012. [42] David K. Elson, Nicholas Dames, and Kathleen McKeown. Extracting So- cial Networks from Literary Fiction. In ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, pages 138–147, 2010. URL http://www.aclweb.org/ anthology/P10-1015. [43] Susan Engle. The Stories Children Tell: Making Sense of the Narratives of Childhood. W.H. Freeman & Company, NY, 1995. [44] Ursula Fanning. Sentimental Subversion: Representations of Female Friend- ship in the Work of Matilde Serao. Annali d’Italianistica, 7:273–286, 1989. ISSN 07417527. URL http://www.jstor.org/stable/24003871. [45] Pedro F. Felzenszwalb, Ross B. Girshick, David McAllester, and Deva Ra- manan. Object Detection with Discriminatively Trained Part-Based Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627– 1645, 2010. ISSN 0162-8828. doi: http://doi.ieeecomputersociety.org/10.1109/ TPAMI.2009.167. [46] Song Feng, Jun Seok Kang, Polina Kuznetsova, and Yejin Choi. Connotation Lexicon: A Dash of Sentiment Beneath the Surface Meaning. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, pages 136 1774–1784, 2013. URL http://aclweb.org/anthology/P/P13/P13-1174. pdf. [47] Mark Alan Finlayson. Learning Narrative Structure from Annotated Folktales. PhD thesis, Massachusetts Institute of Technology, 2012. [48] Jane M. Ford. Patriarchy and Incest from Shakespeare to Joyce. University Press of Florida, 1998. [49] Gillian Frith. The Intimacy which is Knowledge: Female Friendships in the Novels of Women Writers. PhD thesis, University of Warwick, December 1988. [50] Jennifer Golbeck and James A. Hendler. Reputation Network Analysis for Email Filtering. In CEAS 2004 - First Conference on Email and Anti-Spam, July 30-31, 2004, Mountain View, California, USA, 2004. URL http://www. ceas.cc/papers-2004/177.pdf. [51] Andrew B Goldberg, Nathanael Fillmore, David Andrzejewski, Zhiting Xu, Bryan Gibson, and Xiaojin Zhu. May All Your Wishes Come True: A Study of Wishes and How to Recognize Them. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 263–271. Association for Computational Linguistics, 2009. [52] Debra Goodlett. Love and Addiction in ‘Wuthering Heights’. The Review of English Studies, 37(3), 1996. [53] Amit Goyal, Ellen Riloff, and Hal Daume´ III. Automatically Producing Plot Unit Representations for Narrative Text. In Proceedings of the 2010 Confer- ence on Empirical Methods in Natural Language Processing, EMNLP 2010, 9-11 October 2010, MIT Stata Center, Massachusetts, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 77–86, 2010. URL http://www.aclweb.org/anthology/D10-1008. [54] Anatoliy Gruzd. Automated Discovery of Social Networks in Text-based On- line Communities. In Proceedings of the ACM 2009 International Conference on Supporting Group Work, GROUP ’09, pages 379–380, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-500-0. doi: 10.1145/1531674.1531733. URL http://doi.acm.org/10.1145/1531674.1531733. [55] Ahmed Hassan, Amjad Abu-Jbara, and Dragomir Radev. Extracting Signed Social Networks from Text. In Workshop Proceedings of TextGraphs-7 on Graph-based Methods for Natural Language Processing, pages 6–14. Associa- tion for Computational Linguistics, 2012. [56] Fritz Heider. Attitudes and Cognitive Organization. Journal of Psychology, 21:107–112, 1946. 137 [57] Charles H. Hinant. Jane Austen’s “Wild Imagination”: Romance and the Courtship Plot in the Six Canonical Novels. Narrative, 14(3):294–310, 2006. URL http://www.jstor.org/stable/20107392. [58] Jonathan Huang, Chris Piech, Andy Nguyen, and Leonidas J. Guibas. Syn- tactic and Functional Variability of a Million Code Submissions in a Machine Learning MOOC. In AIED Workshops, 2013. [59] Liang Huang, Suphan Fayong, and Yang Guo. Structured Perceptron with Inexact Search. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 142–151. Association for Computational Linguistics, 2012. [60] Lynn Jamieson and Lynn Jameison. Intimacy. Polity Press Cambridge, 1998. [61] Thorsten Joachims, Thomas Finley, and Chun-Nam John Yu. Cutting- plane Training of Structural SVMs. Machine Learning, 77(1):27–59, 2009. doi: 10.1007/s10994-009-5108-8. URL http://dx.doi.org/10.1007/ s10994-009-5108-8. [62] Paramjit S. Judge. Foundations of Classical Sociological Theory: Functional- ism, Conflict and Action. Pearson, 1 edition, March 2012. ISBN 0898591384. [63] Jon M. Kleinberg. Computational Perspectives on Social Phenomena at Global Scales. In Francesca Rossi, editor, IJCAI 2013, Proceedings of the 23rd In- ternational Joint Conference on Artificial Intelligence, Beijing, China, Au- gust 3-9, 2013. IJCAI/AAAI, 2013. ISBN 978-1-57735-633-2. URL http: //www.aaai.org/ocs/index.php/IJCAI/IJCAI13/paper/view/6896. [64] Vinodh Krishnan and Jacob Eisenstein. “You’re Mr. Lebowski, I’m the Dude”: Inducing Address Term Formality in Signed Social Networks. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Asso- ciation for Computational Linguistics: Human Language Technologies, Den- ver, Colorado, USA, May 31 - June 5, 2015, pages 1616–1626, 2015. URL http://aclweb.org/anthology/N/N15/N15-1185.pdf. [65] Wendy G. Lehnert. Plot Units and Narrative Summarization. Cogni- tive Science, 5(4):293–331, 1981. doi: 10.1207/s15516709cog0504 1. URL http://dx.doi.org/10.1207/s15516709cog0504_1. [66] Jure Leskovec, Daniel P. Huttenlocher, and Jon M. Kleinberg. Signed Net- works in Social Media. In Proceedings of the 28th International Conference on Human Factors in Computing Systems, CHI 2010, Atlanta, Georgia, USA, April 10-15, 2010, pages 1361–1370, 2010. doi: 10.1145/1753326.1753532. URL http://doi.acm.org/10.1145/1753326.1753532. [67] G Levinger. Development and Change. H.H. Kelley, et al. (Eds.), Close relationships, pages 315–359, 1983. 138 [68] Chu-Cheng Lin, Waleed Ammar, Chris Dyer, and Lori Levin. Unsuper- vised POS Induction with Word Embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computa- tional Linguistics: Human Language Technologies, pages 1311–1316, Denver, Colorado, May–June 2015. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/N15-1144. [69] Bing Liu, Minqing Hu, and Junsheng Cheng. Opinion Observer: Analyzing and Comparing Opinions on the Web. In Proceedings of the 14th international conference on World Wide Web, WWW 2005, Chiba, Japan, May 10-14, 2005, pages 342–351, 2005. doi: 10.1145/1060745.1060797. URL http://doi.acm. org/10.1145/1060745.1060797. [70] Bernardo Magnini, Roberto Zanoli, Ido Dagan, Kathrin Eichler, Guenter Neu- mann, Tae-Gil Noh, Sebastian Pado´, Asher Stern, and Omer Levy. The excite- ment open platform for textual inferences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22- 27, 2014, Baltimore, MD, USA, System Demonstrations, pages 43–48, 2014. URL http://aclweb.org/anthology/P/P14/P14-5008.pdf. [71] Inderjeet Mani. Computational Narratology. Hu¨hn, Peter et al. (eds.): the living handbook of narratology, 2012. URL hup.sub.uni-hamburg.de/lhn/ index.php?title=ComputationalNarratology&oldid=2030. [72] Inderjeet Mani. Computational Modeling of Narrative. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2012. ISBN 9781608459810. doi: 10.2200/S00459ED1V01Y201212HLT018. URL http: //dx.doi.org/10.2200/S00459ED1V01Y201212HLT018. [73] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. The Stanford CoreNLP Natural Lan- guage Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22-27, 2014, Baltimore, MD, USA, System Demonstrations, pages 55–60, 2014. URL http://aclweb.org/anthology/P/P14/P14-5010.pdf. [74] Philip Massey, Patrick Xia, David Bamman, and Noah Smith. Annotating Character Relationships in Literary Texts. ArXiv e-prints, 2015. [75] Andrew McCallum, Xuerui Wang, and Andre´s Corrada-Emmanuel. Topic and Role Discovery in Social Networks with Experiments on Enron and Academic Email. Journal of Artificial Intelligence Research (JAIR), 30:249–272, 2007. doi: 10.1613/jair.2229. URL http://dx.doi.org/10.1613/jair.2229. [76] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Dis- tributed Representations of Words and Phrases and their Compositionality. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, 139 editors, Advances in Neural Information Processing Systems 26, pages 3111– 3119. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/ 5021-distributed-representations-of-words-and-phrases-and-their-compositionality. pdf. [77] George A. Miller. WordNet: A Lexical Database for English. Commun. ACM, 38(11):39–41, November 1995. ISSN 0001-0782. doi: 10.1145/219717.219748. URL http://doi.acm.org/10.1145/219717.219748. [78] Jeff Mitchell and Mirella Lapata. Vector-based Models of Semantic Compo- sition. In ACL 2008, Proceedings of the 46th Annual Meeting of the Associa- tion for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA, pages 236–244, 2008. URL http://www.aclweb.org/anthology/P08-1028. [79] Raymond J. Mooney and Gerald DeJong. Learning Schemata for Natural Language Processing. In Proceedings of the 9th International Joint Conference on Artificial Intelligence. Los Angeles, CA, August 1985, pages 681–687, 1985. [80] F. Mosteller and D. L. Wallace. Inference and Disputed Authorship: The Federalist Papers. Addison-Wesley, Reading, Mass., 1964. [81] Kevin Murphy. Hidden Markov Model (HMM) Toolbox for Matlab. http: //people.cs.ubc.ca/~murphyk/Software/HMM/hmm.html, 1998. [82] Andy Nguyen, Christopher Piech, Jonathan Huang, and Leonidas J. Guibas. Codewebs: Scalable Homework Search for Massive Open Online Programming Courses. In WWW, pages 491–502, 2014. [83] John Walker Orr, Prasad Tadepalli, Janardhan Rao Doppa, Xiaoli Fern, and Thomas G. Dietterich. Learning Scripts as Hidden Markov Models. In Proceed- ings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27 -31, 2014, Que´bec City, Que´bec, Canada., pages 1565–1571, 2014. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8603. [84] Karl Pichotta and Raymond J. Mooney. Learning Statistical Scripts with LSTM Recurrent Neural Networks. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI-16), Phoenix, Arizona, Febru- ary 2016. URL http://www.cs.utexas.edu/users/ai-lab/pub-view.php? PubID=127543. [85] Chris Piech, Jonathan Huang, Zhenghao Chen, Chuong Do, Andrew Ng, and Daphne Koller. Tuned Models of Peer Assessment in MOOCs. In Proceed- ings of The 6th International Conference on Educational Data Mining (EDM 2013), 2013. [86] Plato and W. Hamilton. The Symposium. Penguin classics. Penguin Books, 1951. URL https://books.google.com/books?id=1fIDAQAAIAAJ. 140 [87] Robert M. Polhemus. Being in Love from Jane Austen to DH Lawrence. Chicago: University of Chicago Press, 1990. [88] Rashmi Prasad, Eleni Miltsakaki, Nikhil Dinesh, Alan Lee, Aravind Joshi, Livio Robaldo, and Bonnie Webber. The Penn Discourse Tree-bank 2.0 An- notation Manual. Technical report, University of Pennsylvania, Institute for Research in Cognitive Science, 2007. URL http://www.seas.upenn.edu/ ~pdtb/PDTBAPI/pdtb-annotation-manual.pdf. [89] Vladimir Iakovlevich Propp. Morphology of the Folktale. University of Texas, 1968. URL http://books.google.com/books?isbn=0292783760. [90] Arti Ramesh, Dan Goldwasser, Bert Huang, Hal Daume´ III, and Lise Getoor. Learning Latent Engagement Patterns of Students in Online Courses. In Pro- ceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, July 27 -31, 2014, Que´bec City, Que´bec, Canada., pages 1272–1278, 2014. URL http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8571. [91] Michaela Regneri, Alexander Koller, and Manfred Pinkal. Learning Script Knowledge with Web Experiments. In ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, pages 979–988, 2010. URL http://www.aclweb.org/ anthology/P10-1100. [92] Matthew Richardson, Christopher J. C. Burges, and Erin Renshaw. MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text. In EMNLP, pages 193–203. ACL, 2013. ISBN 978-1-937284-97-8. URL http: //dblp.uni-trier.de/db/conf/emnlp/emnlp2013.html#RichardsonBR13. [93] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The Author-topic Model for Authors and Documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, UAI ’04, pages 487–494, Arlington, Virginia, United States, 2004. AUAI Press. ISBN 0-9749039-0-6. URL http://dl.acm.org/citation.cfm?id=1036843.1036902. [94] Roger C. Schank and Robert P. Abelson. Scripts, Plans, Goals, and Under- standing: An Inquiry Into Human Knowledge Structures (Artificial Intelli- gence Series). Psychology Press, 1 edition, July 1977. ISBN 0898591384. [95] T. Shinozaki and T. Kawahara. HMM Training based on CV-EM and CV Gaussian Mixture Optimization. In Automatic Speech Recognition Under- standing, 2007. ASRU. IEEE Workshop on, pages 318–322, Dec 2007. doi: 10.1109/ASRU.2007.4430131. [96] Noah A. Smith. Linguistic Structure Prediction. Synthesis Lectures on Hu- man Language Technologies. Morgan & Claypool Publishers, 2011. doi: 10. 2200/S00361ED1V01Y201105HLT013. URL http://dx.doi.org/10.2200/ S00361ED1V01Y201105HLT013. 141 [97] Shashank Srivastava, Snigdha Chaturvedi, and Tom Mitchell. Inferring Inter- personal Relations in Narrative Summaries. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, 2016. [98] Pontus Stenetorp, Sampo Pyysalo, Goran Topic´, Tomoko Ohta, Sophia Ana- niadou, and Jun’ichi Tsujii. BRAT: a Web-based Tool for NLP-assisted Text Annotation. In Proceedings of Demonstrations Session at EACL 2012, pages 102–107, 2012. [99] Asher Stern and Ido Dagan. BIUTEE: A Modular Open-Source System for Recognizing Textual Entailment. In The 50th Annual Meeting of the As- sociation for Computational Linguistics, Proceedings of the System Demon- strations, July 10, 2012, Jeju Island, Korea, pages 73–78, 2012. URL http://www.aclweb.org/anthology/P12-3013. [100] Bruce Stovel. ‘A Contrariety of Emotion’: Jane Austen’s Ambivalent Lovers in Pride and Prejudice. The International Fiction Review, 14(1):27–33, 1987. [101] Glenda S. Stump, Jennifer DeBoer, Jonathan Whittinghill, and Lori Breslow. Development of a Framework to Classify MOOC Discussion Forum Posts: Methodology and Challenges. In NIPS Workshop on Data Driven Education, 2013. [102] Xu Sun, Takuya Matsuzaki, Daisuke Okanohara, and Jun’ichi Tsujii. Latent Variable Perceptron Algorithm for Structured Classification. In IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelli- gence, Pasadena, California, USA, July 11-17, 2009, pages 1236–1242, 2009. URL http://ijcai.org/papers09/Papers/IJCAI09-208.pdf. [103] Oscar Ta¨ckstro¨m and Ryan McDonald. Discovering Fine-grained Sentiment with Latent Variable Structured Prediction Models. In Proceedings of the 33rd European Conference on Advances in Information Retrieval, ECIR’11, pages 368–374, Berlin, Heidelberg, 2011. Springer-Verlag. ISBN 978-3-642-20160-8. URL http://dl.acm.org/citation.cfm?id=1996889.1996937. [104] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large Margin Methods for Structured and Interdependent Output Variables. Journal of Machine Learning Research, 6:1453–1484, Decem- ber 2005. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id= 1046920.1088722. [105] Josep Valls-Vargas, Jichen Zhu, and Santiago Ontan˜o´n. Toward Automatic Role Identification in Unannotated Folk Tales. In Proceedings of the Tenth AAAI Conference on Artificial Intelligence and Interactive Digital Enter- tainment, AIIDE 2014, October 3-7, 2014, North Carolina State Univer- sity, Raleigh, NC, USA, 2014. URL http://www.aaai.org/ocs/index.php/ AIIDE/AIIDE14/paper/view/9004. 142 [106] Max Weber, Guenther Roth, and Claus Wittich. Economy and Soci- ety: An Outline of Interpretive Sociology. University of California Press, 1978. ISBN 9780520035003. URL https://books.google.com/books?id= pSdaNuIaUUEC. [107] Robert West, Hristo S. Paskov, Jure Leskovec, and Christopher Potts. Ex- ploiting Social Network Structure for Person-to-Person Sentiment Analysis. TACL, 2:297–310, 2014. URL https://tacl2013.cs.columbia.edu/ojs/ index.php/tacl/article/view/396. [108] R. S. White. Metamorphosis by Love in Elizabethan Romance, Romantic Comedy, and Shakespeare’s Early Comedies. The Review of English Studies, 35(137):14–44, 1984. ISSN 00346551, 14716968. URL http://www.jstor. org/stable/516810. [109] Robert Wilensky. Understanding Goal-Based Stories. Outstanding Disserta- tions in the Computer Sciences. Garland Publishing, New York, 1978. ISBN 0-8240-4410-X. [110] Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In HLT/EMNLP 2005, Hu- man Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 6-8 October 2005, Vancouver, British Columbia, Canada, 2005. URL http://acl.ldc. upenn.edu/H/H05/H05-1044.pdf. [111] Diyi Yang, David Adamson, and Carolyn Penstein Rose´. Question Recom- mendation with Constraints for Massive Open Online Courses. In Proceedings of the 8th ACM Conference on Recommender Systems, RecSys ’14, pages 49– 56, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2668-1. doi: 10.1145/ 2645710.2645748. URL http://doi.acm.org/10.1145/2645710.2645748. [112] Diyi Yang, Jingbo Shang, and Carolyn Penstein Rose´. Constrained Question Recommendation in MOOCs via Submodularity. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM ’14, pages 1987–1990, New York, NY, USA, 2014. ACM. ISBN 978-1-4503-2598-1. doi: 10.1145/2661829.2662089. URL http://doi. acm.org/10.1145/2661829.2662089. [113] Chun-Nam John Yu and Thorsten Joachims. Learning Structural SVMs with Latent Variables. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pages 1169–1176, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553523. URL http://doi.acm.org/10.1145/1553374.1553523. 143