ABSTRACT

Title of Document: USING A HIGH-DIMENSIONAL MODEL OF SEMANTIC SPACE TO PREDICT NEURAL ACTIVITY.
Alice Freeman Jackson, Doctor of Philosophy, 2014
Directed By: Professor Donald J. Bolger, Department of Human Development and Quantitative Methodology

This dissertation research developed the GOLD model (Graph Of Language Distribution), a graph-structured semantic space model constructed from co-occurrence in a large corpus of natural language, with the intent of exploring what information such a model carries about relationships between words and the degree to which this information can be used to predict brain responses and behavior in language tasks. The present study employed GOLD to examine general relatedness as well as two specific types of relationship between words: semantic similarity, which refers to the degree of overlap in meaning between words, and associative relatedness, which refers to the degree to which two words occur in the same schematic context. It was hypothesized that this graph-structured model of language constructed based on co-occurrence should easily capture associative relatedness, because this type of relationship is thought to be present directly in lexical co-occurrence. Additionally, it was hypothesized that semantic similarity may be extracted from the intersection of the set of first-order connections, because two words that are semantically similar may occupy similar thematic or syntactic roles across contexts and thus would co-occur lexically with the same set of nodes. Based on these hypotheses, a set of relationship metrics was extracted from the GOLD model, and machine learning techniques were used to explore predictive properties of these metrics. GOLD successfully predicted behavioral data as well as neural activity in response to words with varying relationships, and its predictions outperformed those of certain competing models. These results suggest that a single-mechanism account of learning word meaning from context may suffice to account for a variety of relationships between words. Further benefits of graph models of language are discussed, including their transparent record of language experience, easy interpretability, and increased psychological plausibility over models that perform complex transformations of meaning representation.

USING A HIGH-DIMENSIONAL MODEL OF SEMANTIC SPACE TO PREDICT NEURAL ACTIVITY.

By Alice Freeman Jackson

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2014

Advisory Committee:
Professor Donald J. Bolger, Chair
Professor Hal Daumé III
Professor Kevin Dunbar
Professor William Idsardi, Dean's Representative
Professor Meredith Rowe

© Copyright by Alice Freeman Jackson 2014

Dedication

To Andrew and the future.

Acknowledgements

I was extraordinarily lucky to join the lab of DJ Bolger, a scientist of outstanding intellectual honesty, dogged persistence, and an irrepressible talent for finding time to talk. Without DJ's confident support of my work and benevolent leadership of the lab, my graduate career would have taken a very different course and this dissertation certainly would never have taken shape. I am also indebted to Tracy Riggins on this front, as her cheerful presence, advice, and generous sharing of her EEG system, lab space, and expertise made this research possible.
I must thank my most patient committee members, Drs. Donald J. Bolger, Hal Daumé, Kevin Dunbar, William Idsardi, and Meredith Rowe, for their role in shaping and improving the present work. I am gratified to have forged friendships with my fellow lab members, Say Young Kim, Lesley Sand, & Brandee Feola, and my fellow NACS students, particularly Katie Willis & Susan Teubner-Rhodes. I am thankful to you all for your boundless camaraderie, your celebrations when p < .05 and chocolatey condolences when p = ns, and your bright hearts.

My education and intellectual path are as much products of my upbringing as they are of my own efforts. A lifetime of family influences is too much to express pithily, so I must distill my gratitude to its cores: I am thankful to my mother, for the Freeman traditions of curiosity and creativity; to my father, for the Jackson heritage of reverence for words and learning; and to my brother, for encouraging me in everything I've ever done.

And lastly, I have the pleasure of acknowledging my best friend and loving husband, Andrew. I am so thankful for your support, patience, encouragement, conversation, compassion, reflection, your brilliant mind, your good humor, your strength of character, and the deliberate way that you live. I am so fortunate to have you as a tremendous force of love and joy in my life.

Table of Contents

Dedication
Acknowledgements
Table of Contents
List of Tables
List of Figures
Chapter 1: Introduction
  1.1 Overview
    1.1.1 Three major stages of the present study
  1.2 Major theoretical issues
    1.2.1 Language representation and language models
    1.2.3 The utility of computational models in brain research
    1.2.4 Psychological and neurological plausibility of language models
  1.3 Research questions
    1.3.1 Can GOLD predict behavioral data?
    1.3.2 Can GOLD predict neural data?
    1.3.3 Can GOLD's predictions outperform other models?
Chapter 2: Literature review
  2.1 Distributional models
    2.1.1 Introduction
    2.1.2 Psychological plausibility of distributional models
    2.1.3 Existing distributional models and their applications
  2.2 Graph models
    2.2.1 Introduction
    2.2.2 Existing graph models
    2.2.3 Psychological and/or neurological plausibility of graph models
  2.3 Event-related potentials
    2.3.1 Introduction
    2.3.2 The n400 component
  2.4 Machine learning
    2.4.1 Introduction
    2.4.2 Types of algorithms
    2.4.3 Psychological/neurological plausibility
  2.5 Summary
Chapter 3: Methods
  3.1 GOLD model
    3.1.1 Introduction
    3.1.2 Corpus
    3.1.3 Preprocessing
    3.1.4 Constructing the graph
    3.1.5 Normalization
    3.1.6 Similarity and association metrics
  3.2 Latent semantic analysis (LSA)
  3.3 Machine learning
  3.4 Summary
Chapter 4: Experiment 1 (behavioral data)
  4.1 Stimuli
    4.1.1 For human subjects in Experiment 1a and Experiment 2
    4.1.2 For model predictions in Experiment 1b
  4.2 Participants (1a)
  4.3 Procedure (1a)
  4.4 Data analysis (1a)
  4.5 Results
    4.5.1 Ratings (1a)
    4.6.1 Word pair categories (1b)
Chapter 5: Experiment 2 (neural data)
  5.1 Participants
  5.2 Procedure
    5.2.1 ERP collection and preprocessing
    5.2.2 Features for machine learning
  5.3 Results
    5.3.1 ERP visualizations and sanity checks
    5.3.2 Model predictions of ERP voltage
Chapter 6: Discussion
  6.1 Model performance
  6.2 Word relationships
  6.3 Benefits of computational models
  6.4 Graphs as models of language
  6.5 Individual differences
  6.6 Future research
    6.6.1 Language
    6.6.2 GOLD
    6.6.3 Individual differences
    6.6.4 ERP
    6.6.5 Extensions
  6.7 Conclusion
Appendix A. GOLD metrics
Appendix B. ERP participant assessment results
Appendix C. ERP prediction performance
Appendix D. Stimuli for ratings and ERP study
Appendix E. Stimuli and stimuli parameters
References

List of Tables

Table 1. Regressor performance on similarity and association ratings.
Table 2. Classifier performance on the Plaut and Booth (2000) word pairs.
Table 3. Classifier confusion matrices for the Plaut and Booth (2000) word pairs.
Table 4. Classifier performance on the Chiarello et al. (1990) word pairs.
Table 5. Classifier confusion matrices for the Chiarello et al. (1990) word pairs.
Table 6. Regressor performance on voltage at Pz, 300-500 ms.
Table 7. Correlations between metrics and ERP measures.
Table 8. Weight normalization methods.
Table 9. ERP participant assessment results.
Table 10. Correlations between models and predictions, 20 iterations of 70/30 train/test.
Table 11. Stimuli and stimuli parameters for ratings and ERP.
Table 12. Word pairs from Chiarello et al. (1990).
Table 13. Word pairs from Plaut and Booth (2000).

List of Figures

Figure 1. First-order associates of grumpy-cat.
Figure 2. First-order associates of sushi-octopus.
Figure 3. A simplified graph of grumpy-cat.
Figure 4. Similarity predictions from one train/test iteration using a random forest trained on smallGOLD.
Figure 5. Association predictions from one train/test iteration using a random forest trained on smallGOLD.
Figure 6. Trial template in the ERP task.
Figure 7. First and second words of the word pairs, sorted by participant response.
Figure 8. Second words of the word pairs, sorted into high and low similarity and association ratings.
Figure 9. ERPs sorted by association ratings in six ordered bins.
Figure 10. ERPs sorted by similarity ratings in six ordered bins.
Figure 11. Main effect of association (lowest-highest).
Figure 12. Main effect of similarity (lowest-highest).
Figure 13. Chiarello et al. (1990) words vs. lowest rated words.
Figure 14. Main effect of association (associated-unrelated).
Figure 15. Main effect of similarity (similar-unrelated).
Figure 16. Main effect of similarity and association (both-unrelated).
Figure 17. Interaction between association and similarity (associated-similar).
Figure 18. Interaction between association and both (associated-both).
Figure 19. Interaction between similarity and both (similar-both).

Chapter 1: Introduction

1.1 Overview

The present study aims to develop a computational model of language that uses graphs and graph algorithms, is structured in a psychologically and/or neurologically plausible manner, and may be used to predict behavioral and neural data from language tasks. This chapter will describe how the study will progress, present relevant major theoretical issues, and summarize the research questions at hand.

1.1.1 Three major stages of the present study

The first stage is the construction of a graph-structured semantic space model, herein referred to as GOLD (Graph Of Language Distribution). GOLD will be constructed based on lexical co-occurrence within a large corpus of natural language.

The second stage is the extraction of relatedness metrics from GOLD. Metrics of word relationships will be derived from the word graph in a theoretically informed manner, such that the metrics reflect theoretical conceptions of word meaning and word relationships. This theory-driven approach will extract specific properties of the graph that correspond to theoretical constructs and use these properties to construct a variety of metrics.

The third stage of this study will comprise behavioral and neuroimaging tasks that will provide data with which to test GOLD's metrics from stage two. Specifically, the analyses of the third stage will predict (a) human ratings of word relationships and (b) neural activity in a semantic relatedness judgment task, and compare GOLD's predictive performance to that of certain existing models. Machine learning techniques will be used to discover predictive properties of the GOLD metrics; if GOLD is successful, subsequent examination may be warranted to determine whether the discovered properties may further inform theory.

1.2 Major theoretical issues

1.2.1 Language representation and language models

A central question in the study of language in cognitive science is how word meaning is represented in the mind and brain. There is strong evidence that the meanings of words are learned from context (Bolger, Balass, Landen, & Perfetti, 2008a), and later reconstructed ad hoc when meaning retrieval is necessary (Burgess & Lund, 1998; Kintsch & Mangalath, 2011). A class of computational models called "distributional models" (discussed in Chapter 2) may be congruent with these properties of word meaning, as these models are constructed based on co-occurrence of words within a large collection of contexts, and relationships among words in the model may be extracted later. As such, these models mirror the general form of word meaning acquisition, representation, and usage as conceptualized in human language processing.

Different types of relationships between words may be considered within distributional models of language (Budanitsky & Hirst, 2005; Utsumi, 2010) and may be mathematically defined within a model (Weeds, Weir, & McCarthy, 2004). The present study will consider two different types of relationship: semantic similarity, referring to the degree of overlap in meaning features (e.g. cat and feline are highly similar, while cat and blobby are not), and associative relatedness1, referring to co-occurrence of words in contexts (e.g.
question and ask are highly associated, while question and query are not) (Budanitsky & Hirst, 2006; Kolb, 2006; Landauer & Dumais, 1997; Lund & Burgess, 1996). Distributional data may be able to capture both (Weeds & Weir, 2005), from the hypothesis that words that are similar in meaning may occur in the same role in similar contexts, while words that are associated may occur nearby. The first aim of this dissertation is to test whether GOLD can provide support for this hypothesis by calculating association from raw co-occurrence and calculating similarity from shared patterns of connectivity between two words (Lund, Burgess, & Atchley, 1995), such that two words that are connected to the same community of words with similarly weighted connections are more similar.

It has been suggested that associative relatedness and semantic similarity are separate entities supported by separate networks of word representations, while others have proposed a single mechanism of representation that can give rise to both of these relationship types (see Hutchison, 2003, for a review). However, association and similarity are not easily dissociable: words that are associated are likely to be semantically similar to some degree, and words that are semantically similar often co-occur (Deyne & Storms, 2008; Hutchison, 2003). Thus, it is difficult to argue that a particular effect arises from one relationship type or the other, as the relationships so often overlap. The present study uses a different approach: if GOLD can successfully differentiate between similarity and association, then this would suggest that the information necessary to identify these two relationship types must be present in the single mechanism of co-occurrence.

1 This concept is referred to by a variety of names, including semantic relatedness, association, associative relatedness, and lexical similarity. For clarity, the present study will use the phrase "associative relatedness" or "association".
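To make the two hypothesized ingredients concrete, the sketch below is illustrative only: the toy graph, edge weights, and function names are assumptions for demonstration and are not the GOLD implementation described in Chapter 3. It computes association as the weight of the direct edge between two words, and similarity as the weighted overlap of their first-order neighborhoods.

```python
# Illustrative sketch only: a toy weighted co-occurrence graph as a dict of
# dicts. Edge weights stand in for co-occurrence counts (made-up values).
cooc = {
    "cat":      {"grumpy": 6, "feline": 2, "purr": 4, "vet": 3},
    "feline":   {"cat": 2, "purr": 3, "vet": 2},
    "ask":      {"question": 9, "answer": 5},
    "question": {"ask": 9, "answer": 6, "query": 1},
}

def association(graph, w1, w2):
    """First-order relatedness: the raw weight of the direct edge w1-w2."""
    return graph.get(w1, {}).get(w2, 0)

def similarity(graph, w1, w2):
    """Second-order relatedness: overlap of the two words' first-order
    neighborhoods, weighted by connection strength (a cosine over the
    neighbor-weight vectors)."""
    n1, n2 = graph.get(w1, {}), graph.get(w2, {})
    shared = set(n1) & set(n2)
    dot = sum(n1[w] * n2[w] for w in shared)
    norm1 = sum(v * v for v in n1.values()) ** 0.5
    norm2 = sum(v * v for v in n2.values()) ** 0.5
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(association(cooc, "question", "ask"))   # strong direct co-occurrence
print(similarity(cooc, "cat", "feline"))      # shared neighbors: purr, vet
```

In this toy graph, question-ask is related chiefly through a strong direct edge, while cat-feline is related chiefly through shared neighbors; the hypothesis above is that GOLD's metrics, presented in Chapter 3, can be built from these same two ingredients.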
1.2.3 The utility of computational models in brain research

A variety of computational models have been proposed that describe semantic processing of language, including acquisition of word meaning, semantic organization, and word use. These semantic models generally process a corpus of text and produce a model that represents some set of relationships among words. Some semantic models require pre-existing human analysis to specify relationships among words or concepts (e.g. WordNet, Roget's thesaurus, or Wikipedia), while others only encode those relationships that can be extracted by automated means (like distribution and co-occurrence). Specific semantic models will be reviewed in the next chapter.

Semantic models may be used for theoretical aims or for real-world applications: to judge relationships between words, like semantic distance or synonymy (Landauer, Foltz, & Laham, 1998); to make predictions of lexical items or phrases, like what word is likely to follow an existing sequence or what word a writer intended to write and instead misspelled (e.g. Islam & Inkpen, 2008); to classify input, like sorting sets of text by likely author (e.g. Burrows & Tahaghoghi, 2007); to assess the relatedness of semantic content in a student's writing to gauge how well a concept is understood (e.g. Kakkonen, Myller, Timonen, & Sutinen, 2005); and many other tasks.

In light of these real-world applications, there have been concerns that these computational models are "tools" rather than valid psychological models, and that while they are useful feats of engineering, they are bankrupt theoretically (Chomsky, keynote panel, 2011). It has been argued that this is not the case (Norvig, 2011), for several reasons. Firstly, computational models are constructed based on theories of language acquisition and organization; the success of a model constructed based on a particular theory constitutes support for that theory. Secondly, computational models are typically quite parsimonious, as they are implemented manually based on a limited set of assumptions or parameters. Thirdly, computational models tend to make predictions that are well-quantified and falsifiable, which is not always the case in non-computational language models (e.g. complaints against Chomsky's theory of Universal Grammar: Piattelli-Palmarini, 1980). Lastly, a major benefit of implementing a language model in a computer is that its functioning is entirely transparent. In a computational architecture, it is known exactly what information is available to a model and what the model does with that information in order to be successful, so it is easier to draw conclusions about language processing's reliance on that information. For example, as will be discussed in subsequent chapters, many models that use only co-occurrence of words within documents have been successful at mimicking human performance on certain tasks. This success is evidence that statistical co-occurrence alone carries sufficient information to perform these tasks. However, these models fail on other tasks (e.g. Burgess, 2000; Wiemer-Hastings, 2000), which indicates that some other information beyond co-occurrence is necessary to complete those tasks. Assessing how models achieve, or fail to achieve, their stated goals can thus further inform theory about what information the mind may use or how it may be organized.

1.2.4 Psychological and neurological plausibility of language models

The prominent distributional models such as HAL and LSA are vector space models in which words or contexts are represented as vectors in multidimensional space. Due to the vast number of words and contexts, the immensity of the vector space is necessarily reduced using an algorithm known as singular value decomposition (SVD). While highly effective as a computational tool, it is questionable whether such a process plausibly reflects a psychological process (Jones & Mewhort, 2007; Kwantes, 2005; Steyvers & Tenenbaum, 2005). It should be noted that a variety of work has explored neurally plausible implementations of complex mathematical processes, including arithmetic and more complex nonlinear computations in individual neurons (see Silver, 2010, for a review), convolution (Blouw & Eliasmith, 2003), and Fourier transforms (Velik, 2008), so it is not necessarily the case that computational models that rely on processes such as SVD can be ruled out as viable explanations of human semantic processing. However, alternatives that profess greater plausibility have been developed using episodic memory models (Kwantes, 2005), neural network models (Plaut & Booth, 2000; Rohde, Gonnerman, & Plaut, 2005), and graph models (Collins-Thompson & Callan, 2007; Steyvers & Tenenbaum, 2005). The purported plausibility of these models arises from their congruence with cognitive theories, their model assumptions, the more ready interpretation of their calculations, and the types of information contained within the representations.
Graph models in particular are consistent with an instance-based learning framework of word learning (Bolger et al., 2008; van Daalen-Kapteijns & Elshout-Mohr, 2001; Fukkink, Blok, & de Glopper, 2001; Jenkins, Stein, & Wysocki, 1984), in which episodic traces representing individual exposures to a word are accessible, but information derived from larger patterns of co-occurrence is also available. This aspect of graphs will be discussed in more detail in Chapter 2.

1.3 Research questions

The present study seeks to establish the utility of the GOLD model in predicting behavioral performance and neural activity underlying word processing. If GOLD is found to be effective, subsequent research can specify the source(s) of its predictive power. The research questions of the present study will focus on evaluating the quality of the GOLD model, and exploring what may be learned from its performance on a small suite of tasks, rather than which specific parameters of GOLD influence its performance. Each of the following three sections will introduce a finding or set of findings that GOLD is expected to replicate or outperform.

1.3.1 Can GOLD predict behavioral data?

GOLD will be used to predict human ratings of association and similarity of word pairs. GOLD is intended to capture the information necessary to judge relationships of both association and similarity from co-occurrence data. Accordingly, using theoretically informed metrics of similarity and association, GOLD is hypothesized to predict both association and similarity ratings, as well as classify words based on their relationship type. These predictions, if successful, will provide some indication that the corpus is reasonable and that the methods of calculating relationships are appropriate.

1.3.2 Can GOLD predict neural data?

A specific feature of event-related potentials (ERPs) called the n400 (discussed in Chapter 2) is elicited in response to language. The n400 effect has been consistently found to be modulated by the strength of the relationship between words, such that greater relation between words in a pair produces a smaller n400 effect. Furthermore, the specific relationship types of similarity and/or association of word pairs have been shown to produce differential n400 effects (e.g. Koivisto & Revonsuo, 2001). Using similarity metrics derived from theoretical formulations of word meaning, combined with machine learning algorithms, GOLD is hypothesized to predict the size of the n400 effect elicited in response to a variety of stimuli.

1.3.3 Can GOLD's predictions outperform other models?

LSA (Landauer, Laham, & Foltz, 1997) has been used to predict amplitudes in similar electrophysiology tasks (e.g. Parviz, Johnson, Johnson, & Brock, 2011). GOLD's performance on the prediction task will be compared to LSA to determine whether GOLD is an improvement on this commonly used and broadly successful model. It is hypothesized that GOLD will outperform LSA due to GOLD's maintenance of full model dimensionality, its theory-informed similarity metrics, and its consistency with well-supported psychological theory.

Chapter 2: Literature review

This chapter aims to review relevant literature in several fields: distributional models in general, graph models in particular, event-related potentials, and machine learning. It is worth noting here that this literature review is ultimately from a perspective of what can be learned about language.
Accordingly, the computer science and machine learning literatures are reviewed to the degree necessary to clarify the methods used in the present study, and are not comprehensively covered.

2.1 Distributional models

2.1.1 Introduction

The distributional hypothesis (Firth, 1957; McDonald & Ramscar, 2000) states that the meanings of words are related to or inferred from how words co-occur with other words in an entire corpus of contexts: if a word occurs in similar contexts as another word, then the two words should have similar meanings. The distributional hypothesis is notable in that it asserts no role of syntax, thematic organization, or even word order in inferring word meaning: the distribution of words in contexts alone is sufficient to construct their meaning. The following sections will discuss the psychological plausibility of this type of computational model, existing distributional models and their uses, and various parameters that change distributional models' utility.

2.1.2 Psychological plausibility of distributional models

Distributional models account for a wide range of behavioral findings and are strongly rooted in theory. This section will discuss two major well-supported theoretical bases of semantics that are both transparently reflected in distributional models: (1) that meaning is dynamic as well as context-constrained, and (2) that learning occurs incrementally from context.

There is plentiful evidence that the meanings of words are learned primarily from context (Fukkink et al., 2001; Swanborn & de Glopper, 1999, 2002; van Daalen-Kapteijns, Elshout-Mohr, & de Glopper, 2001), that the meanings of words are fluid and dynamic (Bolger, Balass, Landen, & Perfetti, 2008; Kintsch & Mangalath, 2011), and that they depend heavily on context rather than formal definitions (Barsalou, 1987; Rogers & McClelland, 2011). Conceptually speaking, rather than looking up the meanings of words in a mental "dictionary" when words are encountered, the meanings of words are constructed ad hoc in a contextually constrained manner (Burgess & Lund, 1998). Contextually relevant meanings of words are problematic for certain other types of models, such as cognitive models of semantic knowledge that specify features or categorical organization (e.g. Mervis & Rosch, 1981), as category models can't easily account for context constraints (Rogers & McClelland, 2011). Distributional models can, as words may co-occur with other words that belong to disparate inter-connected groups that reflect different meanings.

Behavioral evidence suggests that, while acquiring meanings of novel words, learners gradually extract abstract meaning from successive exposures, while also maintaining non-abstract associations from each individual exposure (e.g. van Daalen-Kapteijns, Elshout-Mohr, & de Glopper, 2001). The process of acquiring meaning gradually, through exposure to context, is formalized in the incremental learning hypothesis (Bolger et al., 2008; Fukkink et al., 2001). In a distributional framework, on exposure to a word within a context, a "connection" between that word and each other word in the context is entered into the computational model. The unreduced distributional model thus represents the entire history of the learner's instances of exposure to language.
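As an illustration of this incremental accumulation, the following is a minimal sketch, not the GOLD implementation: the tokenizer, window size, and example sentences are placeholder assumptions. Each new context simply adds weight to the connections between a word and its neighbors, so the graph grows as a running record of exposures.

```python
from collections import defaultdict

# Toy co-occurrence graph: graph[w1][w2] accumulates how often w2 appeared
# within `window` words of w1. Each call to add_context() is one exposure.
graph = defaultdict(lambda: defaultdict(int))

def add_context(graph, text, window=10):
    tokens = text.lower().split()          # placeholder preprocessing
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                graph[target][tokens[j]] += 1   # one increment per exposure

# Successive exposures simply add weight; nothing is transformed away.
add_context(graph, "the grumpy cat ignored the vet")
add_context(graph, "a grumpy cat will not purr for the vet")
print(dict(graph["cat"]))   # weights reflect the full history of contexts
```

The same sliding-window counting, with the window set to the sentence, paragraph, or whole document, is the mechanism at issue in the discussion of context and window size in section 2.1.3.2 below.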
In human learners acquiring word meanings, a small number of exposures to a novel word leads to word knowledge that is weak and changeable (van Daalen-Kapteijns & Elshout-Mohr, 1981), and exposure to novel words in uninformative contexts leads to word knowledge that is weak or inaccurate (Frishkoff, Collins-Thompson, Perfetti, & Callan, 2008; Frishkoff, Perfetti, & Collins-Thompson, 2010). In a distributional model, frequency and informativeness of exposures are both encoded: words that have been viewed infrequently or within nonspecific or generic contexts have weak connections that can be numerically overshadowed by co-occurrence with other, more informative words or by future exposures. Furthermore, definitional meaning is not stored in a qualitatively distinct system; rather, experiences of ostension are represented as instances or contextual episodes in distributional models. In such models, the core set of abstract meaning features is represented as the pattern of most frequent associates of that word. These benefits are discussed at length with respect to the HAL model (Lund & Burgess, 1996), which does not reduce the dimensionality of its representations2 and thus maintains all of the "memory traces" of language exposure that lead to its structure.

2 Some variants of the HAL model do use dimensionality reduction methods, including discarding low-variance columns and multidimensional scaling algorithms (e.g. Lund, Burgess, & Atchley, 1995); it is reported that performance is equivalent between full- and reduced-dimensionality versions of the model.

2.1.3 Existing distributional models and their applications

2.1.3.1 Introduction

A wide variety of computational models have been developed using distributional bases, such as LSA (Landauer & Dumais, 1997; Landauer et al., 1998), HAL (Lund & Burgess, 1996), COALS (Rohde et al., 2005), SOC-PMI (Islam & Inkpen, 2008), and many other variants. These distributional models have met with success at a variety of tasks ranging from synonymy judgment to essay grading (Kakkonen et al., 2005), indicating that the information contained just within distributions of words is sufficient to meet a surprising range of language-related goals. However, certain models that have incorporated syntactic, thematic, or other information (Kakkonen, Myller, & Sutinen, 2006; Padó & Lapata, 2006) or that have combined distributional models with other sources of information structure such as Wikipedia or WordNet (Agirre et al., 2009; Strube & Ponzetto, 2006) have improved on the performance of strictly distributional models in certain tasks, confirming that there is, unsurprisingly, more to language than just distribution. While distribution-only models may not reach peak performance compared to models supplemented with other information, they do possess a major advantage: models that rely only on distribution can be fully automated, and thus can be reconstructed on arbitrary corpora with no additional human effort. Automation is a terrifically attractive characteristic when considering language, a system with a vocabulary of many hundreds of thousands of words and infinite generativity (Hauser, Chomsky, & Fitch, 2002).
Accordingly, distributional models are a fruitful area of research and have been found to succeed at a wide range of tasks with real-world applications, such as grading student responses to a training program (Magliano & Graesser, 2012), synonym generation (Inkpen, 2007), scoring definitions (Collins-Thompson & Callan, 2007), authorship attribution (Burrows & Tahaghoghi, 2007), and so on.

It is worthwhile to note that computational language models relying only on co-occurrence are not intended to model the full extent of language. Some models account for other features, such as word order (e.g. Blouw & Eliasmith, 2003; Jones & Mewhort, 2007), but the majority are "bag of words" models that discard syntactic information, and thus are incapable of making distinctions in meaning that rely on syntax, word order, or other features that are not represented in co-occurrence. Furthermore, these models are not intended to comprehend language in the sense of grounding semantic meaning in situational information (Kintsch & van Dijk, 1978). Rather, these models operate at an earlier level of comprehension (Barsalou, Santos, Simmons, & Wilson, 2008) that enables early lexical semantic processing in comprehension and word learning. Approaches that do account for structure in language, whether syntactic or conceptual or otherwise, are profoundly valuable in the study of semantic knowledge and language, but tend to address different classes of questions than corpus-based models that rely on statistical features of language context to model relationships between units of language (Griffiths, Steyvers, & Tenenbaum, 2007).

2.1.3.2 The role of "context" in distributional models

The distributional hypothesis asserts that the meanings of words are learned based on other words that co-occur in a context (McDonald & Ramscar, 2000), but it does not specify what, exactly, "context" means. It may be the case that "context" means something different in written than in spoken language. In a face-to-face conversational situation, context is not limited to the precise contents of speech and may include such factors as physical, social, and intellectual attributes of the speakers, previous topics discussed by the speakers, prosody, and so on. It may be the case that all of these contextual cues are relevant in interpreting or constructing (Kintsch & Mangalath, 2011) the meaning of an utterance. However, in developing semantic space models, context is assumed to be limited to the words present in the current text.

In semantic space models, words count as co-occurring with a target word if they fall within some "window" of words around the target word in a text. Models may use several sizes of windows: some use the "document" as the smallest organizational unit and link every word in a document to every other word (e.g. LSA: Landauer, Foltz, & Laham, 1998); others use some smaller value (e.g. ten words before and after the target word: Lund & Burgess, 1996). These models typically slide the window over the entire document, counting co-occurrences with the target word in the center of each window until the end of the document is reached. The role of window size in model performance has been assessed (e.g. Bullinaria & Levy, 2012), with the general finding that increasing window size produces worse performance. However, this analysis was carried out using models that collapse the dimensionality of the represented corpus; it is unclear whether this finding will apply to models that preserve dimensionality (dimensionality is discussed below).

Naturalistic texts provide additional meaningful units of organization beyond the "document", namely the sentence and the paragraph. There is evidence that these organizational units are reflected to some degree in a reader's processing of the text (e.g. Goldman, Hogaboam, Bell, & Perfetti, 1980; Ledoux, Camblin, Swaab, & Gordon, 2006; Shanahan, Kamil, & Tobin, 1982).

2.1.3.3 The role of corpus size and selection in distributional models

Selecting an insufficiently large corpus carries two risks: first, that a word may not be represented at all in the corpus, and second, that all of the senses of the word may not be represented in the corpus. What constitutes a "large" corpus has varied dramatically over the years: versions of LSA by 1997 used "very large numbers of words" in the range of 20-70k (Landauer et al., 1997); early HAL models (Lund & Burgess, 1996) used 160 million words from USENET; and HiDEx, a later HAL-type model, used a one billion word corpus from USENET (Shaoul & Westbury, 2010), in part because a 160 million word subset did not include every word from their 50,000-word lexicon. If a corpus contains no instances of a word, then clearly that word is not represented and cannot be processed using the resulting model; if a corpus contains very few instances of a word, it is unlikely that those instances span all possible senses in which the word may be used. As English is rife with polysemy (84% of words examined in Rodd, Gaskell, & Marslen-Wilson, 2004), a small corpus might be expected to exclude alternate meanings or uses of a huge number of words. Hence, larger corpora should be more likely to capture the variance with which words are used, not only increasing the range of associations but also allowing the model to encounter words with multiple meanings in many different contexts.

A small corpus also risks insufficient representation of domain-specific terms. For example, while CPU and RAM have specific meanings whose differences are vital to the workings of computers, LSA-type models judge the two terms to be highly similar, in some cases maximally similar (Wiemer-Hastings, 2000). Both occur in a specific domain (a computer's hardware), and either the limited corpus or the dimensionality reduction eliminated the fine distinctions between the two terms.

It may be valuable from a perspective of ecological validity to construct models that mimic human experience, but many existing models use corpus sizes that do not reflect the size or range of realistic language input to a developing human. It is difficult to estimate how many words a person hears and reads over the course of a lifetime, but a lower bound may be estimated using the Human Speechome Project3, which recorded the in-home audiovisual environment of a child from infancy to age three. A subset of the recordings has been transcribed, yielding a set of 7 million (total, non-unique) words to which the child was exposed4. Considering that not all of the recordings had been transcribed, and that the entire dataset represents only three years of exposure to speech and minimal exposure to written text, it seems safe to place a (very) conservative lower bound of exposure to language at 7 million words.
A more appropriate lower bound estimate would scale this figure by age, such that an 18-year-old would have heard six times more than a 3-year-old, leading to a figure of 42 million words; this figure accounts only for spoken, and not written, words. In either case, theoretically, corpus sizes on the order of millions of words would be more ecologically valid than smaller corpora.

3 http://www.media.mit.edu/cogmac/projects/hsp.html
4 http://www.ted.com/talks/deb_roy_the_birth_of_a_word.html

From a data-driven standpoint, there is strong evidence that vastly increasing the size of a corpus can lead to increased success using a distributional model (e.g. Chelba, Bikel, Shugrina, Nguyen, & Kumar, 2012; Dean et al., 2012). Some studies have found diminishing returns beyond some threshold size (90 million words, in Bullinaria & Levy, 2007), while others have found unbounded benefits at larger corpus sizes (2 billion words, in Bullinaria & Levy, 2012). The utility of larger corpora may also depend on the measure in question: there is evidence that simply increasing the size of the input corpora can dramatically improve performance at certain automated tasks, especially if the corpus comprises unlabeled data (Dumais, Banko, Brill, Lin, & Ng, 2002; Recchia & Jones, 2009). Whether or not more data will improve performance in the present model is a directly testable question, as the data are collected and then stored in units of documents, and thus document sets of varying size may be tested in the same way and their performance compared. Addressing this question is beyond the scope of the present study, but may be addressed in future work.

2.1.3.4 Manually annotated taxonomies

A number of studies have examined the utility of word relationships that have been manually defined or organized, such as dictionaries, thesauruses, and knowledgebases like Wikipedia or WordNet (Miller, 1995). Budanitsky and Hirst (2005) reviewed a variety of human-organized knowledge bases (e.g. Roget's Thesaurus, WordNet (Miller, 1995), MeSH5) and compared the performance of various similarity metrics trained on WordNet's human-annotated data; a variety of other works have used knowledgebases entirely, or in combination with language distributions, to complete language tasks (e.g. Agirre et al., 2009; Gabrilovich & Markovitch, 2007; Jarmasz, 2003; Li, Sun, & Datta, 2011; Mihalcea, Corley, & Strapparava, 2005; Strube & Ponzetto, 2006). These models typically perform very well, which is one of many arguments to be made in support of manually constructed knowledgebases. However, human-annotated models suffer from the general limitations of (a) the enormous amount of time required to annotate or organize the data, (b) the restriction that only data that has been preprocessed in this resource-intensive manner can be used by the model, and (c) the assumption that the structure of meaning in language is both static and predefined. These models require a correct, precise taxonomy of terms and concepts, which depends on extensive and accurate human effort. In contrast, an automated system lacks the additional information that is provided by human judgment, but is cheaper, faster, and much less limited in scope. Another major drawback of human-annotated corpora is that the model is "frozen" in the historical period in which the model was made, and cannot incorporate novel uses of language without massive human effort. It is an often-lamented reality that language is continually evolving (e.g. Dorogovtsev & Mendes, 2001; Scheel, 1998).
A human-annotated model generally only captures a "snapshot" of a language, while an automated processor can track evolving language use in a community on a much shorter timescale than the years it takes to complete a project on the scale of WordNet.

5 http://www.ncbi.nlm.nih.gov/mesh

2.1.3.5 Model dimensionality

Natural language is vast. The OED contains 600,000 unique words6, while the Google Books project has estimated that English contains over a million unique words (Michel et al., 2011). Given the enormous size of the vocabulary, much less the possible combinations of multiple words into phrases, maintaining the full dimensionality of a language-derived space has traditionally been difficult.

6 http://public.oed.com/about/

Some models maintain most of the dimensionality of the semantic space, notably the HAL model (Lund & Burgess, 1996), which performs well at extracting both similarity and association, as well as at additional tasks such as categorization. Many existing models do collapse across dimensions using procedures like singular value decomposition (in LSA; Landauer et al., 1997) or various approaches that discard dimensions based on their variance (Lund, Burgess, & Atchley, 1995) to yield a much more manageable computational space; however, these reduced dimensions (a) do not map directly to concepts or words, and (b) necessarily minimize the salience of less dominant meanings of words. Some have argued that the real dimensionality of the human semantic space is very small (Lowe, 2000), and thus that dimensionality reduction accurately reflects human semantic processing. However, compression/reduction methods like SVD have been found to distinguish poorly among near-synonyms (Wang & Hirst, 2010) or among multiple meanings of words (Lee, Baker, Song, & Wetherbe, 2010). These findings indicate that, from a data-driven perspective, higher-dimensional representations may be necessary for at least some tasks of language use.
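To make the contrast concrete, the sketch below shows the kind of truncated SVD step that LSA-style models apply to a word-by-context count matrix. It is illustrative only: the tiny matrix and the choice of k are placeholder assumptions, not values from LSA or from the present study.

```python
import numpy as np

# Toy word-by-context co-occurrence counts (rows = words, columns = contexts).
# Real LSA matrices have tens of thousands of rows and columns.
words = ["cat", "feline", "cpu", "ram"]
counts = np.array([
    [4, 3, 0, 0, 1],
    [3, 2, 0, 0, 0],
    [0, 0, 5, 4, 2],
    [0, 0, 4, 5, 1],
], dtype=float)

# Full SVD, then keep only the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2                                   # placeholder reduced dimensionality
reduced = U[:, :k] * s[:k]              # each word becomes a k-dimensional vector

for w, vec in zip(words, reduced):
    print(w, np.round(vec, 2))
# The k retained columns are abstract "latent" dimensions: they no longer
# correspond to any single context, and minor senses contribute little
# to the retained axes.
```

A full graph model skips this step entirely, so every dimension remains an identifiable word or context; this is the property developed in the graph-model sections that follow.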
2.1.3.6 Word frequencies

Lastly, this method of model construction also produces word frequency counts. Word frequencies are strong predictors of reaction time in a wide variety of reading tasks; accordingly, the accuracy of the model of language from which word frequencies are derived is critical (Burgess & Livesay, 1998). The word frequency counts expected from this internet-based corpus may more accurately reflect the language experience of participants than many existing word frequency databases. Consider that the word pizza has the same frequency as scrutiny in the American National Corpus7 and advocate in the BYU Contemporary American corpus8, and it doesn't even appear in the Brown corpus (Wilson, 1988). Given that the target population of most university studies is the infamous college sophomore, a corpus based on language generated by many users (many of whom are from a college demographic) may be a better fit for experimental uses. It has been found (Burgess & Livesay, 1998) that a larger and more recent set of frequencies (from the HAL corpus: Lund & Burgess, 1996) more strongly predicted medium-to-low frequency words than the Brown corpus did. High-frequency words in a language are less likely to change or be replaced by new words over time (Pagel, Atkinson, & Meade, 2007), which may explain why the older Brown corpus predicted reaction times to high-frequency words as well as the newer corpus did. Accordingly, a corpus that reflects realistic, conversational word frequencies, and that can be updated automatically to reflect changing language, may be ideally suited to experimental use.

7 http://www.anc.org/frequency.html
8 http://corpus.byu.edu/coca/

2.2 Graph models

2.2.1 Introduction

The majority of the models discussed in the preceding section are vector space models in which words or sets of words are represented as vectors in a dimension-reduced space. Far fewer researchers have used a graph theory approach to constructing models based on the distributional hypothesis, though these models are rapidly gaining traction (Radev & Mihalcea, 2008). This section will introduce graphs and discuss some graph models that have met with success in previous research.

Graphs are methods of representing data and relationships among data using "nodes" and "edges" or "connections". Connections between nodes have an associated number referred to as "weight". In the case of a graph model of language, each node may represent a word or a document, and the weight of a connection between two nodes may represent proximity or frequency of co-occurrence. A possible benefit of graph models of language is that the data are not necessarily collapsed or reduced, though reduction is possible. Instead of singular value decomposition (SVD) or similar algorithms needed for high-dimensionality models, reduction of complexity in graphs may be executed using clustering, by collapsing clusters of nodes into supernodes that could be described as latent concepts, by directly collapsing synonyms, or by pruning nodes or connections based on weights, frequencies, or other properties.

2.2.2 Existing graph models

Graph models that have been used in the literature have varied widely in the target tasks and algorithms employed. Previous research has addressed the task of identifying category exemplars using an algorithm that considered each new exemplar candidate's connectivity to previously identified exemplars (Widdows & Dorow, 2002); gauged document similarity using a type of sub-graph comparison that compared the entirety of the documents rather than considering individual terms (Tsang & Stevenson, 2010); and identified "communities" corresponding to word senses using clique analysis, an algorithm commonly applied to social networks (Palla, Derényi, Farkas, & Vicsek, 2005). The MESA model (Collins-Thompson & Callan, 2007) used random walk Markov chains through a graph whose connections represented several different types of word relationships to judge the quality of word definitions, while Hughes and Ramage (2007) used random walk Markov chains on graphs based on WordNet relationships to judge semantic similarity of word pairs. The consistent feature of these studies is that each exploits graph-specific properties of the model and graph analysis algorithms to address its chosen task.

The combination of graph models with machine learning approaches has also been successful at various language tasks. Machine learning algorithms may be used to find patterns in existing data, and to use those patterns to predict characteristics of new data. This approach may be particularly useful when the model produces or contains a great deal of information, but it is not clear precisely how that information should be combined or reduced to a final prediction. Minkov and Cohen (2008) combined a graph theoretic approach with machine learning techniques to learn a similarity metric with a graph walk algorithm. Silva and Amancio (2013) used specific types of graph traversal with machine learning classifiers to perform word sense disambiguation. The combination of graph theory and machine learning may be fruitful, as graph analysis algorithms may extract information from the word graph that can then be used as inputs to the machine learning algorithm.
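As a concrete illustration of the kind of graph-derived feature such a pipeline might use, the sketch below estimates relatedness as the probability of reaching one word from another through weighted co-occurrence edges. It is a sketch only, not the MESA, Hughes and Ramage, or GOLD procedure; the toy graph, step count, and number of walks are assumptions.

```python
import random

# Toy weighted co-occurrence graph (assumed values, for illustration only).
graph = {
    "question": {"ask": 9, "answer": 6, "query": 1},
    "ask":      {"question": 9, "answer": 5},
    "answer":   {"question": 6, "ask": 5},
    "query":    {"question": 1},
}

def walk_relatedness(graph, start, target, steps=3, n_walks=2000, seed=0):
    """Estimate relatedness as the fraction of short random walks from
    `start` that visit `target`, stepping along edges with probability
    proportional to edge weight."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_walks):
        node = start
        for _ in range(steps):
            neighbors = graph.get(node)
            if not neighbors:
                break
            words = list(neighbors)
            weights = [neighbors[w] for w in words]
            node = rng.choices(words, weights=weights, k=1)[0]
            if node == target:
                hits += 1
                break
    return hits / n_walks

print(walk_relatedness(graph, "query", "ask"))   # reachable only via "question"
```

Scores of this sort, alongside direct edge weights and neighborhood overlap, are the kind of graph features that could be handed to a machine learning algorithm as described above.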
2.2.3 Psychological and/or neurological plausibility of graph models

Graph models9 hold certain additional relevance to the psychological study of language, largely stemming from the fact that the dimensionality of the model is not reduced in any transformative manner. While low-frequency words or low-weight connections may be deleted from a graph model in order to reduce its computational burden, these deletions don't impact any other words or connections. Each node still represents a word and each connection still represents first-order co-occurrence. In contrast, the matrix reduction used in LSA takes a semantic space with many thousands of dimensions and reduces it to a few hundred dimensions, such that vectors within the resulting space do not correspond directly to any specific concepts (hence the "latent" in "latent semantic analysis").

9 Graphs can be represented as matrices, and thus information within a graph may still be described as vectors.

A major benefit of full graphs of co-occurrence, rather than reduced vector spaces, is that the full graph allows statistical properties of language to accrue from the episodic traces that are reflected in connection weights (Kwantes, 2005; Steyvers & Tenenbaum, 2005), grounding the graph in episodic-trace models of memory (Hintzman, 1984; Howard, Addis, Jing, & Kahana, 2005; Kwantes, 2005). Thus, maintaining full dimensionality in a graph model doesn't eliminate information as singular value decomposition does. Instead, it records the history of language exposure in a very clear way and allows for easier interpretation of model output, because nodes and edges reflect specific words and co-occurrences rather than latent meaning (Audet & Burgess, 1999; Burgess & Lund, 1997; Lund & Burgess, 1996). The ultimate output of the graph model (in this case, judgments of similarity and association) is thus extracted from the accumulation of contexts that contain the target words. This mechanism is consistent with theories of word learning, particularly the instance-based learning framework (Bolger et al., 2008), which hold that the meanings of words are learned from features that are consistently present in discourse or other contexts.

2.3 Event-related potentials

2.3.1 Introduction

The preceding sections reviewed research in language models. The success of the language model in the present study will be quantified by its ability to predict neural activity as measured by event-related potentials (ERPs). Accordingly, the following section will introduce ERPs and discuss their utility in studying language processes.

ERPs are small segments of electroencephalograph (EEG) recordings that are time-locked to the onset of stimuli and averaged over many trials to produce an averaged waveform. Averaging many trials allows a very small event-related signal to be extracted from the background noise of brain activity. Various features, referred to as components, of the time-locked waveform have been identified as reflecting particular language-related processes or experimental manipulations (Kaan, 2007; Osterhout, Kim, & Kuperberg, 2006). Several of these ERP components have been used as tools to examine various aspects of on-line processes involved in reading, among them the n400 (discussed below).
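The time-locking and averaging procedure itself is simple enough to sketch. The snippet below is illustrative only: the array shapes, sampling rate, and baseline window are placeholder assumptions, not the acquisition parameters used in Experiment 2.

```python
import numpy as np

def average_erp(eeg, onsets, srate=500, pre=0.2, post=0.8):
    """Epoch continuous EEG around stimulus onsets, baseline-correct each
    epoch, and average across trials to yield one ERP per channel.

    eeg    : array of shape (n_channels, n_samples), continuous recording
    onsets : stimulus onset times, in samples
    """
    pre_s, post_s = int(pre * srate), int(post * srate)
    epochs = []
    for onset in onsets:
        seg = eeg[:, onset - pre_s : onset + post_s]          # time-lock to onset
        baseline = seg[:, :pre_s].mean(axis=1, keepdims=True)
        epochs.append(seg - baseline)                          # baseline-correct
    # Averaging across trials cancels activity not locked to the stimulus.
    return np.mean(epochs, axis=0)

# Placeholder data: 64 channels, 60 s of noise at 500 Hz, 20 fake onsets.
rng = np.random.default_rng(0)
eeg = rng.normal(size=(64, 30000))
onsets = np.arange(1000, 21000, 1000)
erp = average_erp(eeg, onsets)
print(erp.shape)   # (64, 500): channels x timepoints of the averaged waveform
```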
There are many benefits of collecting ERP data in addition to behavioral data, notably their sensitivity. ERPs are generally considered to be more sensitive than behavioral output, for several reasons. Firstly, ERP data are high-dimensional: 64 or 128 channels and generally around a thousand timepoints per trial. While a task with a yes/no response generally only examines variance on the two metrics of reaction time and accuracy of a decision output, ERPs allow the examination of latent activity that is collapsed into the single instance of behavioral output. In the present study regarding word knowledge, if variability in knowledge or representation of a word is not large enough to produce different behavioral output, or if the variability is on a dimension that does not directly alter behavioral output on a particular task, then the variability may not be reflected in behavior. ERPs provide a sensitive measure that is often able to measure such latent variability in cognitive processes.

2.3.2 The n400 component

The n400 is a negative deflection in the EEG signal that occurs roughly 400ms after stimulus onset. This component, extensively reviewed elsewhere (Kutas & Federmeier, 2011), is commonly used as an index of semantic knowledge and integration of semantic knowledge into existing contexts. Of particular importance is that the degree of relationship between a predicted target and the actual target has been found to modulate the amplitude of the n400 (e.g. Federmeier & Kutas, 1999), and that similarity between word pairs in priming tasks shows a similar, though sometimes attenuated, effect (Perfetti, Wlotko, & Hart, 2005). Koivisto and Revonsuo (2001) found that both semantic similarity and relatedness affect n400 amplitude, but noted that related words elicited a longer-lasting n400 priming effect than the similar words. These properties make the n400 an ideal tool for investigating language processes and how word meanings are represented or manipulated in the brain.

2.4 Machine learning

2.4.1 Introduction

Machine learning (ML) uses "features", or predictors, and "examples", or instances of data from which to learn or to predict. In the present study, the output of the GOLD model will make up the features and grand average ERPs will make up the examples. Many features will be used as inputs to the ML algorithms because the literature provides no specific pre-existing hypotheses about which types of similarity calculation and/or normalization are most appropriate. It may be valuable to use feature selection, in which predictions are made using only a subset of features that have been identified as being more informative than others, particularly because many of the GOLD features will be correlated. Feature reduction often leads to better performance, except in the case where certain features predict a subset of the problem space that other features do not predict (Hall, 1999). Additionally, variables that are correlated can still add information, as long as they are not perfectly correlated (Guyon & Elisseeff, 2003). Accordingly, the present model will rely on the full set of features from GOLD as well as exploring model performance with reduced sets of features.

2.4.2 Types of algorithms

There were no a priori hypotheses regarding ML algorithms, so naïve implementations of several different algorithms were tested, including support vector machines, neural networks, random forests, and k-nearest-neighbors. Each of these algorithms is briefly introduced below.
Support vector machines (SVMs) and support vector regressors (SVRs) can identify patterns in data that are complexly related by mapping the data into a new space in which they are more simply related. Furthermore, SVMs/SVRs aim to optimize these transforms such that the space between the classes of examples is as wide as possible, which allows for better generalization. These methods are robust in the face of noisy, sparse, and/or high-dimensional data, and have been used with success in brain research (Lotte, Congedo, Lécuyer, Lamarche, & Arnaldi, 2007) and a variety of other fields.

Neural networks (Cheng & Titterington, 1994; Hopfield, 1982) are based on a very simplified model of neurons, typically modeled as layers of "neurons": an input layer, one or more hidden layers, and an output layer (the present study uses multilayer perceptrons with a single hidden layer). The input layer takes in the stimuli and passes them on to the hidden layer, and the hidden layer outputs to the output layer, which corresponds directly or indirectly to the network's decision. All of the connections between neurons in each layer are weighted, and those weights are altered such that the pattern of weights in the network can represent transformations from input to output. Neural networks have been applied to a variety of fields including language research (Bengio, Ducharme, Vincent, & Jauvin, 2003).

The random forest algorithm (Breiman, 2001) trains many decision trees that are initialized with random weights. Instead of relying on a single decision tree's prediction, it averages over the predictions of all of the trees in the forest to produce an output that is more robust against noise and the vagaries of random weight assignment. Random forests have met with success in language modeling (Xu & Jelinek, 2004).

The k-nearest-neighbors algorithm considers the k training examples that are nearest in the feature space to a test example, and assigns the average value (for regression) or most common class (for classification) of the neighbors as the prediction for the test example. This is a fairly simple approach, and considers only the immediate feature space, but achieves high performance on a variety of measures (e.g. Weinberger, Blitzer, & Saul, 2009).

2.4.3 Psychological/neurological plausibility

In keeping with the theme of psychological/neurological plausibility, it seemed appropriate to restrict GOLD's learners to algorithms that are plausibly implementable in a brain. However, what exactly constitutes a psychologically or neurologically plausible mechanism is not clear. Logically speaking, it is the case that a neural network of suitable size with one or more hidden layers is capable of performing arbitrarily complex mathematical operations (Hornik, Stinchcombe, & White, 1989); if the brain can operate as the mathematically modeled neural networks do, then it is not obvious that an algorithm like SVM, or even SVD, could not be occurring in the brain. Empirically speaking, realistic models of neurons have found success at modeling a variety of algorithms, including fast Fourier transforms (Velik, 2008) and convolution (Blouw & Eliasmith, 2003). Accordingly, it seems inappropriate to rule out a particular algorithm based on its implausibility, and so all of the aforementioned ML algorithms will be used and discussed.
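For readers unfamiliar with these learners, the sketch below fits naive versions of each of the four algorithm families to invented data using scikit-learn. The present study used the Orange implementations, so the class names, parameters, and toy data here are illustrative only and are not the settings actually used.

    import numpy as np
    from sklearn.svm import NuSVR
    from sklearn.neural_network import MLPRegressor
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 20))                                       # stand-in for model-derived features
    y = X[:, 0] * 2 + X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)     # nonlinear toy target
    models = {
        "SVR": NuSVR(kernel="rbf"),
        "Neural network": MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
        "Random forest": RandomForestRegressor(n_estimators=20, random_state=0),
        "kNN": KNeighborsRegressor(n_neighbors=5),
    }
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    for name, model in models.items():
        model.fit(X_tr, y_tr)                                            # learn patterns from training examples
        rmse = mean_squared_error(y_te, model.predict(X_te)) ** 0.5      # generalization to held-out examples
        print(f"{name}: RMSE = {rmse:.3f}")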
2.5 Summary This chapter reviewed relevant literature in language acquisition and representation (the distributional hypothesis), semantic space models, graph models, language-related ERPs, and the basics of machine learning. This past work leads to the general hypothesis that a graph model of distributional data may give rise to similarity measures that can predict behavior as well as neural activity measured via ERP. The next chapter discusses the construction of such a model. 30 Chapter 3: Methods This section will describe the construction of the GOLD model, the LSA model, and the machine learning techniques that will be used to predict behavioral and brain data. 3.1 GOLD model 3.1.1 Introduction The present study will construct a graph-structured model (GOLD) of English based on the distributional hypothesis discussed in the previous chapter. The ultimate goal of GOLD in the present study is to measure similarity of two sets of words by representing their meanings through their relationships to other words. GOLD will not reduce its complexity to a small set of dimensions as in LSA (Landauer et al., 1997) and many other vector space models. Instead, GOLD will take the form of a graph in which each node represents a word and the weights associated with connections between nodes will represent relative frequency and proximity of co-occurrence. The weakest connections between nodes and/or the most infrequent words may be removed from the graph in the interest of reducing necessary computations, and connection values may be normalized, but no further transformations will be applied. Maintaining, rather than reducing, the dimensionality of the data is intended to allow the finest possible comparisons between words by not eliminating any information about their connectivity. 31 3.1.2 Corpus In an attempt to capture modern language usage, we collected a corpus from comments on the forum website Reddit (www.reddit.com), which is one of the most frequently visited websites on the internet (www.alexa.com). The benefits of using a Reddit comment corpus include naturalistic language use, a wide range of authors, a broad array of topics under discussion, and a vast pool of data. Posts in the most popular subsections of Reddit (enumerated at http://subreddits.org/) were queried roughly daily from October 2012 through February 2013, and threads containing more than 100 comments were collected. Comments were parsed at the ?document? level, which consisted of the entire comment thread; the ?paragraph? level, which took
HTML paragraph and line-break
tags as paragraph breaks; and the ?sentence? level, which used sentence-final punctuation such as periods and exclamation points as delimiters in addition to the paragraph breaks. The GOLD model was constructed based on the paragraph level data, as a compromise between the computational complexity of full- document processing and the limited span of the sentence-level data. A total of 19,646 comment threads were collected, totaling 4,342,302 paragraphs, 97,976,253 tokens (word instances), with 431,822 types (unique words). 3.1.3 Preprocessing The corpus was stripped of several classes of letterstrings. Stop words (closed- class words such as it, the, and; using NLTK?s English 127-word stoplist; Bird, Loper, & Klein, 2009) were removed, on the premise that removal of stop words does not impact the output of the network but does dramatically decrease the computational load of network construction and analysis (Bullinaria & Levy, 2012). 32 This removed 50,064,361 tokens, more than half of the corpus. Unique strings that did not occur in a large set of words combined from NLTK?s word lists (size 755,110) and NLTK?s package of WordNet (size 10,771,928) were removed on the premise that these words are not common terms in the language. This step eliminated letterstrings such as fooooood, hasbut, and qxt, and protowords such as facepalm, derp, and awesomesauce. A surprising 362,202 types were removed in this step, for two reasons. First, retaining only words that occur in wordlists is overly conservative, as many legitimate words were not present in the wordlists (such as minnesota and minecraft). Second, the internet is rife with creative misspellings, and these strings are more likely to be unique than correct spellings ? for example, someone may occur with a high frequency but only count as a single unique type, while sumone, someon, somoen, summone, etc., will each count as a separate, unique type. Despite the huge number of types removed in this step, these types accounted for only 2,112,017 tokens, or ~2.15% of the corpus. Lastly, strings that occurred only once in the entire corpus (10,592 tokens, such as osseous and monomorphism) were removed on the premise that very low frequency words will be connected to a very small set of co- occurring words and thus cannot contribute much to the network processing or to psychological meaning. A final list of 58,901 types remained after cleaning, composing a corpus of 45,799,875 tokens. 3.1.4 Constructing the graph Co-occurrence of words within the cleaned corpus was calculated by examining each paragraph in turn, pairing every word in the paragraph with every 33 other word, and incrementing the weight of the connection for each word pair by 1. Paragraphs of length=1 (e.g. "cuuuuuuuuuute" and, mysteriously, ?onychomycosis?) were ignored. The total collection of word pairs and connection weights were fed into graph database software (Neo4j version 1.8.2; Eifrem, 2009) to construct the graph. A total of 58,901 unique words (nodes) and 54,399,032 weighted relationships among those words (edges) were included in the GOLD model. The graph possesses expected properties of a large-scale language network (Steyvers & Tenenbaum, 2005), such as a degree distribution following Zipf?s law and small-world structure. On the advice of Bullinaria and Levy (2007, 2012), the network was reconstructed using a window of size=1, such that words were only connected to words that occurred immediately adjacent in the cleaned paragraphs. 
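The counting scheme just described can be sketched in a few lines. This toy version simply accumulates pair counts in memory, whereas the actual model was loaded into a graph database; the corpus and window settings shown are illustrative.

    from collections import Counter
    from itertools import combinations

    def cooccurrence_counts(paragraphs, window=None):
        """Count word-pair co-occurrences. window=None pairs every word in a paragraph with every
        other word (full-paragraph pairing); window=1 pairs only immediately adjacent words."""
        counts = Counter()
        for tokens in paragraphs:
            if len(tokens) < 2:
                continue                                   # ignore paragraphs of length 1
            if window is None:
                pairs = combinations(tokens, 2)
            else:
                pairs = ((tokens[i], tokens[i + j]) for i in range(len(tokens))
                         for j in range(1, window + 1) if i + j < len(tokens))
            for w1, w2 in pairs:
                if w1 != w2:
                    counts[tuple(sorted((w1, w2)))] += 1   # increment the edge weight for this pair
        return counts

    corpus = [["grumpy", "cat", "meme"], ["cat", "sat", "mat"]]
    print(cooccurrence_counts(corpus))              # full-paragraph pairing
    print(cooccurrence_counts(corpus, window=1))    # adjacent words only, as in the size=1 window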
This network included 58,901 nodes and 10,603,851 weighted edges, and is hereafter referred to as ?smallGOLD?. Figures 1 and 2 display the immediate neighbors of two pairs of words in smallGOLD: grumpy-cat in Figure 1, and sushi-octopus in Figure 2. Figure 1 is too dense to discern much about individual connections, but in Figure 2, edges? thickness and color reflect their weight. The effect of frequency is very apparent in Figure 1, as grumpy occurs 754 times in the corpus, while cat occurs 17,551 times; accordingly, the size of the cat associate cloud dwarfs that of the grumpy associate cloud. Figure 2 displays a pair that is much closer in frequency: sushi occurs 938 times in the corpus, while octopus occurs 512 times. It is worth noting that the higher frequency words are more likely to be in the overlap set (those nodes that are connected to both words of the word pair) merely as a result of frequency. 34 Figure 1. First-order associates of grumpy-cat. Connectivity between associates is not displayed. The large cloud of nodes are the associates of cat that are not also connected to grumpy; the small cloud of nodes are the associates of grumpy that are not also connected to cat; and the round blob between them is the set of nodes that is connected to both grumpy and cat. Figure produced using Force Atlas and Yifan-Hu layout algorithms in Gephi (Bastian, Heymann, & Jacomy, 2009). Figure 2. First-order associates of sushi-octopus. Connectivity between associates is not displayed. This subgraph is small enough to display weight information as well; weight of connections is depicted by color (red=large weights) as well as thickness. Figure produced using Force Atlas and Yifan-Hu layout algorithms in Gephi (Bastian et al., 2009). 35 3.1.5 Normalization Theoretically, high-frequency words carry less information or specificity of meaning than low-frequency words (Finn, 1977; Schatz & Baldwin, 1986). That is, terms with high specificity are used more rarely because their specificity is applicable more rarely (e.g. the concept denoted by antidisestablishmentarianism isn?t relevant often in daily life). In contrast, more frequent words tend to be far less specific and are more likely to be polysemous (e.g. run). In a co-occurrence model, high- frequency words are connected heavily and widely merely as a product of their frequency, rather than necessarily reflecting meaningful relationships. Accordingly, these abundant, heavy weights must be normalized to remove this undue influence of frequency. Any applied normalization method must account for frequencies of the words at both ends of an edge; several standard methods, such as pointwise mutual information (PMI) and association strength (Eck & Waltman, 2009) already do this, while other methods that only normalize node properties, such as inverse document frequency (IDF), may be altered to suit a two-word relationship. The theoretical underpinnings of graph models of language are clear that weights should be normalized, but are not clear on the best manner of normalizing weights. Accordingly, we used 15 different normalization techniques that rely on combinations of raw frequency, document frequency, IDF, and log transforms of these frequencies. 3.1.6 Similarity and association metrics There is evidence (e.g. Weeds & Weir, 2005) that examination of different types of information within a model framework can identify different types of relationships such as similarity and association. 
From a theory-driven perspective, 36 the structure of a word graph may be able to directly capture both types of relationships. Semantic similarity between two items may be reflected in second- order connections, or the intersection between their connections (i.e. are both words connected to the same set of other words?). Association may be captured in first-order connections, or the connection between the two items themselves (are the words connected to each other? If so, how strongly?). These proposed patterns derive from the distributional hypothesis, for the following reasons. Similarity would be represented in second-order connections because two words that connect to the same neighborhood of words may take the same role (e.g. the hot cup of coffee and the warm cup of coffee); similarity would not be captured in first-order connections because natural language doesn?t generally provide that kind of redundancy (e.g. the hot and warm coffee). Association would be represented in first-order connections because those would co-occur directly together, as coffee and hot would be associated in the previous example, as would coffee and warm. From a data-driven perspective, it may be beneficial to view the model as containing useful information of some kind, but remain agnostic as to the exact form of that information. Machine learning techniques will be used to discover and describe, rather than proscribe, what properties of the word graph may be useful in representing different relationships between words. However, theory will inform the properties that are extracted from the graph to be input to the machine learning algorithms. The use of both theory and data to inform model metrics will be useful on several levels. The theory-driven approach is more clearly informed and psychologically valid; the data-driven approach may yield a metric that is more 37 difficult to interpret psychologically, but will produce more accurate predictions. If this is the case, the metrics may be examined more closely to determine what sort of information in the graph it is relying on to produce better predictions, which may in turn inform theory. In this way, if existing theory is incomplete in explaining how relationships are encoded in distributional data, the data-driven method may be used to discover additional factors that might make theory more complete. Figure 3. A simplified graph of grumpy-cat. Overlap nodes are shown on a blue background and nonoverlap nodes are shown on a green background. Ideal metrics for assessing relatedness between words in the GOLD model should (a) reflect psycholinguistic theories, (b) preferably be limited to a set range of values, such as LSA?s -1 to 1, for easy comparison, and (c) differentially consider nodes that are connected to both words in a word pair as well as words that were uniquely connected to each word, as both first- and second-order co-occurrences putatively contribute to relatedness differentially. Figure 3 presents a very small subset of the associates of grumpy-cat to illustrate the overlap and nonoverlap nodes. Association was theorized to be reflected in the direct connection between the two words in a word pair, which reflects the episodic history of how often the two 38 words co-occur. This metric has no upper bound, and a minimum of 0 indicating no relationship. This metric was calculated by extracting the raw weight of the connection between the two words and normalizing it by the normalization methods in Table 1. 
An additional metric was determined by calculating PMI as follows, where w is the weight between the two words in the word pair, df1 and df2 are the document frequencies of the two words, and N is the total number of documents in the corpus:

PMI = log2[ (w / N) / ((df1 / N) × (df2 / N)) ]

Additionally, 15 methods of normalizing the connection weights were used (see Table 7 in Appendix A for normalization methods). All permutations of these association algorithms and normalization methods were calculated from the graph, for a total of 30 association metrics (15 normalization methods x 2 association calculation methods).

Semantic similarity goes beyond the simple co-occurrence between two words and is theoretically reflected in shared or overlapping patterns of connectivity for two words (Lund, Burgess, & Atchley, 1995), such that two words that are connected to the same community of words with similarly weighted connections are more similar. In essence, the graded nature of similarity (e.g. Collins & Loftus, 1975) might be represented by some combination of the overlapping relative to non-overlapping patterns of connections and the fundamental weighting of those connections. This general conception of similarity is akin to Lin's universal similarity measure (Lin 1998b, as reviewed in Budanitsky & Hirst, 2005), although with a definition of overlap that arises from connectivity rather than information directly. This theoretical conception does not prescribe the exact calculation of the metric, so in order to determine the optimal metric for detecting similarity versus association in GOLD, we tested 5 different algorithms (see Appendix A for calculation details). All permutations of the similarity algorithms and normalization methods were calculated from the graph, for a total of 75 similarity metrics (15 normalization methods x 5 similarity calculation methods).

These metrics are redundant to some degree; however, because one of the primary goals of the present study was to establish whether the information necessary to classify stimuli is present in the graph, the full set of metrics was input into the neural network classifiers. Additionally, eliminating metrics based on performance on this stimulus set may provide an inaccurate view of which metrics are necessary or most predictive, because this stimulus set is not designed to span the full space of relationships (e.g. there may be many synonyms and few antonyms in the stimulus set).

3.2 Latent semantic analysis (LSA)

Latent Semantic Analysis (LSA) is a vector-space model commonly used in language research to gauge word relationships and is often considered the gold standard for performance on a range of measures. Accordingly, LSA was used here as a comparison model. LSA was constructed on the corpus described above using gensim (Rehurek & Sojka, 2004). The same preprocessing steps were applied to the corpus and the model was constructed with 300 dimensions, as has been determined to be optimal for LSA model creation for a variety of tasks (Landauer, Laham & Foltz, 1997).

3.3 Machine learning

In both Experiment 1 and Experiment 2, model predictions were quantified using the Orange machine learning software suite (Demsar et al., 2013). Classifiers were trained for tasks that required sorting stimuli into discrete groups, and regressors were trained for tasks that required predicting continuous values, using the algorithms described in section 2.4.2.

3.4 Summary

Chapter 3 described the construction of the GOLD model and an LSA model. These models will be used to predict rating data in Experiment 1 in Chapter 4, and neural activity in Experiment 2 in Chapter 5.
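Before turning to the experiments, the sketch below recaps, in deliberately simplified form, how association and similarity metrics of the kind described in section 3.1.6 can be read off a word graph. The exact normalizations and similarity algorithms used in the present study are those in Appendix A; the toy graph, particular formulas, and function names here are illustrative only.

    import math

    def association_raw(graph, w1, w2):
        """First-order association: the weight of the direct edge between the two words."""
        return graph.get(w1, {}).get(w2, 0)

    def association_pmi(weight, df1, df2, n_docs):
        """A PMI-style association: log of observed co-occurrence relative to chance co-occurrence."""
        if weight == 0:
            return 0.0
        return math.log2((weight / n_docs) / ((df1 / n_docs) * (df2 / n_docs)))

    def similarity_overlap(graph, w1, w2):
        """Second-order similarity: shared neighbourhood relative to the combined neighbourhood."""
        n1, n2 = set(graph.get(w1, {})), set(graph.get(w2, {}))
        if not n1 or not n2:
            return 0.0
        return len(n1 & n2) / len(n1 | n2)     # Jaccard-style overlap of first-order connections

    graph = {"coffee": {"hot": 3, "cup": 5}, "tea": {"hot": 2, "cup": 4}, "hot": {"coffee": 3, "tea": 2}}
    print(association_raw(graph, "coffee", "hot"))      # directly co-occurring words are associated
    print(similarity_overlap(graph, "coffee", "tea"))   # words sharing neighbours are similar
    print(association_pmi(3, 10, 8, 1000))              # high when co-occurrence exceeds chance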
Chapter 4: Experiment 1 (behavioral data)

Assessing relationships between words by asking participants to make rating judgments is a commonly used method that dates to at least the 1960s, with Rubenstein and Goodenough's (1965) experimental validation of contemporary theories of conceptual similarity. Rated word pairs of this nature are often used as standards of comparison for computational models of language (Budanitsky & Hirst, 2006; Kintsch & Mangalath, 2011) as they are thought to reflect theoretical accounts of semantic knowledge as well as empirical human judgment.

4.1 Stimuli

4.1.1 For human subjects in Experiment 1a and Experiment 2

The stimulus set was limited to 350-400 word pairs based on the duration of each trial (~4s, plus ITI) and the tolerance of participants to lengthy sessions. Word pairs were drawn from existing studies (Chiarello, Burgess, & Richards, 1990; Thompson-Schill, Kurtz, & Gabrieli, 1998; and Miller & Charles, 1991 and Rubenstein & Goodenough, 1965 as cited in Budanitsky & Hirst), and then additional word pairs were generated from the Reddit corpus. First, the lexicon of the cleaned Reddit corpus was reduced to words with frequency > 100 and length > 2. Words appearing in a taboo word list (words referring to racial slurs, explicit violence, etc.) were removed. Then, the following procedure attempted to produce a stimulus set from these words that spanned the relatedness space. Ten thousand words were randomly selected from the reduced word list. These 10,000 words were randomly paired several times and sorted into bins based on their LSA cosines10. Two hundred word pairs from each of the 15 LSA bins were randomly selected, and those pairs were further whittled down by removing word pairs containing a word with multiple meanings. Because word frequency can influence behavior and neural activity, an attempt was made to balance word pairs in each bin on frequency, such that the average frequencies of words in each bin were equivalent, by removing word pairs with extreme frequency values (both high and low). However, this attempt was not entirely successful, because higher frequency words tend to have higher cosines with other words of high or medium-high frequency. Word pairs that are unrelated according to LSA were more likely to be lower frequency, so the most unrelated bins have a slightly lower average frequency (see Appendix D).

Many words were duplicated between the word pairs drawn from other studies and the randomly generated pairs. Duplicated stimuli are inappropriate for behavioral as well as EEG paradigms, which generally aim to avoid identical word repetition (unless in a "repetition" condition). Accordingly, these sets of word pairs were reduced to sets containing only unique words. The final set of words totaled 345 pairs. Four pairs were later identified as containing duplicates with the remaining set, and were removed, leaving 341 pairs. During data collection, five word pairs that should have been rejected during the taboo word screening were identified. These words were changed to non-taboo words for the remaining participants and the five involved pairs were rejected post-hoc. Final analyses were conducted on 336 word pairs.

10 Due to a typo in the author's code to generate the LSA model, these LSA values are based on a 30-dimensional model rather than a 300-dimensional model. This typo was discovered after human subjects data collection but before data analysis, so all later LSA values used in the analyses are from the (correct) 300-dimensional model. This error is not a major concern because the purpose of using LSA during stimulus selection was to group stimuli into very general bins of similarity, so precise assessment is not crucial. Additionally, the two versions of the model correlate with a Pearson correlation of 0.628 and a Spearman correlation of 0.716.

4.1.2 For model predictions in Experiment 1b

The stimulus set described above was constrained in size due to the needs of human participants. If no humans are involved, or if pre-collected human data are used, then the stimulus set can be quite large. To expand upon some of the stimuli in the set described above, we tested the GOLD model and LSA on the complete sets of word pair stimuli from Plaut & Booth (2000) and Chiarello et al. (1990). Plaut and Booth's 240 word pairs are categorized as related and unrelated, based on free association norms (Nelson et al., 1999). Chiarello et al.'s 144 word pairs are sorted into three categories according to relationship type: associated only, similar only, and word pairs that are both similar and associated. These categorizations were assigned based on several sets of norms, and the words were balanced on length, frequency, and imageability.

It is worth noting that some of the stimuli from Chiarello et al. (1990) have become dated; ostensibly related pairs such as decoy-duck were rated as unrelated by all participants in Experiment 1a, suggesting that this pair is no longer reliably associated in the modern lexicon. The same may be argued of some of the older commonly used sets, such as Rubenstein and Goodenough's (1965) set, which includes terms with vulgar connotations in modern parlance. Accordingly, post-hoc sorting and plotting of the ERP data collected in Experiment 2 was based on rating data as well as predefined word categories, as the rating data may better reflect the lexicon and language experience of the ERP participants.

4.2 Participants (1a)

Reaction times and judgment data were collected in two tasks: the first was a task of similarity judgment, and the second a task of association judgment. Participants were 34 undergraduate students (3 male) in the association task, and 31 undergraduate students (7 male) in the similarity task, recruited from the Psychology Department participant pool and compensated with course credit. All were native English speakers. None of the participants who contributed data to the word pair judgment tasks also contributed data to the ERP task.

4.3 Procedure (1a)

In each of the tasks, participants gave informed consent and then were seated at a standard desktop computer. Participants were first instructed on the nature of the relationship they were to judge, and then completed several example trials with the experimenter, discussing their judgments on each example trial. After the experimenter was satisfied that the instructions were understood, the participant then completed 341 trials, self-paced. Each trial consisted of a word pair presented with a Likert scale (1-7) with ends labeled as maximally or minimally related based on the specific relationship in the task.

4.4 Data analysis (1a)

Brief post-hoc interviews with participants indicated some difficulty regarding task instructions, ranging from forgetting the instructions partway through the task to inconsistency in following task-specific instructions.
Data were cleaned by removing trials whose RTs were below 500ms (36 out of 11,594 trials in the association judgment task, and 12 out of 10,571 trials in the similarity judgment task).

4.5 Results

4.5.1 Ratings (1a)

Rating data on the similarity and association judgment tasks were treated as continuous data and were separately predicted using several regression algorithms: support vector regressors (SVR), random forests, and k-nearest-neighbors. GOLD output and LSA were separately used as input features to these algorithms. Performance measures are averaged across 10 iterations of training and testing on randomly selected subsets of the data (70/30 train/test). Performance was quantified via r-squared and root mean squared error (RMSE), which is not meaningful alone and is thus compared to a predictor that always predicts the training set mean. The default parameters from the Orange software suite were used for each algorithm: SVM regression (type=nu, cost=8.0, complexity bound=0.5, kernel type=RBF, tolerance=.001), random forests (maximum 20 trees, minimum 5 instances per leaf), and k-nearest-neighbors (5 neighbors, weighting by Euclidean distance, normalizing continuous attributes).

Table 1. Regressor performance on similarity and association ratings. Highest performance for each model is in a red font.

                                  Association            Similarity
Model        Algorithm            RMSE      r2           RMSE      r2
             Mean                 2.0308    -0.0173      1.6779    -0.015
smallGOLD    SVM Regression       1.3869    0.5255       1.2273    0.4571
             Random Forest        1.2625    0.6068       1.1081    0.5575
             kNN                  1.4437    0.4859       1.3023    0.3887
GOLD         SVM Regression       1.3163    0.5726       1.2025    0.4789
             Random Forest        1.2498    0.6147       1.1595    0.5155
             kNN                  1.3336    0.5613       1.2709    0.4179
LSA          SVM Regression       1.6461    0.3317       1.3752    0.3184
             Random Forest        1.7227    0.2679       1.4082    0.2853
             kNN                  1.9561    0.0562       1.5906    0.0881

Figure 4. Similarity predictions from one train/test iteration using a random forest trained on smallGOLD (r=0.75).

Figure 5. Association predictions from one train/test iteration using a random forest trained on smallGOLD (r=0.79).

GOLD and smallGOLD performed roughly equally, and quite well, at the task of predicting similarity and association ratings, with a maximum Pearson's r = 0.78. One set of train/test from each set of ratings was randomly selected for display in Figures 4 and 5. LSA did not perform as well at this task; to ensure a fair assessment, raw Pearson correlations were also calculated between LSA and association ratings (r = 0.5847, r2 = 0.3418) and between LSA and similarity ratings (r = 0.5827, r2 = 0.3395).

While GOLD performed well on the task of predicting continuous rating data, the high variability in human ratings suggests that these relationships may not all be "true", in the sense that they are not agreed upon by multiple speakers. A subset of the word pairs judged in the above tasks were drawn from sets of words with predefined relationships, such as the words from Chiarello et al. (1990), which were categorized into words that were associated only, similar only, or both similar and associated. These predefinitions rest on datasets that may more reliably reflect the underlying word relationships, if at a coarser scale. Another set of words, from Plaut & Booth (2000), was categorized as related or unrelated, regardless of relationship type, which is at an even coarser scale.
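The classification analyses reported next follow the same train/test logic as the regressions above. The sketch below illustrates, with invented features and labels, how accuracy, sensitivity, and specificity can be computed over repeated 70/30 splits; scikit-learn is used here purely as a stand-in for the Orange implementations used in the present study, and none of the values are real.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(5)
    X = rng.normal(size=(240, 105))           # stand-in for graph-derived metrics per word pair
    y = rng.integers(0, 2, size=240)          # stand-in for related (1) / unrelated (0) labels
    accs, sens, specs = [], [], []
    for seed in range(10):                    # 10 iterations of 70/30 train/test
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed, stratify=y)
        clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=seed)
        clf.fit(X_tr, y_tr)
        tn, fp, fn, tp = confusion_matrix(y_te, clf.predict(X_te)).ravel()
        accs.append((tp + tn) / (tp + tn + fp + fn))
        sens.append(tp / (tp + fn))           # hit rate, treating label 1 as "related"
        specs.append(tn / (tn + fp))          # correct rejections of "unrelated" pairs
    print(f"accuracy={np.mean(accs):.3f}  sensitivity={np.mean(sens):.3f}  specificity={np.mean(specs):.3f}")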
Accordingly, we next tested model performance on these full sets of words: first, the simpler classification task of related-unrelated pairs from Plaut & Booth (2000), and then the more complex task of distinguishing between the types of word relationships in the pairs from Chiarello et al. (1990).

4.6.1 Word pair categories (1b)

4.6.1.1 Distinguishing between related and unrelated words

Performance measures are averaged across 10 iterations of training and testing on randomly selected subsets of the data (70/30 train/test). Performance measures of accuracy, sensitivity (rate of true positives/"hits"), and specificity (rate of true negatives/"correct rejections") are presented, as well as confusion matrices. LSA was tested using several algorithms; best overall performance was achieved with neural networks (parameters: 1 hidden layer, 20 hidden layer neurons, regularization factor=1.0, maximum 300 iterations), so those data are presented here.

Table 2. Classifier performance on the Plaut and Booth (2000) word pairs.

                          Sensitivity             Specificity
            Accuracy      Related    Unrelated    Related    Unrelated
smallGOLD   0.9000        0.8914     0.9086       0.9086     0.8914
GOLD        0.9043        0.9000     0.9086       0.9086     0.9000
LSA         0.7443        0.6629     0.8257       0.8257     0.6629

Table 3. Classifier confusion matrices for the Plaut and Booth (2000) word pairs. Red percentages are the correct classifications.

                               Predicted related    Predicted unrelated
smallGOLD   True   Related     89.1%                10.9%
            class  Unrelated   9.1%                 90.9%
GOLD        True   Related     90.0%                10.0%
            class  Unrelated   9.1%                 90.9%
LSA         True   Related     66.3%                31.1%
            class  Unrelated   24.9%                82.6%

The two GOLD models demonstrated nearly identical, high performance (90% accuracy). Inspection of word pairs that were incorrectly classified reveals that the unrelated words misclassified as related were sometimes clear errors (right-found) but often perhaps related (e.g. split-fight, yell-burst, treat-equal). GOLD failed to identify some clearly related word pairs (e.g. horse-stall, great-super, take-bring, gives-share, slice-piece, glue-paste, right-wrong, live-death). It appears that several of these pairs have more specific relationships than relatedness, including synonymy and antonymy. LSA performed well (74% accuracy); its most common error was to misclassify related words as unrelated.

4.6.1.2 Distinguishing among relationship types

Having established that GOLD can distinguish related from unrelated word pairs, we turn to the task of distinguishing the type of relatedness. As stated earlier, the distinction between association and semantic similarity is often a matter of degree, as these factors are not orthogonal to one another. Thus, finding word pairs that are stronger in one dimension than the other or are stronger in both is a difficult task. Chiarello and colleagues (1990) have identified 144 such word pairs that are semantically related (table-bed) based upon category membership norms, associatively related (mold-bread) based upon free-association norms, and both semantically and associatively related (aunt-uncle). Following Lund, Burgess, and Atchley (1995, Experiment 3), we tested whether the metrics of the GOLD model could reliably classify these patterns of relationships and compared the results of the GOLD model to those of LSA.

Table 4. Classifier performance on the Chiarello et al. (1990) word pairs.
                          Sensitivity                           Specificity
            Accuracy      Associated   Both      Similar        Associated   Both      Similar
smallGOLD   0.6023        0.6000       0.4857    0.7214         0.8250       0.7621    0.8172
GOLD        0.5791        0.6067       0.4429    0.6857         0.7250       0.7897    0.8517
LSA         0.3884        0.2667       0.5857    0.3214         0.7643       0.6862    0.6345

Table 5. Classifier confusion matrices for the Chiarello et al. (1990) word pairs. Red percentages are the correct classifications.

                                Predicted associated    Predicted both    Predicted similar
smallGOLD   True   Associated   60.0%                   24.7%             15.3%
            class  Both         30.0%                   48.6%             21.4%
                   Similar      5.0%                    22.9%             72.1%
GOLD        True   Associated   60.7%                   24.0%             15.3%
            class  Both         41.4%                   44.3%             14.3%
                   Similar      13.6%                   17.9%             68.6%
LSA         True   Associated   26.7%                   27.3%             46.0%
            class  Both         15.0%                   58.6%             26.4%
                   Similar      32.1%                   35.7%             32.1%

Overall accuracy is best for the smallGOLD model. Inspecting the confusion matrices indicates that the GOLD models' most common error is to misclassify word pairs that are both similar and associated as associated-only; the next most common mistake is the reverse, where associated-only word pairs are misclassified as both similar and associated. LSA's most common error is to misclassify the associated-only words as similar-only. It also assigns similar-only words equally often to the three categories.

4.6.1.3 Feature analysis

This initial exploratory testing of the GOLD model relied on the "shotgun approach" of feature generation, in which all of the combinations of normalization and metric calculation were used as inputs to the neural network. In order to determine which features the algorithm is relying on to produce its classifications, and perhaps to suggest which types of information are important for judging these word relationships, we investigated feature relevance using one- and two-feature classifiers, as well as standard feature selection methods. For the one- and two-feature classifiers, a neural network learner classified the similar/associated/both word pairs on 5 iterations of 70/30 train/test splits. In the first round of analysis, the neural network was given each of the 105 smallGOLD features individually; maximum accuracy of the 105 classifiers reached 50%. The full set of 105 features was sorted and the 50 highest-accuracy features were retained. In the second round of analysis, the neural network was given all combinations of two features from these 50 features, one pair of features at a time; maximum accuracy reached 63%, which is on par with the full set of features. Inspection of these feature pairs revealed that the majority of the top ranked pairs included two types of metrics: Method 5 from the similarity metrics (which considered only overlapping nodes, weighted by magnitude difference and normalized by size) and the PMI calculation of association. The top 30 performers were all pairs that included one association and one similarity measure. Limiting the neural network inputs to those two metrics (30 features) yielded 63% accuracy. Using additional feature selection (linear SVM weights) to reduce the number of features to 10 produced 65% accuracy; reducing the number of features to 5 boosted accuracy to 68%, which is well in excess of performance using the full set. However, these performance outcomes should be interpreted as exploratory only.
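The shape of this one- and two-feature search can be illustrated schematically as follows. The features, labels, and network parameters below are invented, and the listing is meant only to show the form of the procedure, not the code used in the present analysis.

    import numpy as np
    from itertools import combinations
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    def score_feature_subset(X, y, columns, iterations=5):
        """Mean test accuracy of a small classifier trained on only the listed feature columns."""
        accs = []
        for seed in range(iterations):
            X_tr, X_te, y_tr, y_te = train_test_split(X[:, columns], y, test_size=0.3,
                                                      random_state=seed, stratify=y)
            clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=seed)
            clf.fit(X_tr, y_tr)
            accs.append(accuracy_score(y_te, clf.predict(X_te)))
        return float(np.mean(accs))

    rng = np.random.default_rng(3)
    X = rng.normal(size=(144, 10))                   # stand-in for graph-derived features
    y = rng.integers(0, 3, size=144)                 # stand-in for associated/similar/both labels
    single = sorted(range(X.shape[1]), key=lambda c: score_feature_subset(X, y, [c]), reverse=True)
    top = single[:5]                                 # retain the best single features...
    best_pair = max(combinations(top, 2),            # ...then search all pairs among them
                    key=lambda cols: score_feature_subset(X, y, list(cols)))
    print("best single feature:", single[0], "best pair:", best_pair)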
The broad conclusion regarding features is that the combination of association (direct connections between the two words) and similarity (based on the overlapping and nonoverlapping neighbors of the two words) metrics is more powerful at predicting category than either alone. It may be possible to conclude that the similarity metric considering normalized overlap only and the PMI calculation of association are the most useful, but the similar/associated/both word pairs are not designed to span the language space and thus this finding may not generalize to other regions of the graph. Chapter 5: Experiment 2 (neural data) 5.1 Participants Participants were 20 graduate and undergraduate students recruited from the University of Maryland campus. Participants (7 male, 13 female; mean age = 25.15 53 and SD = 2.79) were all right-handed. One male participant?s data were not considered in analyses, due to scores far below the sample mean on all of the reading and language assessments. All participants gave informed consent and were compensated for their participation with snacks. 5.2 Procedure In the first hour of the study, participants completed the Peabody Picture Vocabulary Test (PPVT; Dunn & Dunn, 2007), both subtests of the Test of Word Reading Efficiency (TOWRE; Torgensen, Wagner, & Rashotte, 1999) the Nelson- Denny Vocabulary and Comprehension tests (Brown, Fishco, & Hanna, 1993), and a handedness questionnaire. All assessments were pencil-and-paper. The PPVT is a standardized measure of receptive vocabulary in which participants must identify pictures that represent the meanings of orally presented words. The TOWRE consists of two subtests: Sight Word Efficiency and Phonetic Decoding Efficiency. The Sight Word Efficiency subtest is a measure of word reading fluency in which participants must read a list of words in 45 seconds, emphasizing both speed and accuracy The Phonetic Decoding Efficiency subtest is a measure of phonemic decoding skill in which participants read a list of pronounceable nonwords (e.g. pelnador) in 45 seconds, again emphasizing both speed and accuracy. The Nelson-Denny comprises a multiple-choice vocabulary test and a comprehensions test in which participants read passages and answer questions based on those passages. These assessments were not analyzed in the following work, but were rather used to ensure that participants were high-skill readers. The mean performance of the 19 participants who contributed ERP data is presented in Appendix B. 54 Following these behavioral measures, participants were fitted with the EEG cap and electrodes, seated in front of a standard LCD monitor, and asked to place their right hand on the number pad of the keyboard. Responses were made using the ?1? and ?2? keys on the number pad, and the next trial advanced using the ?enter? key on the number pad as well, all with the right hand. Experimental trials proceeded as in Figure 8 below. Each trial began with a fixation cross in the center of the screen for 450-550ms, jittered. The first word of the pair appeared for 800ms, followed by a blank screen for 200ms; then the second word of the pair appeared for 800ms, followed by a blank screen for 1000ms, followed by a prompt to judge if the pair was related or unrelated. The prompt remained onscreen until the participant responded. Between trials, a neutral screen encouraged participants to blink as needed before pressing enter to begin the next trial. 
Participants were encouraged to rest if their EEG appeared to be showing higher alpha power, if they appeared drowsy, or at their own discretion. Each participant completed all 341 trials in roughly 30 minutes. 55 Figure 6. Trial template in the ERP task. 5.3 Data collection and analysis 5.2.1 ERP collection and preprocessing EEG data were collected during the above task using the Biosemi system with a 64 channel electrode cap, referenced to linked mastoids. In two participants, one mastoid was irrecoverably noisy and/or separated from the scalp and thus their data were referenced to a single mastoid. In cases where a single scalp electrode failed (1 subject), it was interpolated. No more than one electrode was interpolated on any subject. No eye leads (EOG) were used; instead any trials contaminated by blink artifacts were rejected entirely. EEG was epoched (-200ms to 800ms), filtered (0.1Hz to 30Hz), and individual epochs rejected based on automated artifact identification (sliding window average). Trials were grand averaged by (a) word or response 56 characteristics, discussed below with visualizations, and (b) by individual word pair, to be exported for per-stimulus ERP values. 5.2.2 Features for machine learning A problem encountered in the course of ?predicting neural activity? is deciding what, exactly, should be predicted about neural activity. In the present study, the 64 channel electrode cap measured 512 timepoints per electrode per trial, which yielded ~30,000 data points per trial. It is reasonable to expect that only those timepoints and electrodes where the effect of word relationships is present will be predictable, so the tens of thousands of data points from other electrodes and time windows are not appropriate to consider. The n400 is typically measured as an average over the 300- 500ms time window, and that the component is typically maximal over centro- parietal sites (Lau, Phillips, & Poeppel, 2008), so the present study restricted predictions to the average in the n400 window at the Pz and CPz sites. 5.3 Results 5.3.1 ERP visualizations and sanity checks Grand average ERPs were visualized by averaging across trials sorted into various conditions in several ways: first, by individual subject responses (the ?yes? or ?no? judgments rendered while ERPs were collected); second, by the behavioral rating data in the relatedness and similarity tasks; and third, by category as defined in previous literature (the subset of words that appeared in the Chiarello et al. 1990 paper). As a sanity check, the first words of the word pairs in the yes-no judgment 57 figure were plotted as well, to ensure no pre-existing differences that might reflect any number of errors. Figure 7. First and second words of the wordpairs, sorted by participant response. Figure 9 above displays words that participants rated as related (?yes?) and unrelated (?no?). The first and second words of the word pair are displayed. Both word1s show a strong negativity in n400 window, which is to be expected, and are almost identical. Differences between the ?yes? and ?no? responses appear in the second word of the word pairs; related words produced an attenuated n400 compared to the first words of the pairs, and unrelated words produced either no difference or a smaller attenuation. This figure is assurance that the paradigm worked as intended in the broadest sense, and that the ERPs are thus far consistent with the literature. 
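The per-stimulus feature described in section 5.2.2 above (mean voltage in the n400 window at a centro-parietal site) can be extracted with a few lines. The simulated ERP, channel montage, and sampling values below are invented for illustration and do not correspond to the recorded data.

    import numpy as np

    def n400_feature(erp, times, channel_names, window=(0.300, 0.500), site="Pz"):
        """Mean voltage in the n400 window at one site for a single averaged ERP.
        erp: channels x timepoints; times: latencies in seconds relative to stimulus onset."""
        ch = channel_names.index(site)
        mask = (times >= window[0]) & (times <= window[1])
        return float(erp[ch, mask].mean())

    # Illustrative example: one grand-averaged word-pair ERP from -200 ms to 800 ms.
    times = np.linspace(-0.2, 0.8, 512)
    channels = ["Fz", "Cz", "CPz", "Pz", "Oz"]
    rng = np.random.default_rng(4)
    erp = rng.normal(0, 1, size=(len(channels), len(times)))
    erp[channels.index("Pz"), (times > 0.3) & (times < 0.5)] -= 3.0   # simulated n400 dip at Pz
    print(n400_feature(erp, times, channels))   # one feature value per word pair, fed to the regressors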
58 The next set of figures will visualize the ERP data in several ways, and conduct statistical sets on certain contrasts. First, the ERPs sorted according to word pair rating will be presented and analyzed; then ERPs sorted according to category (the word pairs from Chiarello et al., 1990) will be presented and analyzed. Figure 8. Second words of the word pairs, sorted into high and low similarity and association ratings. Figure 10 above shows the second words of the word pairs, sorted into bins according to their ratings (by a different set of participants, in Experiment 1a). However, each trial contributes to two bins in these visualizations (each pair has both a similarity and an association rating), and many word pairs that were rated as minimally associated were also rated as minimally similar, so the two traces that look nearly identical are nearly identical, because they comprise a nearly identical set of 59 ERPs. In this figure it appears that words with the lowest ratings produced a large n400, and that highly rated similar and highly rated associated words each produced an attenuation of the n400 compared to their lower-rated counterparts. To examine this in more detail, Figure 11 and 12 present trials sorted by ratings binned into 6 bins, where each bin spans a single interval of the 7-point Likert scale (e.g. bin 1 holds word pairs rated from 1 to 2, bin 2 holds word pairs rated from 2 to 3, etc.). Figure 9. ERPs sorted by association ratings in six ordered bins. Figure 11 shows the traces for the association ratings, divided into six bins. Across sites, but particularly clearly at Pz, the magnitude of voltage dip in the n400 window appears to be modulated by the degree of association. 60 Figure 10. ERPs sorted by similarity ratings in six ordered bins. Figure 12 is as Figure 11, but displays bins of similarity ratings rather than association ratings. The modulation of the n400 by degree of similarity is still apparent but less clear. This may reflect a genuine effect of similarity, or it may be the case that the range of similarity in the present stimulus set is smaller or differently distributed than the range of association. However, this and the previous figures plotted only mean waveforms and included no variability information and no statistical tests. To determine if the ratings are reflected by real differences in the ERPs, statistical analyses were conducted on the highest versus the lowest bins of each of similarity and association, using t-maps or raster plots produced using the cluster- based permutation test from the Mass Univariate ERP Toolbox (Groppe, Urbach, & 61 Kutas, 2011). Cluster-based permutation tests capitalize on the broadly distributed effects of interest as well as the spatial density of the 64-channel electrode array. Additionally, although there are clear a priori predictions regarding the spatiotemporal distribution of effects for highly similar words, it is not known how these effects may change spatially or temporally with other types or degrees of relationships, and thus testing the entire timecourse and all electrodes using the cluster-based permutation test is appropriate (Groppe, Urbach, & Kutas, 2011). Raster plots were produced with the Mass Univariate ERP Toolbox. The raster plots display electrodes on the vertical axis (upper set is left hemisphere, middle set is midline, and lower set is right hemisphere; within each set, moving from top to bottom moves from anterior to posterior), and time on the horizontal axis. 
Filled electrode x timepoint boxes represent spatiotemporal locations with a significant difference (white boxes = condition 1 is more positive than condition 2, black boxes = condition 1 is more negative than condition 2). Figure 11. Main effect of association (lowest-highest). 62 Figure 12. Main effect of similarity: lowest-highest Figure 13 shows an n400 effect of association arising at around 300ms and extending through the rest of the epoch. Figure 14 shows and n400 effect of similarity, also arising at around 300ms and extending through the rest of the epoch. To determine if the spatiotemporal distributions of these two effects, are different, the interaction was tested as well (figure not shown). It was not significant at any timepoint: the two effects arise at the same time, taper off with the same general timescale, and are broadly distributed across electrodes. Some studies have found differences in spatial or temporal distribution of association and similarity effects (Koivisto & Revonsuo, 2001), but this finding was not replicated in the present ratings data. The next section examines ERPs to these word relationships sorted by predetermined category, rather than ratings. 63 Figure 13. Chiarello et al. (1990) words vs. lowest rated words. Figure 15 above displays the Chiarello et al. (1990) associated, similar, and similar-and-associated words compared to the words with the lowest ratings. All of the Chiarello et al. (1990) words produce some degree of attenuation of the n400 of the lowest rated words, but the degree of association appears to be graded. Words with both types of relationship produce the smallest n400, similar words produce a larger n400, and associated words produce an even larger n400. To determine if these categories are reflected by real differences in the ERPs, statistical analyses were conducted on the three main effects of similarity, association, and both, as well as the interactions between these effects, using the cluster analysis described above. For present purposes, the word pairs rated lowest are referred to as ?unrelated? and are used as a baseline to which the categorically related words may be compared. 64 Figure 14. Main effect of association (associated-unrelated) Figure 15. Main effect of similarity (similar-unrelated) 65 Figure 16. Main effect of similarity and association (both-unrelated) Figures 16, 17, and 18 demonstrate main effects of the associated, similar, and both associated and similar relationships. In all three main effects, an n400 attenuation appears by roughly 300 or 350ms, such that the related words are more positive than the unrelated words, and lasts for the duration of the epoch. These rasters do show some variability, so the next section will present the interactions to test if the effects of each relationship type are different. 66 Figure 17. Interaction between association and similarity (associated-similar). Figure 18. Interaction between association and both (associated-both) 67 Figure 19. Interaction between similarity and both (similar-both) Figures 19, 20, and 21 reveal that the interaction between the effect of similarity and the effect of association is not significant anywhere, but similarity and association each produce a smaller attenuation than both relationships together in the classic n400 window (300-500ms). These data support an account that the total relationship between two rods produces a particular n400 magnitude, rather than similarity or association contributing unique variance to the n400 magnitude. 
However, of the entire set of 341 word pairs for which neural data were collected, only a small subset was drawn from the Chiarello et al. (1990) pairs (30 associated-only pairs, 23 similar-only pairs, and 21 similar and associated pairs). The author has previously found significant n400 effects and interactions with a similar number of trials per condition on the same hardware, software, and workflow, and with similar participants (Jackson & Bolger, in preparation), but, in the present study, it is possible that certain effects are present but would only reach significance with a larger pool of trials per participant. However, the choice of analysis (cluster analysis using the Mass Univariate Toolbox) gives a high probability of finding an effect if it is large, which n400 effects tend to be. In summary, it is possible that a difference between similarity and association would be apparent in ERP under different circumstances.

All of these visualizations demonstrate a clear n400, followed by a difference that lasts throughout the remainder of the epoch at a subset of the electrodes. This is not a common finding in the ERP literature, but it is a pattern that we have observed in language tasks recorded on the same equipment with a similar pool of subjects in the past. Whether this extended difference represents a genuine finding or an error of some sort in collection or processing is not clear. However, for the present, analyses will be confined to the n400 window, in which these ERPs display a canonical form. In summary, initial examinations of the ERPs are generally consistent with previous literature. Similarity and association are both reflected in the n400, though perhaps not differentially. We next turn to predictions of these ERPs.

5.3.2 Model predictions of ERP voltage

Average voltages in the n400 time window at Pz, averaged across subjects, were treated as continuous data and were predicted using several regression algorithms: support vector regressors (SVR), random forests, and k-nearest-neighbors. GOLD output and LSA were separately used as input features to these algorithms. Additionally, similarity ratings and association ratings from Experiment 1a were used as predictors (each individually, and summed) to determine if that information is sufficient to predict neural activity. Performance was quantified via RMSE and r2 as in Experiment 1a, using the same algorithm parameters.

Table 6. Regressor performance on voltage at Pz, 300-500ms.

                                    Pz, 300-500ms
Model        Algorithm              RMSE      r2
             Mean                   2.0414    -0.0038
smallGOLD    SVM Regression         1.9999    0.0366
             Random Forest          2.1100    -0.0724
             kNN                    2.3154    -0.2914
LSA          SVM Regression         2.0499    -0.0122
             Random Forest          2.2054    -0.1716
             kNN                    2.5260    -0.5370
Ratings      SVM Regression         2.0271    0.0102
             Random Forest          2.1136    -0.076
             kNN                    2.4171    -0.4073

Performance on this task was best in all cases using SVM, but the maximum performance achieved was smallGOLD's r2 of 0.0366, which is unimpressive. It is particularly strange that the ratings produce such poor performance as well. However, note that several of the r2 values are negative; this may indicate that r2 is an inappropriate measure, perhaps due to nonlinearity in the ERP data (Tremblay & Newman, 2013). Following Carlson et al. (2014), Spearman correlations were calculated for one randomly selected set of train/test for each prediction method. To ensure that the machine learning methods did not detract from the performance that a raw correlation would produce, those correlations were calculated as well.
Table 7. Correlations between metrics and ERP measures.

                 Pearson   Spearman
SVM-smGOLD        0.237     0.246
SVM-LSA300       -0.103    -0.101
SVM-ratings       0.209     0.157
LSAval300        -0.112    -0.099
AssocRating      -0.079    -0.059
SimRating        -0.062     0.023

As this single iteration of train/test may be a fluke, the correlations between predicted ERP values and true ERP values for the test sets of 20 iterations of train/test were calculated for smallGOLD, SVM-LSA, and the raw LSA values. The correlations are reported in full in Appendix C. Correlations between the true ERP values and the raw LSA values were slightly higher than the SVM-LSA values, so raw LSA was taken as the best LSA performance. A t-test assuming unequal variances (Ruxton, 2006) was conducted on the Spearman correlations for smallGOLD and LSA; this test found a significant difference, t(30) = 7.02, p < .001, such that smallGOLD correlations (M = 0.228, SD = 0.084) were significantly higher than LSA's (M = 0.076, SD = 0.048).

In comparison to the better behavioral data predictions in Experiment 1, this may also seem unimpressive. However, it is important to note standards from the literature. To refer to a recent example of predicting neuroimaging data, Carlson et al. (2014) calculate Spearman correlations between various computational models and brain activity in two different brain regions; the maximum Spearman correlation that any of the models achieved was ρ = 0.154 (shown in their Figure 2). Accordingly, the mean smallGOLD performance of ρ = 0.228 may be acceptable.
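The following is a minimal sketch, in Python with scikit-learn and SciPy, of the kind of pipeline described above: iterated train/test splits, a Spearman correlation between predicted and observed voltages on each test set, and an unequal-variance (Welch) t-test on the per-iteration correlations. It is not the original analysis code, and the regressor parameters are library defaults rather than the parameters used in Experiment 1a; the inputs X_gold, X_lsa, and y_erp are placeholders for the GOLD metrics, the LSA values, and the mean Pz voltages.

```python
# Sketch only (not the original analysis code): iterated train/test evaluation of
# two feature sets against mean N400-window voltage, compared via Welch's t-test.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from scipy.stats import spearmanr, ttest_ind

def iteration_spearmans(X, y, n_iter=20, test_size=0.3, seed=0):
    """Spearman correlation between predicted and true voltages for each test split."""
    rhos = []
    for i in range(n_iter):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + i)
        pred = SVR().fit(X_tr, y_tr).predict(X_te)   # default SVR parameters
        rho, _ = spearmanr(pred, y_te)
        rhos.append(rho)
    return np.array(rhos)

# X_gold, X_lsa, y_erp are assumed inputs (feature matrices and voltage vector).
rho_gold = iteration_spearmans(X_gold, y_erp)
rho_lsa = iteration_spearmans(X_lsa, y_erp)

# Welch's t-test (unequal variances) on the per-iteration correlations.
t_stat, p_val = ttest_ind(rho_gold, rho_lsa, equal_var=False)
print(f"GOLD mean rho = {rho_gold.mean():.3f}, LSA mean rho = {rho_lsa.mean():.3f}, "
      f"t = {t_stat:.2f}, p = {p_val:.4f}")
```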
Chapter 6: Discussion

6.1 Model performance

The fundamental goal of this paper was to demonstrate that, as a computational model using a more psychologically plausible architecture, the GOLD model could viably account for the relations between words using a graph constructed from the single mechanism of co-occurrences between words in discourse context. As such, the GOLD model performed very well (90% accuracy) on the simpler task of classifying words as related or unrelated. It performed well, but not as well (60%+ accuracy), on the more difficult task of determining whether the Chiarello et al. (1990) word pairs were similar, related, or both similar and related; however, this performance should be considered with respect to an LSA model that reached only 39% accuracy on this task. GOLD reached ~60%, ~50%, and ~70% on the three relationship categories considered individually, and when it erred, it tended to err on word pairs in the "both" category, which may reflect model error or may reflect greater strength of one or the other type of relationship. It was also much less likely to classify a word pair with only one relationship type (associated only or similar only) as the other relationship type; if it erred on these word pairs, it was much more likely to categorize them as "both". GOLD was able to predict human ratings of similarity and association with high accuracy as well (Pearson's r ranging from 0.7 to 0.8), again outperforming LSA's r = 0.58. The task of predicting brain activity was much harder for both GOLD and LSA, and even the human judgments performed poorly, as measured by r2. However, an analysis based on previous literature that predicted neural activity from language models indicated both that Spearman's correlation is more appropriate given the nature of neural activity, and that GOLD's performance was actually quite good in the context of prior findings.

One potential source of difficulty in predicting the ERP measure is that even fine-grained behavioral ratings of word pairs on the similarity and association axes were poor predictors. It may be the case that the influences of similarity and association combine in some nonlinear fashion to produce the n400 that is ultimately measured, or it may be the case that another variety of relationship entirely is also contributing variability to the ERP. Additionally, direct testing of the n400 did not show waveform magnitude differences based on the type of relationship of the words that produced it; if anything, the n400 magnitude appeared to reflect the total amount of relationship rather than any specific subtype.

The predictive power of the GOLD model, which was constructed from co-occurrence alone, indicates that the information used to judge relationships among words may be present in lexical co-occurrence alone, without considering additional language information such as word order. Furthermore, because GOLD was able to predict multiple, graded varieties of relationships between words (similarity and association), it is implied that information sufficient to represent both relationship types is present in lexical co-occurrence. This predictive success lends support to a single-mechanism model of word knowledge, and suggests that the method of calculating relationships, rather than the method of representing relationships, may be what differs between relationship types. This is consistent with theories that word meaning is constructed or retrieved on an ad-hoc basis (Kwantes, 2005; see Neely, 1991, for review), as multiple mechanisms of querying may reasonably be involved in that ad-hoc construction. Preliminary analysis of the neural network classifier using the GOLD metrics indicates that the combination of association and similarity metrics is a more powerful predictor than either type of metric alone, which lends additional support to this multiple-querying-mechanism account of word meaning. However, the data predicted in the present study were not reaction time data, as from priming studies, which may better distinguish between relationship types, as was done in Lund, Burgess, and Atchley (1995). As such, GOLD is agnostic as to which specific processes (such as automatic spreading of activation or post-lexical retrieval processes) its predictions are modeling or may be reflecting.

6.2 Word relationships

An alternative explanation for GOLD's misclassifications may reflect not an error in the model, but rather the fundamental difficulty of assigning words to different relationship types, which are non-orthogonal categories, as Chiarello and colleagues (1990) have done. In essence, the GOLD model, using a corpus of more natural language use and preserving that history in its connectivity patterns, may reveal that conceptually related words co-occur more frequently than assumed on the basis of free association norms.

It may also be the case that the very question of "how similar are these two words" is ill-posed to some degree. Consider hot and cold: these words are antonyms, but both are temperatures, and thus perhaps more similar than hot and rutabaga. Earthquake and tornado are wildly different concepts, but in a list of earthquake, tornado, and democracy, suddenly they are much more similar. In this vein, is it even meaningful to ask if two items are similar in isolation, or is a larger context necessary?
If the larger context is important, what is the brain actually doing with these word pairs in isolation? Clearly some sort of similarity judgment is possible, as an n400 response can be achieved in the case of minimal context, and furthermore, that n400 can be modulated by some manner of relationship between the prime and target words.

6.3 Benefits of computational models

As was discussed in chapter 1, it has been argued that computational models are merely tools, from which nothing of substantive value can be learned. The GOLD model and its performance in the present study are intended as an argument to the contrary: as a model of language, rather than a tool, GOLD produced evidence that supports specific theoretical accounts of language acquisition, word meaning, and the reflection of language in neural activity. However, it is undeniable that computational models provide a major advantage in their capacity as tools, namely that computational models are not people and thus are free of human foibles (model construction, of course, may be fraught with foible, but that is beyond the scope of the present study). The model does not participate in the study inebriated, does not grow fatigued or fall asleep, does not ignore task instructions, and its performance does not change over time, all of which are problems that plague human subjects research. The ultimate effects of these foibles on research data fall into the categories of consistency and following task instructions (much akin to the duality of accuracy and precision). For an example of both, during an informal post-hoc interview in Experiment 1a, one participant described that he "drifted into" judging a different aspect of word meaning partway through the twenty-minute rating task; he had rated association for the first ten minutes, and then similarity for the last ten minutes. He was not consistent across word pairs in the session and was not following task instructions during the second half of the task. Other subjects encountered difficulties in following instructions, particularly in the semantic similarity judgment tasks, in which certain participants initially judged all word pairs as minimally similar because any two words in a pair "[were not] the same words". Certain studies have quantified within-subject variability on tasks of language judgment (e.g. Barsalou, 1987), and consistency varies widely; to the author's knowledge, no formal study has been conducted of participant noncompliance in language tasks of this nature. However, it is common practice in behavioral research to include questions whose answers are trivially easy (e.g. "Please fill box A on the response form for this question"), in order to check whether participants are actually engaging with the task or following task instructions. In contrast to these problems, computational models perform with both accuracy and precision consistently and in a trivially replicable manner.

6.4 Graphs as models of language

Graphs are a valuable tool in psycholinguistics research, both in service of analysis and of understanding. As a boon to analysis, graphs do not require discarding vast tracts of data in the process of dimensionality reduction, and so the model may maintain a higher degree of complexity that preserves additional information about relationships between words as well as overall statistical regularities that reflect the model's "experience" with language (see Steyvers & Tenenbaum, 2005).
Analysis of a graph model of language rests on the centuries-old field of graph theory for a solid mathematical foundation and a broad array of analytical algorithms, which allow for assessment of structural as well as functional properties. These algorithms may be useful methods of modeling larger contexts in psychologically meaningful manners, through existing methods of modeling network propagation, and so on. In terms of aiding understanding, graphs may allow for more intuitive interpretation of calculations and results than methods that require complex transformations of the data (e.g. SVD, Landauer, or circular convolution, Jones & Mewhort, 2007).

However, these benefits, particularly the retained information, are accompanied by a major drawback: computational complexity. Analyzing graphs, particularly very large graphs such as one might encounter in a language model, is computationally expensive. The patterns that may prove most interesting are also very complex; for example, identifying subgraph isomorphisms, one potential method of discovering useful patterns for word sense disambiguation or identifying word relationships, is in O(|Vgraph|^|Vsubgraph|). Even performed in parallel, these operations quickly become intractable on standard hardware. Other types of graph theory algorithms may be valuable for identifying language features or word attributes, such as social network analysis for identifying "bridge nodes" that may be homographs, or clique analysis that may be able to cluster register, connotative/emotional content (Osgood, 1957), or feature similarities (McRae, De Sa, & Seidenberg, 1999; Plaut, 1995). These analyses are much more complex than something like LSA, and take exponentially more time to execute. The solutions to this complexity problem vary: recruiting massively parallel cloud computing resources, using only well-optimized algorithms and data representations (Sun, Wang, Wang, Shao, & Li, 2012), reducing the graph size, or simply choosing analyses that can avoid the brute-force approach.

One issue in graphs of word co-occurrence is that their high degree of interconnection makes many standard graph algorithms less useful, such as spanning trees and various measures of separation (e.g. Dijkstra, 1959). These algorithms are of course applicable, but may vary in their informativeness because the high degree of interconnectivity in a word-word graph means that any word is typically very few steps away from any other word. In a graph like this, the weights of connections are more important than the presence of connections, so analyses must focus on algorithms that take weight into account, algorithms that consider larger patterns of weighted connectivity, or methods of graph pruning such that the presence of connections becomes informative, perhaps by pruning low-weight connections or limiting words to some arbitrary number of connections.

It may also be valuable to maintain more information during the graph construction process. In the present large GOLD model, each connection is weighted with weight = 1, regardless of the actual distance between words. It may be useful instead to record connection counts at several distances, e.g. grumpy and cat co-occur immediately adjacent n0 times, separated by one word n1 times, separated by two words n2 times, and so on; a sketch of this kind of construction follows.
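The following is a minimal Python sketch of one way such distance-binned counts could be recorded during construction. It is illustrative only; the window size, tokenization, and undirected keying are assumptions rather than the construction actually used for GOLD.

```python
# Illustrative sketch (not GOLD's actual construction): record co-occurrence
# counts separately for each separation distance (0 = adjacent, 1 = one
# intervening word, ...). Window size and tokenization are arbitrary choices.
from collections import defaultdict

def distance_binned_cooccurrence(tokens, max_distance=2):
    """Return counts[(word1, word2)][d] = number of co-occurrences at distance d."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, w1 in enumerate(tokens):
        for d in range(0, max_distance + 1):
            j = i + d + 1                     # position of the second word
            if j >= len(tokens):
                break
            w2 = tokens[j]
            pair = tuple(sorted((w1, w2)))    # undirected: order-insensitive key
            counts[pair][d] += 1
    return counts

text = "the grumpy cat sat on the grumpy old cat".split()
counts = distance_binned_cooccurrence(text)
print(dict(counts[("cat", "grumpy")]))        # {0: 1, 1: 1} for this toy input
```

A directed variant (keeping the original word order in the key instead of sorting it) would preserve the asymmetries discussed next, such as bread-butter versus butter-bread.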
Maintaining word order information (perhaps through directional connections) may also improve prediction of human behavior, because, for example, bread-butter has a higher free association probability than butter-bread.

Lastly, as with all models of language, vagaries of the corpus can influence model performance. The corpus from which the GOLD model in the present study was constructed may display a greater influence of conversational speech than, say, textbook-based corpora, as well as unorthodox grammatical structures and word usage. It also has a rather larger vocabulary of obscenities than a corpus constructed from the New York Times might, and spans different topics than standard language corpora (e.g. TASA; see Landauer et al., 1998). The aim for this corpus was that it span a large range of unadulterated modern language use, again to provide more ecological validity with respect to the behavioral data to which the GOLD model may be applied.

6.5 Individual differences

Individual variability in language experience (explored in the author's prior projects: Bolger & Jackson, under review; Jackson & Bolger, in preparation) leads to dramatic differences in word knowledge and thus in the neural response to words in context. In the case of paired priming paradigms, the context is minimal: one preceding word. Clearly, this minimal context is sufficient to bias the neural response, as the n400 effect may be reliably elicited in these paradigms. However, due to its brevity and low information density, this context may be less effective at preventing unrelated or idiosyncratic semantic activation than a sentence or larger preceding context might. For example, the pair grumpy-cat would elicit a small n400 from the author, who has encountered the feline referred to as Grumpy Cat (see www.grumpycats.com for details) in digital form on many occasions, but a large n400 from someone who is unfamiliar with this animal. However, if the context were larger and contained more information and thus more constraint, such as "the mouse toy was chewed up by the huge, orange, grumpy cat", it may be the case that these two individuals' n400 responses to cat would be closer in magnitude.

The rating tasks in Experiment 1a provided a clear example of individual differences influencing word knowledge. The author presented question-query as an example of words that might be rated as highly similar; however, easily half of the participants rated this pair very low in similarity, because they had never encountered (or could not recall a meaning of) the word query. Incidentally, this is why participants with extensive vocabularies and high reading skill were selected to contribute the ERP data; the model should be predicting English in as complete or objectively accurate a form as possible, rather than being limited to modeling the smaller subset of language that is known to lower-skill readers.

6.6 Future research

6.6.1 Language

The present study supports a single-mechanism account of the acquisition of these word relationships, but does not rule out an account in which acquisition is via a single mechanism, but later calculation or determination of the relationships (at the time of judgment) occurs via multiple mechanisms.
This question may be approached by examining the predictive elements of the model: are the features required for predicting association different from the features required for predicting similarity, and do these features reflect theoretical conceptions of association and similarity? Can the model predict other types of quantifications of word relationships, such as reaction time data, finer-grained ratings of word relationships, or neural activity in response to sets of words? Do sets of words constrain meaning and/or concept activation better than individual word primes?

6.6.2 GOLD

The present study explored whether the GOLD model could distinguish between similarity and associativity in word relationships. Future work should investigate whether GOLD can differentiate words along other axes and relationship types, such as antonyms/synonyms, multiple word senses, register, affective content, and so on. These investigations should be supported by the extraction of more complex measures from the graph, particularly those examining larger connectivity patterns. The present study was exploratory, and so was limited to an undirected, smaller graph and simpler, local algorithms. However, the full power of a graph model may lie in its higher-order, more complex patterned relationships, so these should be evaluated.

Preliminary exploration of the ML algorithms used to predict neural activity and behavioral data from GOLD does not make it obvious what is driving their obtained accuracy. It is not clear whether the theoretically association-based metrics (direct links between words) or the theoretically similarity-based metrics (overlap and non-overlap between words' neighbors) are more informative, or whether the metrics are equally informative and the manner of weight normalization is more important. However, it is clear that the combination of several features is more predictive than each feature alone. Further investigating what this may imply for human language processing will require a tightly controlled stimulus set that spans many axes of the language space.

A crucial element of future work will be the identification of optimal methods of prediction from the model. The present study used many features and machine learners to learn patterns that may be predictive; other studies have used such methods as scaling by arbitrary units (Lund & Burgess, 1996), and assessing predictive ability based on Spearman correlations (such as on dissimilarity matrix entries in Carlson, Simmons, Kriegeskorte, & Slevc, 2014, and on other types of data as in Collins-Thompson & Callan, 2007, and Gabrilovich & Markovitch, 2007, to name two of countless studies). It may also be the case that larger contexts, such as those already used in judgments of document similarity, are necessary for more meaningful judgments of similarity. Future research with the GOLD model should address the development of metrics from GOLD that can be expanded to arbitrary-length inputs, which may enable greater predictive power as well as more accurate modeling of psychological reality.

6.6.3 Individual differences

It is undeniable that individual differences contribute to neural responses to language. Future work may examine these individual differences by comparing neural activity in high-skill readers to that in low-skill readers, particularly if the stimuli also vary along several dimensions of difficulty.
The word stimuli used in the present study were fairly high frequency, but it is not clear whether higher-order interactions with words that are involved through spreading activation or other processes, or other additional information derived from greater experience with language, may have an effect on the measured waveforms.

6.6.4 ERP

One of the major goals of the present study was to predict brain responses in a language task. The present study used a very simplistic approach to quantifying these brain responses: average voltage in a specific time window of ERP at a single electrode. Unfortunately, this approach discards a tremendous amount of data that may be very relevant in terms of differentiating word characteristics or cognitive processes (e.g. Halgren et al., 2002; Sereno, Brewer, & O'Donnell, 2003; Thornhill & Van Petten, 2012). A different method of encoding the total spatiotemporal pattern of the brain response may be valuable to capitalize on the additional information present in such patterns. Future work may also examine prediction in the other direction: predicting characteristics of words from ERPs. Using ERPs as predictors may better enable use of the entire spatiotemporal pattern of voltage, rather than collapsing such a complex pattern into a single value as in the present study. Koivisto and Revonsuo (2001) found that dividing the n400 window into early (250-375ms) and late (375-500ms) portions allowed for the discovery of different spatial and temporal patterns of effects for lexically associated as opposed to semantically similar word pairs; future work may follow this paper and attempt to predict differential activation in different time windows and electrode locations.
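For concreteness, the single-value quantification criticized above amounts to the following kind of computation. This is a minimal NumPy sketch rather than the actual processing pipeline, and the array layout, sampling parameters, and channel index are assumptions.

```python
# Minimal sketch (not the actual pipeline): reduce an epoched ERP recording to a
# single value, the mean voltage in the n400 window at one electrode.
# Assumed layout: epochs has shape (n_trials, n_channels, n_samples).
import numpy as np

def n400_mean_amplitude(epochs, times, channel_index, window=(0.300, 0.500)):
    """Mean voltage over trials and over the samples inside the time window."""
    mask = (times >= window[0]) & (times <= window[1])
    return epochs[:, channel_index, :][:, mask].mean()

# Example with fake data: 40 trials, 32 channels, 1 s epoch sampled at 500 Hz.
times = np.linspace(0.0, 1.0, 500)
epochs = np.random.randn(40, 32, 500)
pz = 30                                  # hypothetical index of electrode Pz
print(n400_mean_amplitude(epochs, times, pz))
```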
6.6.5 Extensions

In the interest of maintaining a sensible scope for the present project, the following applications were not explored. However, these applications have clear relevance to the reading and language literature, the cognitive literature, and other work in the Bolger lab. This section identifies and briefly discusses several potential applications that GOLD, ERP data, or behavioral data might address.

6.6.5.1 Context variability

The context variability hypothesis (Bolger et al., 2008) may be tested by replicating the contextual word learning paradigm (Jackson & Bolger, in prep) using GOLD as the "participant". The model could be "taught" novel words in the same way that human participants were taught: exposure to the novel words embedded in sentence contexts. Model performance on this task may be compared to the human data from Jackson & Bolger (in prep), which include multiple-choice sentence completion, congruent/incongruent sentence judgments (including ERPs to this task), and participant-produced definitions.

6.6.5.2 Semantic distance in fMRI

Previous research in fMRI has found relationships between the semantic distance of language input and activity in left IFG, bilateral MFG, and anterior temporal regions in a lexical decision priming task (Tivarus, Ibinson, Hillier, Schmalbrock, & Beversdorf, 2006), and in left frontopolar cortex in an analogy judgment task (Green, Kraemer, Fugelsang, Gray, & Dunbar, 2010). GOLD could attempt to predict activation from these studies.

6.6.5.3 Word sense disambiguation

Words can be ambiguous in different ways: polysemy refers to multiple related meanings (a boot on a foot and to give something the boot), while homonymy refers to multiple unrelated meanings (the boot on a foot and the boot of a car). Previous research has used various approaches, including clustering (Levin, Sharifi, & Ball, 2006; Lin & Pantel, 2002; Widdows & Dorow, 2002), an information-based approach (Durda, Caron, & Buchanan, 2010), a second-order cluster approach (Schutze, 1998), Wikipedia-based methods (Gabrilovich & Markovitch, 2007; Li et al., 2011) that use additional information in a query (e.g. river bank vs. bank loan), and hybrid methods that use both distributional data and human-annotated knowledgebases (Jiang & Conrath, 1997; Marton, Mohammad, & Resnik, 2009). GOLD may be able to disambiguate word senses based on the patterns of connectivity of the different senses. Bridge analysis, used in certain social network analyses (Butts, 2008) and in epidemiological modeling (Luke & Harris, 2007), aims to identify nodes that participate in otherwise disparate sub-networks of nodes (nodes that act as "bridges" between groups). It may be the case that homonymous words are bridge nodes. For example, the word ball should be heavily interconnected with a group of nodes including bat, throw, pitch, baseball, and football, which should all be heavily interconnected; ball should also be interconnected with a group that includes gown, dance, gala, and invitation, all of which should be heavily interconnected, but none of which should be particularly heavily connected to the sport-related group. This type of analysis may also be helpful in identifying where information was lost in the parsing process; for example, all input is forced to lowercase before being weighted, and accordingly the difference between US and us is not detected in the first-order structure of the graph. If bridge analysis identifies "us" as participating in two largely disparate clusters, one centering around groups and the other centering around foreign policy and military exercises in the Middle East, then GOLD may be able to distinguish between these two words. A sketch of this kind of bridge analysis appears below.
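The following is a minimal sketch, using the ball example above, of how candidate bridge words could be flagged as high-betweenness nodes in a small co-occurrence graph. It uses networkx purely for illustration and is not part of the GOLD implementation; the toy clusters are those named in the text.

```python
# Illustrative sketch (not part of GOLD): flag candidate "bridge" words as nodes
# with high betweenness centrality in a small toy co-occurrence graph.
import networkx as nx
from itertools import combinations

sport = ["ball", "bat", "throw", "pitch", "baseball", "football"]
formal = ["ball", "gown", "dance", "gala", "invitation"]

G = nx.Graph()
for cluster in (sport, formal):
    # Densely interconnect each cluster, as described in the text.
    G.add_edges_from(combinations(cluster, 2))

centrality = nx.betweenness_centrality(G)
candidates = sorted(centrality, key=centrality.get, reverse=True)[:3]
print(candidates)   # "ball" ranks first, since it joins the two clusters
```

On a full weighted word-word graph, a weight-aware centrality or community detection step would be needed, for the reasons discussed in section 6.4.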
6.6.5.4 Synonymy

Distributional models generally perform well on tests of synonymy (Turney, 2001), and some methods have improved performance by specifically training on a thesaurus-based corpus (Jarmasz, 2003). Measures that preserve more dimensions are better at judging subtle differences between synonyms ("near-synonyms"), because less distinguishing information is discarded (Wang & Hirst, 2010). GOLD would not discard any data, and thus would be expected to perform well on a near-synonym judgment task (Inkpen, 2007; Turney, 2001), and may also be compared to human similarity judgments as in Budanitsky & Hirst (2005). Theoretically, words with similar meanings should be connected in similar ways to other nodes. Standard cluster analysis (Hartuv & Shamir, 1999; Schaeffer, 2007) may be able to identify groups of words with similar meanings. The "central" node (which measure of centrality would be appropriate here is an open question, but perhaps word frequency would be effective) would be the "label" of that group. This could simplify further computations (by reducing many nodes to a single "supernode"), or be useful for generative queries ("generate synonyms of tired").

6.6.5.5 Other

The model may be applicable to a variety of other standard tasks, including authorship attribution, Cloze tasks, assessing metaphors, judging definitions, and so on. The model is further flexible in its parameters: by propagating activation through the network and manipulating parameters like falloff time and propagation rate, it may mimic parameters of human memory like WM span and speed of processing. Further work may address even more pie-in-the-sky hypotheses: can the model suggest meanings for slang? Can it make rudimentary jokes, perhaps by completing an input sequence with a low-probability word?

6.7 Conclusion

The present study constructed GOLD, a graph model of language, from lexical co-occurrence, and used novel, theoretically informed similarity metrics from GOLD to predict relationships among words, types of relationships among words, and neural activity elicited by reading words with particular relationships. The GOLD model is capable of distinguishing among types of relationships between words, predicting graded relationships between words, and predicting brain activity in response to words with varying relationships, using metrics constructed from theoretically informed conceptualizations of association and similarity. These novel algorithms are theoretically informed in a straightforward manner: they consider how connections to associates that are common to both words and associates that are unique to each word differentially contribute to meaning. This type of calculation is more transparent in its reflection of the co-occurrence patterns of language that were used to construct the model than algorithms involving more complex transformations, and, because it does not rely on spatial relationships of word representations in a particular language space (e.g. the cosine between two word vectors), it may be better able to account for psycholinguistic properties that would not be reflected in orthogonal relationships in a vector space model.

Appendix A. GOLD metrics

Five methods were used to calculate similarity, all considering overlapping nodes and nonoverlapping nodes separately. It is theorized that a similar pattern of connectivity to overlapping nodes will arise when the word pair is more similar, but if their connections to nonoverlapping nodes are much greater, then the similarity in overlap may not contribute as much to the overall judgment of the word pair. Accordingly, the following metrics involve various ways of summing weights to the overlapping nodes and summing weights to the nonoverlapping nodes, and comparing the two sums.

Method 1: Overlap and nonoverlap sets. The weights to each set are summed, where |Vo| is the number of nodes in the overlap set, |Vn| is the number of nodes in the nonoverlap set, and w1,i is the weight between word 1 and node i. However, any additive or subtractive combination of these sums could be arbitrarily high. It would be ideal if the metric mapped to a finite range for easy comparisons (as LSA's output ranges from -1 to 1). One approach is to compare the proportions of the total weights that are accounted for by weights to the overlap and the nonoverlap sets. The difference between these proportions maps from -1 (in the case where 100% of weights are connected to nonoverlap nodes) to 1 (in the case where 100% of weights are connected to overlap nodes).

Method 2: Overlap and nonoverlap sets, normalized by size. Method 2 is calculated as Method 1, except that the overlap and nonoverlap sums are normalized by the relative sizes of their sets (|Vo| and |Vn|). The final similarity metric is calculated as in Method 1, as the difference of the proportions to the overlap and nonoverlap sets.
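The following is a minimal Python sketch of a Method 1 style calculation, under the assumption that the overlap and nonoverlap sums pool the weights from both words; the toy neighbor dictionaries are illustrative and are not GOLD's actual data structures.

```python
# Sketch of a Method 1 style metric (assumption: sums pool weights from both
# words): difference between the proportion of total weight on overlapping
# neighbors and the proportion on nonoverlapping neighbors.
def method1_similarity(neighbors1, neighbors2):
    overlap = neighbors1.keys() & neighbors2.keys()
    total = sum(neighbors1.values()) + sum(neighbors2.values())
    overlap_sum = sum(neighbors1[n] + neighbors2[n] for n in overlap)
    nonoverlap_sum = total - overlap_sum
    # Difference of proportions: -1 when no weight falls on shared neighbors,
    # 1 when all weight falls on shared neighbors.
    return (overlap_sum - nonoverlap_sum) / total

# Toy neighbor weights, loosely following the grumpy/cat example used later.
grumpy = {"face": 9, "depressed": 2, "old": 4}
cat = {"face": 52, "depressed": 3, "purr": 7}
print(method1_similarity(grumpy, cat))   # about 0.71 for these toy values
```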
Method 3: Overlap and nonoverlap sets, overlap set scaled by magnitude difference. For the remaining methods, the sum of weights to the overlap set is transformed so that the combined weight to each overlapping node is scaled by the ratio of the smaller to the larger of the two weights: transformed weight = (w1,i + w2,i) × min(w1,i, w2,i) / max(w1,i, w2,i). This has the effect of scaling the two weights by how close they are in magnitude, such that weights that have a smaller magnitude difference will contribute more of their weight to the final total. In the example in Figure 3, grumpy-face has a weight of 9 while cat-face has a weight of 52; their combined transformed weight would be 10.56 (18% of the original combined weights). In contrast, grumpy-depressed has a weight of 2 while cat-depressed has a weight of 3; their combined transformed weight would be 3.33 (66% of the original combined weights). In Method 3, weights to the overlap nodes are calculated as above, and the final similarity metric is calculated as in Method 1 (no additional normalization).

Method 4: Overlap and nonoverlap sets, overlap set scaled by magnitude difference, both sets normalized by size. In Method 4, weights to the overlap nodes are calculated as above and then normalized by size as in Method 2. The final similarity metric is calculated as in Method 1.

Method 5: Overlap set only, scaled by magnitude difference, normalized by size. In Method 5, only the overlap set is considered, and its weights are calculated as in Method 3 and normalized as in Method 2. Because the nonoverlap set is ignored, no proportions are calculated. This metric does not map from -1 to 1.

Table 8. Weight normalization methods.
Raw weights
Pointwise mutual information (PMI)
Sum of IDFs
Product of IDFs
Sum of document frequencies
Product of document frequencies
Inverse of sum of IDFs
Inverse of product of IDFs
Inverse of sum of document frequencies
Inverse of product of document frequencies
Sum of frequencies
Sum of frequencies multiplied by log sum of frequencies
Product of frequencies multiplied by log product of frequencies
Sum of frequencies divided by log sum of frequencies
Product of frequencies divided by log product of frequencies

Appendix B. ERP participant assessment results

Table 9. ERP participant assessment results.
Assessment                                 Mean     SD
Nelson-Denny Comprehension (raw score)     70.11    5.23
Nelson-Denny reading rate (raw score)      298.47   94.35
PPVT (standard score)                      119.74   10.56
TOWRE sight word (standard score)          103.53   9.63
TOWRE phonetic decoding (standard score)   101.37   9.90

Appendix C. ERP prediction performance

Table 10. Correlations between models and predictions, 20 iterations of 70/30 train/test.
Iteration Spearman Pearson SVM- smGOLD SVM-LSA LSA SVM- smGOLD SVM- LSA LSA 1 0.314 -0.044 0.069 0.304 -0.016 0.152 2 0.349 0.182 0.188 0.326 0.187 0.177 3 0.235 0.044 0.044 0.233 0.006 -0.001 4 0.335 0.054 0.078 0.323 0.069 0.118 5 0.246 0.007 0.007 0.218 -0.013 0.030 6 0.267 0.013 0.088 0.226 0.044 0.125 7 0.219 0.063 0.063 0.208 0.062 0.051 8 0.265 -0.020 0.115 0.242 -0.038 0.116 9 0.250 0.095 0.095 0.205 -0.026 0.036 10 0.192 0.106 0.106 0.147 0.013 0.039 11 0.150 0.079 0.079 0.140 0.038 0.045 12 0.233 0.154 0.154 0.223 0.117 0.108 13 0.200 -0.030 0.008 0.170 -0.052 0.016 14 0.238 0.054 0.054 0.215 0.084 0.082 15 0.129 0.094 0.094 0.133 0.056 0.056 16 0.357 0.092 0.092 0.300 0.022 0.011 17 0.175 -0.010 -0.010 0.195 -0.060 -0.080 18 0.009 0.026 0.027 -0.030 0.024 0.029 19 0.129 0.087 0.087 0.144 0.091 0.090 20 0.264 0.086 0.086 0.229 0.031 0.027 Min 0.009 -0.044 -0.010 -0.030 -0.060 -0.080 Max 0.357 0.182 0.188 0.326 0.187 0.177 Mean 0.228 0.057 0.076 0.208 0.032 0.061 SD 0.084 0.060 0.048 0.081 0.061 0.060 93 Appendix D. Stimuli for ratings and ERP study Table 11. Stimuli and stimuli parameters for ratings and ERP. Word1 Word2 Category Sim Rating Assoc Rating LSA 30 LSA 300 Word1 freq Word2 freq accuracy case random 1.45 2.70 0.80 0.02 1288 35962 actress bandage random 1.16 1.21 -0.26 0.00 609 127 adultery putty random 1.39 1.09 -0.45 0.11 249 112 alpaca cap random 1.42 1.59 -0.25 0.06 151 2818 apple grape Chiarello - similar 5.32 5.67 0.14 0.10 17029 482 army navy Chiarello - both 5.52 6.56 0.76 0.68 6615 1615 artist paint Chiarello - associated 4.48 6.65 0.64 0.30 4579 3980 assumption rant random 2.03 2.24 0.36 0.04 2729 1943 assure addition random 1.39 1.21 0.36 0.22 1210 3252 asylum madhouse Miller-Charles 5.97 5.94 0.07 0.03 471 19 atheism pouch random 1.00 1.12 -0.41 -0.06 6700 165 attractiveness chili random 1.26 1.15 -0.30 -0.06 296 688 authority regime random 4.71 4.53 0.82 0.24 3118 1116 background usage random 1.55 1.85 0.71 0.12 6642 2350 ball bat Chiarello - both 3.97 6.32 0.81 0.33 7764 1919 banana peach Chiarello - similar 5.10 5.32 0.48 0.18 1586 367 barrel council random 1.35 1.15 0.33 0.00 2251 1162 basin sink Chiarello - both 4.94 4.47 0.63 0.66 85 1890 battle director random 1.55 1.62 0.78 0.17 4396 2201 bear twist random 1.03 1.09 0.91 0.09 5815 3252 bedroom hypothesis random 1.06 1.12 -0.28 0.01 2048 1031 bee honey Chiarello - associated 4.45 6.88 0.51 0.35 799 2640 bias perception random 4.39 4.94 0.93 0.51 3531 1982 bigot internship random 1.16 1.26 -0.36 -0.08 429 504 birch elm Chiarello - similar 4.55 5.26 0.38 -0.16 76 114 bird eagle Thompson-Schill et al. 5.55 6.18 0.40 -0.03 2814 1001 blackmail protein random 1.00 1.03 -0.38 -0.03 275 2669 blanket waste random 1.19 1.18 0.17 0.01 1545 6528 bloat housemate random 1.03 1.09 -0.40 -0.07 185 139 blouse skirt Chiarello - both 4.94 5.59 0.72 0.34 60 557 book page Chiarello - associated 4.94 6.45 0.78 0.12 26642 15032 boy clue random 1.29 1.74 0.92 -0.01 9003 2680 brand pose random 1.81 1.82 0.42 -0.05 5895 1005 brandy wine Chiarello - both 5.33 5.74 0.51 0.20 83 3147 94 Word1 Word2 Category Sim Rating Assoc Rating LSA 30 LSA 300 Word1 freq Word2 freq brass iron Chiarello - similar 5.23 5.06 0.78 0.14 663 3496 brick privacy random 2.03 2.03 0.18 0.02 1410 2754 bruise stereotype random 1.29 1.76 -0.33 0.00 223 1628 brush comb Thompson-Schill et al. 
5.74 6.62 0.46 0.20 1541 207 building punishment random 1.26 1.29 0.24 -0.04 10218 2995 burlap felt Chiarello - similar 3.94 2.85 0.36 0.11 33 16590 bus mode random 1.87 2.26 0.22 0.02 5125 4201 butter session random 1.06 1.12 0.54 0.14 3434 1536 bye goodbye Other 6.71 6.71 0.58 0.34 649 906 bystander yeast random 1.06 1.18 -0.32 0.00 152 909 camel hump Chiarello - associated 4.10 6.15 0.39 0.01 429 308 canada steak random 1.13 1.45 0.30 0.01 11553 1935 candle flame Chiarello - associated 5.06 6.65 0.69 0.31 621 852 carbon efficiency random 2.06 4.12 0.81 0.60 1867 1411 carrot corn Chiarello - similar 5.10 5.26 0.49 0.42 438 1836 carry executive random 2.26 1.82 0.57 0.13 8404 1274 casserole gender random 1.00 1.18 -0.40 -0.14 143 6320 castle designer random 1.58 2.44 0.52 -0.01 1217 1442 chapter reason random 1.35 1.65 -0.04 -0.01 1199 47925 chip penny random 1.37 1.41 0.91 0.16 1783 1457 church theism Other 4.00 3.65 0.52 0.81 11313 317 circle cross Chiarello - similar 2.65 3.24 0.67 0.34 3800 5968 circus clown Chiarello - associated 4.45 6.65 0.57 0.24 464 856 clause burden random 1.87 1.47 0.81 0.06 1061 1710 closet vast random 1.71 2.50 -0.09 -0.03 1535 3981 cloth dress Chiarello - associated 5.10 5.39 0.60 0.18 614 3789 cloud output random 1.48 1.18 0.90 0.28 2633 1407 combination animation random 1.40 1.53 0.87 0.25 2555 1477 companion intuition random 1.35 1.82 -0.04 0.10 694 341 compassion brownie random 1.48 2.00 -0.20 -0.02 870 183 complexity porch random 1.16 1.12 -0.43 -0.04 1067 633 concept resource random 2.55 2.85 0.73 0.22 7432 1477 concert lunch random 1.35 1.79 0.82 0.07 1728 3519 congressman anime random 1.19 1.00 -0.27 -0.09 355 4113 consideration tradition random 1.87 1.74 0.85 0.09 1429 2023 constitution communism random 1.94 3.47 0.84 0.30 3467 1204 container victim random 1.32 1.47 -0.27 -0.04 1002 4470 content alternative random 1.58 1.65 0.93 0.26 11623 3944 95 Word1 Word2 Category Sim Rating Assoc Rating LSA 30 LSA 300 Word1 freq Word2 freq contrast comparison random 4.19 6.26 0.91 0.46 1557 4716 cooker commandment random 1.23 1.09 -0.42 -0.01 264 120 correlation coat random 1.00 1.06 -0.35 0.00 1620 1499 cotton silk Chiarello - similar 5.13 5.88 0.76 0.34 694 269 couch philosophy random 1.13 1.59 -0.42 -0.02 2423 5971 cradle baby Chiarello - associated 4.13 5.88 0.30 0.05 227 14248 crater moon Chiarello - associated 4.19 6.00 0.60 0.22 140 4108 creationism treadmill random 1.03 1.00 -0.28 -0.01 678 512 crop trigger random 1.39 1.15 0.69 0.21 1115 2563 cube scroll random 1.13 1.36 0.71 0.13 1409 1463 currency bolt random 1.61 1.44 0.16 0.06 2240 1095 custom actor random 1.58 1.88 0.10 -0.02 3001 2711 cut scissors Thompson-Schill et al. 
4.84 6.52 0.69 0.28 18614 485 decoy duck Chiarello - associated 2.19 1.97 0.43 0.04 120 2809 deer pony Chiarello - similar 4.32 4.09 0.45 0.00 2169 799 definition smell random 1.45 1.18 -0.18 -0.04 7387 5125 design sweetheart random 1.23 1.47 -0.26 -0.08 10014 248 desk stool Chiarello - similar 4.19 4.94 0.86 0.24 2955 226 devotion milk random 1.03 1.26 -0.13 -0.03 176 5325 diaper multiplier random 1.26 1.21 -0.27 0.08 444 180 dirt mud Chiarello - both 6.32 6.70 0.85 0.45 1829 844 disagreement tuna random 1.00 1.09 -0.12 0.00 593 689 disgusting gross Other 6.35 6.82 0.78 0.56 3814 3194 distinction liar random 1.52 2.24 0.02 0.11 1769 1492 divorce mother random 2.58 4.24 0.93 0.67 1741 15465 doom agent random 1.45 1.71 0.54 -0.02 1223 1810 dorm politics random 1.23 1.79 -0.30 -0.04 807 9173 dose furniture random 1.26 1.29 0.65 -0.05 1216 1052 downstairs jargon random 1.06 1.24 -0.41 0.01 497 206 drums piano Chiarello - similar 4.58 5.68 0.68 0.67 966 1245 ear foot Chiarello - similar 4.29 4.64 0.89 0.42 2783 5617 elephant paragraph Other 1.00 1.21 0.45 -0.02 1227 1614 empowerment spaghetti random 1.13 1.03 -0.25 -0.12 101 936 end mess random 1.45 1.65 0.94 0.09 47547 4462 enforcement net random 1.83 2.29 0.82 -0.01 1792 3717 engine car Chiarello - associated 4.61 6.32 0.38 0.20 4319 32872 entry score random 2.87 2.94 0.78 0.29 2101 3525 evidence bead random 1.26 1.12 -0.27 0.01 13829 109 96 Word1 Word2 Category Sim Rating Assoc Rating LSA 30 LSA 300 Word1 freq Word2 freq exam gravity random 1.32 1.82 0.19 0.02 1271 2130 faith shower random 1.19 1.35 -0.40 -0.05 6813 3466 farmer plow Chiarello - associated 3.81 5.88 0.21 -0.01 810 287 feature tablet random 1.58 2.33 0.93 0.67 4862 3988 fever obligation random 1.32 1.26 -0.34 0.00 432 1205 fiction manager random 1.13 1.21 0.11 -0.11 2755 4528 fitness vet random 1.77 2.18 -0.19 -0.03 1953 1605 flavor tribe random 1.29 1.38 -0.11 0.04 2038 562 flea ant Chiarello - similar 4.81 4.97 0.41 0.05 274 545 flew regret random 1.35 1.32 0.39 0.03 1121 2931 fork spoon Thompson-Schill et al. 
5.32 6.59 0.81 0.41 938 892 format dispatcher random 1.42 1.62 -0.45 -0.05 2238 108 fox horse Chiarello - similar 4.19 4.12 0.47 0.02 5528 4972 freedom beach random 2.23 2.94 -0.13 -0.04 7782 2330 frown smile Chiarello - both 3.65 6.00 0.41 0.51 178 3864 gallon jug Chiarello - associated 4.68 5.71 0.77 0.62 1146 181 garage piracy random 1.58 1.41 -0.31 -0.06 1764 1069 gas lemonade Other 1.16 1.32 0.65 0.17 8933 377 gaze turtle random 1.16 1.26 -0.01 0.08 219 957 gem jewel Miller-Charles 6.74 6.44 0.00 0.00 1504 117 gene world random 2.13 2.03 0.83 -0.02 1122 60125 ghost half random 1.13 1.62 0.79 0.08 2307 27181 grade libertarian random 1.23 1.76 -0.20 -0.04 7001 3432 grammar beauty random 1.03 1.82 0.68 0.13 3164 2385 grandson query random 1.29 1.50 -0.45 -0.04 220 249 graph grandma random 1.00 1.18 -0.16 -0.03 1231 2268 grave mileage random 1.48 1.21 -0.31 -0.07 1058 684 grocer store Chiarello - associated 4.13 5.94 0.73 0.53 65 16594 grumpy grouchy Other 6.55 6.53 0.56 0.34 754 34 guy capitalist random 2.06 1.97 -0.11 0.00 79747 1431 habit steam random 1.10 1.06 0.67 -0.01 1841 5414 hair fur Chiarello - similar 5.61 5.82 0.54 0.43 11644 884 happy carpet random 1.10 1.35 0.38 0.14 23716 1149 harbor boat Chiarello - associated 3.87 5.88 0.65 0.16 514 3734 hardware section random 1.77 3.03 0.65 -0.03 5085 4964 head leg Chiarello - similar 4.10 5.24 0.94 0.32 27709 4339 heckler revenue random 1.58 1.74 -0.29 -0.07 100 3390 hermit cave Chiarello - associated 3.19 4.03 0.55 0.23 146 1115 97 Word1 Word2 Category Sim Rating Assoc Rating LSA 30 LSA 300 Word1 freq Word2 freq hi hello Other 6.97 6.88 0.89 0.60 4112 2954 hockey ice Chiarello - associated 3.81 6.59 0.70 0.21 4010 7437 home valley random 1.90 2.35 0.53 -0.02 35632 1009 house lesson random 1.61 2.18 0.66 -0.04 29295 2608 hypocrisy balance random 1.29 1.68 0.45 -0.06 1042 5120 ideology razor random 1.03 1.24 0.09 -0.01 1726 1063 immigration snow random 1.03 1.29 0.09 0.03 1235 4216 incident destroy random 2.39 2.61 0.21 0.16 2824 3991 infection treat random 2.42 3.09 -0.14 0.22 1413 5969 insight blatant random 1.94 1.74 0.56 0.12 1702 1274 integer buddy random 1.26 1.15 -0.21 -0.04 202 4761 involve halfway random 1.65 1.62 -0.07 -0.02 1765 1667 jeep plane Chiarello - similar 3.81 4.09 0.79 0.29 523 3729 jelly jam Chiarello - both 6.32 6.68 0.74 0.02 1254 1376 jet budget random 1.19 2.15 0.51 0.24 1208 5256 justification eliminate random 1.65 1.97 0.72 0.22 1421 1315 justify summer random 1.13 1.03 -0.29 -0.19 3652 6621 key door Chiarello - associated 3.90 6.29 0.14 0.18 7588 12802 knock warrant random 2.10 2.85 0.19 0.08 2680 1313 law justice Thompson-Schill et al. 5.32 6.50 0.87 0.35 24055 4562 lawsuit meaningless random 1.45 2.06 0.45 0.02 1111 1801 lawyer nurse Chiarello - similar 3.29 3.79 0.43 0.10 3001 1813 layer liquid random 2.06 2.72 0.93 0.46 1677 2631 leap pen random 1.32 1.18 0.53 0.06 1025 1696 lee grown random 1.42 1.00 0.32 0.00 1420 3649 legalization toad random 1.00 1.03 -0.47 -0.04 1142 145 lemon pear Chiarello - similar 4.68 5.00 0.56 0.20 1034 151 lie sweet random 1.06 1.47 0.24 0.01 7123 8294 light lamp Thompson-Schill et al. 6.39 6.65 0.76 0.71 16912 724 lord tab random 1.23 1.03 0.08 0.04 3944 1586 lotion cream Chiarello - both 5.90 6.12 0.74 0.31 355 3650 machine villain random 1.45 1.76 0.22 0.06 8932 1292 mad anger Thompson-Schill et al. 
6.61 6.56 0.37 0.15 6534 2365 man woman Chiarello - both 4.65 6.79 0.37 0.08 71832 22936 management chart random 2.55 3.41 0.72 0.08 3810 1304 market carrier random 2.13 2.53 0.81 0.08 16947 1779 maximum manufacturer random 1.74 2.35 0.81 0.08 1620 1150 meal unfortunate random 1.03 1.35 0.03 -0.17 3198 1980 98 Word1 Word2 Category Sim Rating Assoc Rating LSA 30 LSA 300 Word1 freq Word2 freq medicine amount random 2.55 4.15 0.86 0.15 2674 20920 met texture random 1.10 1.15 -0.24 -0.08 10417 1060 miner coal Chiarello - associated 4.03 6.56 0.02 0.12 92 1386 minimum consumption random 1.55 2.91 0.81 0.30 5352 1710 minister aroma random 1.23 1.15 -0.04 0.01 1095 112 mint candy Chiarello - both 4.81 5.71 0.30 0.01 1129 3069 mistaken criticism random 2.90 2.88 0.82 0.29 1620 2724 modernism wrist random 1.29 1.15 -0.11 0.01 103 1105 mold bread Chiarello - associated 3.03 4.75 0.66 0.31 652 3838 mortgage shown random 1.32 1.47 0.34 -0.01 1245 4312 moth fly Chiarello - both 5.52 5.50 0.49 0.19 273 4892 mouse rat Chiarello - both 5.61 6.44 0.37 0.02 4177 1197 movement association random 2.77 2.38 0.94 0.33 5406 1225 mug beer Chiarello - associated 3.68 5.94 0.46 0.30 529 11410 name tortilla random 1.06 1.18 -0.21 -0.01 34714 191 nationalist cuddle random 1.03 1.00 -0.45 -0.04 284 452 needle thread Thompson-Schill et al. 4.06 6.85 0.04 -0.14 819 18459 needless force random 1.42 2.09 0.01 0.04 1403 14107 nickel dime Chiarello - both 5.74 6.41 0.55 0.24 462 740 nightmare tape random 1.00 1.38 0.84 0.08 1679 2901 onion tears Chiarello - associated 3.26 5.71 0.15 -0.01 1314 3040 opinion evening random 1.16 1.44 -0.30 -0.12 19305 1716 opportunity contest random 2.87 2.44 0.73 0.16 5308 1424 orb scum random 1.39 1.29 -0.03 0.00 165 1014 ounce pound Chiarello - both 4.84 6.24 0.74 0.47 623 1957 outrage deodorant random 1.23 1.26 0.02 -0.07 876 227 oxygen rating random 1.35 1.53 0.41 -0.04 1251 1219 paradox valentine random 1.26 1.24 -0.06 -0.04 816 365 patriarchy raccoon random 1.26 1.15 -0.48 -0.02 690 335 percentage summary random 2.39 2.38 0.62 0.22 3395 1001 persuasion seal random 1.23 1.44 -0.07 -0.02 164 1187 petty attitude random 3.19 3.76 0.90 0.37 1296 4693 phenomenon struggle random 1.61 1.85 0.69 0.21 1232 2252 pillow fort Other 2.57 4.62 0.46 0.14 920 603 platform default random 1.97 1.97 0.94 0.60 3735 4551 poll knife random 1.26 1.24 0.12 -0.06 1440 4387 pool translate random 1.13 1.26 0.08 -0.07 3752 1408 pork mentality random 1.16 1.12 0.26 -0.01 1037 2119 99 Word1 Word2 Category Sim Rating Assoc Rating LSA 30 LSA 300 Word1 freq Word2 freq prediction diner random 1.10 1.32 -0.21 0.00 807 221 pregnancy glad random 2.23 3.29 0.19 0.03 2269 13374 press pitch random 1.93 1.76 0.84 0.14 5675 2002 procreation maple random 1.26 1.06 0.04 -0.04 140 855 promote identity random 1.81 2.26 0.90 0.34 1881 2510 prude freezer random 1.32 1.00 -0.43 -0.05 137 904 python guilt random 1.06 1.21 -0.01 0.04 3110 1813 qualify stable random 1.90 2.12 0.61 0.15 1292 2495 rage farm random 1.48 1.24 0.56 -0.03 3124 2435 rake leaf Chiarello - associated 4.06 6.38 0.56 0.07 280 804 ram edge random 1.65 1.82 0.76 0.25 2294 3957 raw disagree random 1.29 1.29 0.23 -0.03 2606 10275 reassurance pencil random 1.03 1.15 -0.14 0.01 126 882 recommend unity random 1.35 1.91 0.76 0.22 7297 1509 recover sugar random 1.39 1.47 0.48 0.32 1081 3624 recovery quest random 2.45 2.21 0.63 0.12 1782 1524 reform apartment random 1.32 1.76 -0.15 -0.09 1389 3813 relativism boxer random 1.06 1.06 -0.20 -0.05 243 785 requirement battery 
random 2.00 2.61 0.55 -0.01 1537 5794 retirement task random 1.77 1.85 0.38 0.07 1669 1919 revolution unknown random 1.35 1.71 0.73 0.15 2402 1758 righteousness scan random 1.23 1.24 0.00 -0.15 190 1099 riot procedure random 1.90 1.62 0.54 0.03 1105 1531 rob require random 1.42 1.29 0.08 -0.09 1338 6441 robber thief Thompson-Schill et al. 6.26 6.85 0.84 0.19 238 857 rub stream random 1.42 1.29 0.10 0.03 1234 3835 rubber tire Chiarello - associated 4.39 6.26 0.80 0.34 1350 1317 rush stuck random 1.42 2.35 0.92 0.39 2755 7733 salad atheist random 1.00 1.32 -0.38 -0.02 1077 7137 scan controller random 2.19 2.50 0.70 0.05 1099 2523 scenario belief random 2.07 2.12 0.68 -0.03 3354 5689 school apocalypse random 1.16 1.68 0.25 -0.06 49862 1088 script eye random 1.84 1.94 0.46 -0.03 2292 10393 search engineer random 1.84 2.56 0.69 0.05 8026 3442 sector audio random 1.84 1.59 0.42 0.02 1683 2435 seem hung random 1.26 1.24 -0.21 -0.04 27491 1569 semi spin random 1.45 1.76 0.63 0.27 3153 1995 senate safe random 1.48 1.56 0.49 0.02 1367 10812 100 Word1 Word2 Category Sim Rating Assoc Rating LSA 30 LSA 300 Word1 freq Word2 freq send reflect random 2.03 1.91 0.16 0.02 9255 1441 sergeant variety random 1.16 1.18 -0.20 -0.01 323 2775 set role random 2.87 3.15 0.14 0.01 26099 6125 setup menu random 3.03 3.21 0.93 0.38 2495 2732 shark trout Chiarello - similar 4.19 4.47 0.29 -0.04 1240 205 sheep wool Chiarello - associated 4.68 6.21 0.04 0.19 1392 348 shell sea Chiarello - associated 4.48 6.68 0.79 0.21 2350 3317 shirt polo Other 5.35 6.00 0.60 0.36 5656 189 shoe sandal Other 5.50 5.82 0.44 0.18 1225 17 shoulder chest random 4.42 5.29 0.97 0.74 2584 3241 sickness health Thompson-Schill et al. 3.74 6.29 0.46 0.30 395 12793 skip jump Thompson-Schill et al. 5.19 5.88 0.70 0.25 2179 6878 smoke tobacco Thompson-Schill et al. 
4.81 6.85 0.30 0.42 5925 1165 snake mask random 1.23 1.38 0.92 0.26 1796 2011 socks shoes Other 4.94 6.65 0.82 0.63 1427 4032 sofa chair Chiarello - both 5.58 5.85 0.71 0.46 263 2633 sole compliment random 1.16 1.39 0.19 -0.05 1376 1224 somebody filter random 1.10 1.15 0.32 0.05 8207 1863 sort license random 1.16 1.38 0.25 0.03 22981 3775 sound union random 1.84 1.74 0.13 0.02 20130 6012 source emotion random 1.77 2.32 0.57 0.04 17256 1666 speech sin random 1.42 1.65 0.86 0.14 6926 2676 spider web Chiarello - associated 3.90 6.91 0.43 -0.01 2214 6093 spirit legacy random 3.35 2.76 0.91 0.06 2548 1080 stage prize random 2.55 3.56 0.66 0.12 4288 1492 star sky Chiarello - associated 4.84 6.50 0.63 0.36 10093 3427 station trail random 2.71 2.68 0.92 0.29 4406 1255 stem petal Chiarello - similar 4.39 5.85 0.00 0.01 1373 19 sticker monkey random 1.23 1.32 0.38 0.01 1166 1914 stigma pint random 1.16 1.09 -0.04 -0.10 906 446 stoop avocado random 1.00 1.03 -0.43 -0.10 162 261 stretch cast random 1.52 1.76 0.82 0.31 2278 3567 string rope Chiarello - both 5.48 6.26 0.65 0.18 2527 1169 sue society random 1.52 2.32 0.31 0.09 2198 15468 sunflower modesty random 1.17 1.35 -0.41 0.03 113 122 surgery equality random 1.61 1.21 -0.33 -0.06 3665 2515 symbol suggestion random 2.03 2.76 0.58 0.00 1421 1853 syntax broke random 1.55 1.94 -0.23 -0.04 1008 7083 101 Word1 Word2 Category Sim Rating Assoc Rating LSA 30 LSA 300 Word1 freq Word2 freq tack nail Chiarello - both 5.13 5.32 0.59 -0.04 247 1835 team immune random 1.45 1.47 0.47 -0.01 20020 1177 technology heart random 1.55 2.09 0.11 0.01 8367 10160 teeth camp random 1.23 1.12 0.60 0.10 4559 2980 text prose Other 4.29 4.31 0.63 0.30 8898 283 throw toss random 6.65 6.33 0.91 0.48 10469 1721 tiger lion Chiarello - both 5.65 6.18 0.80 0.46 1335 1720 till slide random 1.61 1.24 0.80 0.20 4137 2112 tired sleepy Other 6.74 6.88 0.73 0.43 5570 297 tooth react random 1.35 1.62 -0.16 0.21 1105 2237 tourist dare random 1.16 1.53 -0.11 0.05 801 2676 tub bath Thompson-Schill et al. 6.19 6.74 0.87 0.81 872 1268 tube truth random 1.06 1.15 -0.23 -0.08 1687 10100 tulip daisy Chiarello - similar 5.61 6.12 0.15 -0.10 81 221 tuner profession random 1.84 1.74 -0.01 -0.03 120 1083 twitter audience random 2.65 3.68 0.83 0.14 3628 4260 typo stranger random 1.16 1.21 0.18 0.06 1053 2179 tyranny pepper random 1.23 1.24 -0.34 -0.01 579 1781 uncle aunt Chiarello - both 5.32 6.44 0.56 0.91 3232 1607 unhappy jerk random 2.84 3.88 0.68 0.03 1024 3395 uniform weapon random 2.52 3.74 0.72 0.27 1214 4959 usher movie Chiarello - associated 2.32 3.32 0.28 0.20 122 33581 velvet linen Chiarello - similar 4.19 4.91 0.56 0.25 193 66 verify jury random 3.52 3.79 0.44 0.20 1134 1337 vermin pan random 1.39 1.09 -0.22 -0.02 110 2063 wallpaper daughter random 1.29 1.38 -0.15 0.00 1087 6182 wash cook random 3.35 4.68 0.73 0.38 2425 4626 wave ocean Chiarello - associated 5.23 6.68 0.77 0.30 2198 2318 way immature random 1.19 1.44 0.41 0.04 145795 1204 weird bud random 1.26 1.38 0.63 0.15 16343 1172 wife instrument random 1.26 1.35 0.03 -0.03 16363 1119 winter spring random 4.97 6.24 0.92 0.57 4403 2212 wolf dog Chiarello - both 5.42 5.38 0.48 0.77 1567 23133 word sentence Other 4.42 6.41 0.80 0.65 24159 5793 wrap tournament random 1.48 1.06 0.10 -0.08 1804 1862 zone gear random 1.48 1.94 0.83 0.15 2812 4726 102 Appendix E. Stimuli and stimuli parameters Table 12. Word pairs from Chiarello et al. 
(1990) Associated only Similar and associated Similar only alley cat ale beer apple grape apple tree arm leg arm nose artist paint army navy bacon steak bee honey ball bat banana peach bone dog basin sink bean onion book page blouse skirt bear cow button coat boot shoe birch elm camel hump brandy wine brass iron candle flame brush comb burlap felt cheese mouse butter bread car ship circus clown coat hat carrot corn cloth dress coffee tea circle cross cow milk cotton wool coat gown cradle baby dirt mud cotton silk crater moon doctor nurse dagger rifle crew ship dog cat deer pony crown king engine motor desk stool decoy duck figure shape drums piano engine car frown smile ear foot farmer plow inch foot flea ant fish water jacket coat floor wall flea dog jelly jam fox horse floor wood knife fork garlic mint gallon jug lizard snake gin wine grocer store lotion cream hair fur hammer nail man woman head leg harbor boat mint candy house cabin hermit cave moth fly jeep plane hockey ice mouse rat knife pot key door nickel dime lamp chair miner coal ounce pound lawyer nurse mold bread oven stove lemon pear mug beer pepper salt music art nest bird pot pan oak maple onion tears queen king orchid tulip pilot plane road path pan bowl rake leaf sea ocean pants hat rubber tire shirt tie roof door rug floor silver gold shark trout sheep wool sleet snow shoe glove 103 Associated only Similar and associated Similar only shell sea sofa chair steel brass spider web steel iron stem petal star sky string rope street path stove heat sword knife sugar salt train track tack nail table bed usher movie tiger lion train canoe waist belt uncle aunt tulip daisy wave ocean wolf dog velvet linen Table 13. Word pairs from Plaut and Booth (2000). Related Unrelated adult child admit learn agony pain ahead piece alarm clock alike post argue fight allow knee birth death alone death blade knife anger look blank empty angle tight blaze fire apart aunt bored tired arrow reef bride groom avoid talk brief short basic human bring take beast tree canoe boat begin open chain links bench tale chuck throw blind exit cigar smoke bound rain clean dirty burst yell close open cabin glue coach team cause south coral reef charm happy court judge check hotel crane lift cheek book creek river chest live cycle bike chief black death live china bird ditch hole clear music donor blood climb ghost enter exit cloth sharp fairy tale cloud watch fence post color year 104 Related Unrelated flame fire count bike flood water crack groom fresh fruit crash curse funny laugh crawl pain ghoul ghost cream fire glove hand crowd judge grain wheat curve move grasp hold dense fake grass green dream noise heavy light drill broom honey sweet drink dress house home early take joint knee equal treat knock door event green labor work extra call large small faith stop lemon lime favor fire loose tight final child major minor floor money maple tree found right march april front young mint candy frost bread month year giant smoke motel hotel glory decay north south going paper novel book guard knife paint brush guest steal paste glue habit plane phone call hurry laugh phony fake leave write piano play level door pilot plane lower short poker cards meter lion print write model turn quack duck moist throw queen king motor metal radio music nerve links razor sharp never work reach grab notes beach scent smell nurse path shame guilt party small share gives patch fruit sheet paper pearl duck shift gears pitch april shirt pants plain blood 105 Related Unrelated shore beach 
shout yell    prize sweet
skirt dress    proud bite
slice piece    pupil pants
smile happy    quick horse
snake bite    raise shoes
socks shoes    rapid fork
sound noise    ready light
spare tire    reply play
speak talk    rifle chair
spend money    rough lime
spoon fork    scale track
stall horse    score hold
stare look    screw clock
steel metal    shape home
still move    shine minor
stone rock    shock king
storm rain    shoot team
stuff things    sight hand
super great    solid brush
swear curse    split fight
sweep broom    stalk cards
table chair    stamp rock
teach learn    stand thing
thief steal    state great
tiger lion    steam candy
toast bread    stiff smell
tooth decay    store tire
touch feel    straw hole
trail path    swamp wheel
train track    swift guilt
trick treat    tense gear
truce peace    today water
twist turn    topic lift
wagon wheel    total peace
waves ocean    tower boat
white black    trunk tired
wings bird    unite dirty
wrist watch    usual river
wrong right    visit feel
youth young    voice give
width wheat    worse grab

References

Agirre, E., Alfonseca, E., Hall, K., Kravalova, J., Pasca, M., & Soroa, A. (2009). A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the ACL (pp. 19–27).
Audet, C., & Burgess, C. (1999). Using a high-dimensional memory model to evaluate the properties of abstract and concrete words. In Proceedings of the Cognitive Science Society (pp. 37–42). Mahwah, NJ: Lawrence Erlbaum Associates.
Barsalou, L. W., Santos, A., Simmons, W. K., & Wilson, C. D. (2008). Language and simulation in conceptual processing. In M. De Vega, A. M. Glenberg, & A. C. Graesser (Eds.), Symbols, embodiment, and meaning (pp. 245–283). Oxford: Oxford University Press.
Barsalou, L. W. (1987). The instability of graded structure: implications for the nature of concepts. In U. Neisser (Ed.), Concepts and conceptual development: Ecological and intellectual factors in categorization (pp. 101–140). Cambridge: Cambridge University Press.
Bastian, M., Heymann, S., & Jacomy, M. (2009). Gephi: An Open Source Software for Exploring and Manipulating Networks. In International AAAI Conference on Weblogs and Social Media.
Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3, 1137–1155.
Bird, S., Loper, E., & Klein, E. (2009). Natural Language Processing with Python. O'Reilly Media.
Blouw, P., & Eliasmith, C. (2003). A Neurally Plausible Encoding of Word Order Information into a Semantic Vector Space, 1905–1910.
Bolger, D. J., Balass, M., Landen, E., & Perfetti, C. (2008a). Context Variation and Definitions in Learning the Meanings of Words: An Instance-Based Learning Approach. Discourse Processes, 45(2), 122–159. doi:10.1080/01638530701792826
Bolger, D. J., Balass, M., Landen, E., & Perfetti, C. A. (2008b). Context Variation and Definitions in Learning the Meanings of Words: An Instance-Based Learning Approach. Discourse Processes, 45(2), 122–159. doi:10.1080/01638530701792826
Bolger, D. J., & Jackson, A. F. (n.d.). Acquiring Word Meaning: How are Contexts different from Definitions? Under review.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based Measures of Lexical Semantic Relatedness. Computational Linguistics, 32(1), 14–47.
Bullinaria, J. A., & Levy, J. P. (2007). Extracting semantic representations from word co-occurrence statistics: a computational study. Behavior Research Methods, 39(3), 510–526. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/17958162
Bullinaria, J. A., & Levy, J. P. (2012). Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behavior Research Methods, 44(3), 890–907. doi:10.3758/s13428-011-0183-8
Burgess, C. (2000). Theory and Operational Definitions in Computational Memory Models: A Response to Glenberg and Robertson. Journal of Memory and Language, 43(3), 402–408. doi:10.1006/jmla.2000.2715
Burgess, C., & Livesay, K. (1998). The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kucera and Francis. Behavior Research Methods, Instruments, & Computers, 30(2), 272–277.
Burgess, C., & Lund, K. (1997). Modelling Parsing Constraints with High-dimensional Context Space. Language and Cognitive Processes, 12(2), 177–210. doi:10.1080/016909697386844
Burgess, C., & Lund, K. (1998). The Dynamics of Meaning in Memory. In Dietrich & Markman (Eds.), Cognitive Dynamics: Conceptual Change in Humans and Machines (pp. 1–23).
Burrows, S., & Tahaghoghi, S. M. M. (2007). Source code authorship attribution using n-grams. In Proceedings of the 12th Australasian Document Computing Symposium (pp. 32–39).
Butts, C. T. (2008). Social network analysis: A methodological introduction. Asian Journal of Social Psychology, 11(1), 13–41. doi:10.1111/j.1467-839X.2007.00241.x
Carlson, T. A., Simmons, R. A., Kriegeskorte, N., & Slevc, L. R. (2014). The Emergence of Semantic Meaning in the Ventral Temporal Pathway. Journal of Cognitive Neuroscience, 1–12. doi:10.1162/jocn
Chelba, C., Bikel, D., Shugrina, M., Nguyen, P., & Kumar, S. (2012). Large Scale Language Modeling in Automatic Speech Recognition (pp. 1–6).
Cheng, B., & Titterington, D. M. (1994). Neural networks: A review from a statistical perspective. Statistical Science, 9(1), 2–54.
Chiarello, C., Burgess, C., & Richards, L. (1990). Semantic and Associative Priming in the Cerebral Hemispheres: Some Words Do, Some Words Don't... Sometimes, Some Places. Brain and Language, 38, 75–104.
Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82(6), 407–428.
Collins-Thompson, K., & Callan, J. (2007). Automatic and human scoring of word definition responses. In Proceedings of the NAACL-HLT 2007 Conference (pp. 476–483). Rochester.
Daalen-Kapteijns, M. van, & Elshout-Mohr, M. (2001). Deriving the Meaning of Unknown Words From Multiple Contexts, (March), 145–181.
Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., … Ng, A. Y. (2012). Large Scale Distributed Deep Networks. In Neural Information Processing Systems (pp. 1–11).
Demsar, J., Curk, T., Erjavec, A., Gorup, C., Hocevar, T., Milutinovic, M., … Zupan, B. (2013). Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14, 2349–2353.
De Deyne, S., & Storms, G. (2008). Word associations: Network and semantic properties. Behavior Research Methods, 40(1), 213–231. doi:10.3758/BRM.
Dijkstra, E. W. (1959). A Note on Two Problems in Connexion with Graphs. Numerische Mathematik, 1, 269–271.
Dorogovtsev, S. N., & Mendes, J. F. (2001). Language as an evolving word web. Proceedings of the Royal Society B: Biological Sciences, 268(1485), 2603–2606. doi:10.1098/rspb.2001.1824
Dumais, S., Banko, M., Brill, E., Lin, J., & Ng, A. (2002). Web Question Answering: Is More Always Better? In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Durda, K., Caron, R., & Buchanan, L. (2010). An Application of Operational Research to Computational Linguistics: Word Ambiguity. Information Systems and Operational Research, 48, 1–21.
Eck, N. J. van, & Waltman, L. (2009). How to Normalize Co-Occurrence Data? An Analysis of Some Well-Known Similarity Measures. Report Series.
Eifrem, E. (2009). Neo4j: The Benefits of Graph Databases. In QCon San Francisco.
Federmeier, K. D., & Kutas, M. (1999). A Rose by Any Other Name: Long-Term Memory Structure and Sentence Processing. Journal of Memory and Language, 41(4), 469–495. doi:10.1006/jmla.1999.2660
Finn, P. J. (1977). Word frequency, information theory, and cloze performance: A transfer feature theory of processing in reading. Reading Research Quarterly, 13(4), 508–537.
Frishkoff, G. A., Collins-Thompson, K., Perfetti, C. A., & Callan, J. (2008). Measuring incremental changes in word knowledge: experimental validation and implications for learning and assessment. Behavior Research Methods, 40(4), 907–925. doi:10.3758/BRM.40.4.907
Frishkoff, G. A., Perfetti, C. A., & Collins-Thompson, K. (2010). Lexical quality in the brain: ERP evidence for robust word learning from context. Developmental Neuropsychology, 35(4), 376–403. doi:10.1080/87565641.2010.480915
Fukkink, R. G., Blok, H., & de Glopper, K. (2001). Deriving Word Meaning from Written Context: A Multicomponential Skill. Language Learning, 51(3), 477–496. doi:10.1111/0023-8333.00162
Gabrilovich, E., & Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In International Joint Conferences on Artificial Intelligence (pp. 1606–1611).
Goldman, S. R., Hogaboam, T. W., Bell, L. C., & Perfetti, C. A. (1980). Short-Term Retention of Discourse During Reading. Journal of Educational Psychology, 72(5), 647–655.
Green, A. E., Kraemer, D. J. M., Fugelsang, J. A., Gray, J. R., & Dunbar, K. N. (2010). Connecting long distance: semantic distance in analogical reasoning modulates frontopolar cortex activity. Cerebral Cortex, 20(1), 70–76. doi:10.1093/cercor/bhp081
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation. Psychological Review, 114(2), 211–244. doi:10.1037/0033-295X.114.2.211
Groppe, D. M., Urbach, T. P., & Kutas, M. (2011). Mass univariate analysis of event-related brain potentials/fields I: A critical tutorial review. Psychophysiology, 48(12), 1711–1725. doi:10.1111/j.1469-8986.2011.01273.x
Guyon, I., & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157–1182.
Halgren, E., Dhond, R. P., Christensen, N., Van Petten, C., Marinkovic, K., Lewine, J. D., & Dale, A. M. (2002). N400-like Magnetoencephalography Responses Modulated by Semantic Context, Word Frequency, and Lexical Class in Sentences. NeuroImage, 17(3), 1101–1116. doi:10.1006/nimg.2002.1268
Hall, M. A. (1999). Correlation-based Feature Selection for Machine Learning. The University of Waikato.
Hartuv, E., & Shamir, R. (1999). A Clustering Algorithm based on Graph Connectivity. Information Processing Letters, 76(4), 1–9.
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: what is it, who has it, and how did it evolve? Science, 298(5598), 1569–1579. doi:10.1126/science.298.5598.1569
Hintzman, D. L. (1984). MINERVA 2: A simulation model of human memory. Behavior Research Methods, Instruments, & Computers, 16, 96–101.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the United States of America, 79, 2554–2558.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Howard, M. W., Addis, K. M., Jing, B., & Kahana, M. J. (2005). Semantic structure and episodic memory. In T. K. Landauer, D. McNamara, S. Dennis, & W. Kintsch (Eds.), Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum.
Hughes, T., & Ramage, D. (2007). Lexical Semantic Relatedness with Random Graph Walks, (June), 581–589.
Hutchison, K. A. (2003). Is semantic priming due to association strength or feature overlap? A microanalytic review. Psychonomic Bulletin & Review, 10(4), 785–813.
Inkpen, D. (2007). A statistical model for near-synonym choice. ACM Transactions on Speech and Language Processing, 4(1), 1–17. doi:10.1145/1187415.1187417
Islam, A., & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data, 2(2), 1–25. doi:10.1145/1376815.1376819
Jackson, A. F., & Bolger, D. J. (n.d.). Neurophysiological Markers of Learning Word Meaning from Context. In preparation.
Jarmasz, M. (2003). Roget's thesaurus as a lexical resource for natural language processing.
Jenkins, J. R., Stein, M. L., & Wysocki, K. (1984). Learning vocabulary through reading. American Educational Research Journal, 21(4), 767–787.
Jiang, J. J., & Conrath, D. W. (1997). Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In Proceedings of International Conference Research on Computational Linguistics.
Jones, M. N., & Mewhort, D. J. K. (2007). Representing word meaning and order information in a composite holographic lexicon. Psychological Review, 114(1), 1–37. doi:10.1037/0033-295X.114.1.1
Kaan, E. (2007). Event-Related Potentials and Language Processing: A Brief Overview. Language and Linguistics Compass, 1(6), 571–591. doi:10.1111/j.1749-818X.2007.00037.x
Kakkonen, T., Myller, N., & Sutinen, E. (2006). Applying part-of-speech enhanced LSA to automatic essay grading. In Proceedings of the 4th IEEE International Conference on Information Technology: Research and Education (pp. 500–504).
Kakkonen, T., Myller, N., Timonen, J., & Sutinen, E. (2005). Automatic Essay Grading with Probabilistic Latent Semantic Analysis. In Proceedings of the 2nd Workshop on Building Educational Applications using NLP (pp. 29–36).
Kintsch, W., & Mangalath, P. (2011). The Construction of Meaning. Topics in Cognitive Science, 3(2), 346–370. doi:10.1111/j.1756-8765.2010.01107.x
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and production. Psychological Review, 85(5), 363–394.
Koivisto, M., & Revonsuo, A. (2001). Cognitive representations underlying the N400 priming effect. Cognitive Brain Research, 12(3), 487–490. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11689310
Kolb, P. (2006). Experiments on the difference between semantic similarity and relatedness. In Proceedings of the 17th Nordic Conference of Computational Linguistics (pp. 81–88).
Kutas, M., & Federmeier, K. D. (2011). Thirty Years and Counting: Finding Meaning in the N400 Component of the Event-Related Brain Potential (ERP). Annual Review of Psychology, 62, 14.1–14.27. doi:10.1146/annurev.psych.093008.131123
Kwantes, P. J. (2005). Using context to build semantics. Psychonomic Bulletin & Review, 12(4), 703–710. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/16447385
Landauer, T. K., & Dumais, S. T. (1997). A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review, 104(2), 211–240.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25(2–3), 259–284. doi:10.1080/01638539809545028
Landauer, T. K., Laham, D., & Foltz, P. (1997). Learning Human-like Knowledge by Singular Value Decomposition: A Progress Report. In Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems.
Lau, E. F., Phillips, C., & Poeppel, D. (2008). A cortical network for semantics: (de)constructing the N400. Nature Reviews Neuroscience, 9, 920–933. doi:10.1038/nrn2532
Ledoux, K., Camblin, C. C., Swaab, T. Y., & Gordon, P. C. (2006). Reading words in discourse: The modulation of lexical priming effects by message-level context. Behavioral and Cognitive Neuroscience Reviews, 5(3), 107–127.
Lee, S., Baker, J., Song, J., & Wetherbe, J. C. (2010). An Empirical Comparison of Four Text Mining Methods. In 43rd Hawaii International Conference on System Sciences (pp. 1–10). IEEE. doi:10.1109/HICSS.2010.48
Levin, E., Sharifi, M., & Ball, J. (2006). Evaluation of Utility of LSA for Word Sense Discrimination. In Proceedings of the Human Language Technology Conference of the North American Chapter of the ACL (pp. 77–80).
Li, C., Sun, A., & Datta, A. (2011). A Generalized Method for Word Sense Disambiguation based on Wikipedia. In ECIR'11 Proceedings of the 33rd European Conference on Advances in Information Retrieval.
Lin, D., & Pantel, P. (2002). Concept Discovery from Text. In Proceedings of ACM Special Interest Group on Information Retrieval (pp. 199–206). Tampere, Finland.
Lotte, F., Congedo, M., Lécuyer, A., Lamarche, F., & Arnaldi, B. (2007). A review of classification algorithms for EEG-based brain-computer interfaces. Journal of Neural Engineering, 4(2), R1–R13. doi:10.1088/1741-2560/4/2/R01
Lowe, W. (2000). What is the Dimensionality of Human Semantic Space? In Proceedings of the 6th Neural Computation and Psychology Workshop (pp. 303–311).
Luke, D. A., & Harris, J. K. (2007). Network analysis in public health: history, methods, and applications. Annual Review of Public Health, 28, 69–93. doi:10.1146/annurev.publhealth.28.021406.144132
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2), 203–208.
Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. In Proceedings of the Cognitive Science Society (pp. 660–665). Hillsdale, NJ: Erlbaum Publishers.
Magliano, J. P., & Graesser, A. C. (2012). Computer-based assessment of student-constructed responses. Behavior Research Methods, 44(3), 608–621. doi:10.3758/s13428-012-0211-3
Marton, Y., Mohammad, S., & Resnik, P. (2009). Estimating Semantic Distance Using Soft Semantic Constraints in Knowledge-Source–Corpus Hybrid Models. In Conference on Empirical Methods in Natural Language Processing.
McDonald, S., & Ramscar, M. (2000). Testing the Distributional Hypothesis: The Influence of Context on Judgements of Semantic Similarity. In Proceedings of the 23rd Annual Conference of the Cognitive Science Society.
Mervis, C. B., & Rosch, E. (1981). Categorization of natural objects. Annual Review of Psychology, 32, 89–115.
Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., The Google Books Team, … Aiden, E. L. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014), 176–182. doi:10.1126/science.1199644
Mihalcea, R., Corley, C., & Strapparava, C. (2005). Corpus-based and Knowledge-based Measures of Text Semantic Similarity. In Proceedings of the 21st National Conference on Artificial Intelligence (pp. 775–780).
Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41. doi:10.1145/219717.219748
Minkov, E., & Cohen, W. W. (2008). Learning Graph Walk Based Similarity Measures for Parsed Text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 907–916).
Osterhout, L., Kim, A., & Kuperberg, G. (2006). The Neurobiology of Sentence Comprehension. In The Cambridge Handbook of Psycholinguistics (pp. 1–23). Cambridge: Cambridge University Press.
Padó, S., & Lapata, M. (2006). Dependency-based Construction of Semantic Space Models. Computational Linguistics, 33(2), 161–199.
Palla, G., Derényi, I., Farkas, I., & Vicsek, T. (2005). Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435(7043), 814–818. doi:10.1038/nature03607
Parviz, M., Johnson, M., Johnson, B., & Brock, J. (2011). Using Language Models and Latent Semantic Analysis to Characterise the N400m Neural Response. In Proceedings of the Australasian Language Technology Association Workshop (pp. 38–46).
Perfetti, C. A., Wlotko, E. W., & Hart, L. A. (2005). Word learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31(6), 1281–1292.
Piattelli-Palmarini, M. (1980). How hard is the "hard core" of a scientific program? In Language and Learning: The Debate between Jean Piaget and Noam Chomsky (the Royaumont debate) (pp. 1–20).
Plaut, D. C., & Booth, J. R. (2000). Individual and developmental differences in semantic priming: empirical and computational support for a single-mechanism account of lexical processing. Psychological Review, 107(4), 786–823. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11089407
Radev, D., & Mihalcea, R. (2008). Networks and Natural Language Processing. Artificial Intelligence Magazine, 29(3), 16–28.
Recchia, G., & Jones, M. N. (2009). More data trumps smarter algorithms: Comparing pointwise mutual information with latent semantic analysis. Behavior Research Methods. doi:10.3758/BRM.
Rehurek, R., & Sojka, P. (2010). Software Framework for Topic Modelling with Large Corpora. In LREC 2010.
Rodd, J. M., Gaskell, M. G., & Marslen-Wilson, W. D. (2004). Modelling the effects of semantic ambiguity in word recognition. Cognitive Science, 28, 89–104. doi:10.1016/j.cogsci.2003.08.002
Rogers, T. T., & McClelland, J. L. (2011). Semantics without categorization. In E. M. Pothos & A. J. Wills (Eds.), Formal Approaches in Categorization (pp. 88–119). Cambridge: Cambridge University Press.
Rohde, D. L. T., Gonnerman, L. M., & Plaut, D. C. (2005). An improved model of semantic similarity based on lexical co-occurrence. Unpublished manuscript.
Rubenstein, H., & Goodenough, J. B. (1965). Contextual Correlates of Synonymy. Communications of the ACM, 8(10), 627–633.
Ruxton, G. D. (2006). The unequal variance t-test is an underused alternative to Student's t-test and the Mann–Whitney U test. Behavioral Ecology, 17(4), 688–690. doi:10.1093/beheco/ark016
Schaeffer, S. E. (2007). Graph clustering. Computer Science Review, 1(1), 27–64. doi:10.1016/j.cosrev.2007.05.001
Schatz, E. K., & Baldwin, R. S. (1986). Context clues are unreliable predictors of word meanings. Reading Research Quarterly, 21(4), 439–453.
Scheel, S. L. (1998). French language purism: French linguistic development and current national attitudes.
Schütze, H. (1998). Automatic Word Sense Discrimination. Computational Linguistics, 24(1), 97–123.
Sereno, S. C., Brewer, C. C., & O'Donnell, P. J. (2003). Context effects in word recognition: evidence for early interactive processing. Psychological Science, 14(4), 328–333. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/12807405
Shaoul, C., & Westbury, C. (2010). Exploring lexical co-occurrence space using HiDEx. Behavior Research Methods, 42(2), 393–413. doi:10.3758/BRM.42.2.393
Silva, T. C., & Amancio, D. R. (2013). Discriminating word senses with tourist walks in complex networks. The European Physical Journal B, 86(7), 297. doi:10.1140/epjb/e2013-40025-4
Silver, R. A. (2010). Neuronal arithmetic. Nature Reviews Neuroscience, 11(7), 474–489. doi:10.1038/nrn2864
Steyvers, M., & Tenenbaum, J. B. (2005). The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cognitive Science, 29(1), 41–78. doi:10.1207/s15516709cog2901_3
Strube, M., & Ponzetto, S. P. (2006). WikiRelate! Computing Semantic Relatedness Using Wikipedia. In AAAI'06 Proceedings of the 21st National Conference on Artificial Intelligence (pp. 1419–1424).
Sun, Z., Wang, H., Wang, H., Shao, B., & Li, J. (2012). Efficient Subgraph Matching on Billion Node Graphs. In Proceedings of the VLDB Endowment (pp. 788–799).
Swanborn, M. S. L., & de Glopper, K. (1999). Incidental Word Learning while Reading: A Meta-Analysis. Review of Educational Research, 69(3), 261–285.
Swanborn, M. S. L., & de Glopper, K. (2002). Impact of Reading Purpose on Incidental Word Learning From Context. Language Learning, 52(1), 95–117. doi:10.1111/1467-9922.00178
Thompson-Schill, S. L., Kurtz, K. J., & Gabrieli, J. D. E. (1998). Effects of Semantic and Associative Relatedness on Automatic Priming. Journal of Memory and Language, 38, 440–458.
Thornhill, D. E., & Van Petten, C. (2012). Lexical versus conceptual anticipation during sentence processing: Frontal positivity and N400 ERP components. International Journal of Psychophysiology. doi:10.1016/j.ijpsycho.2011.12.007
Tivarus, M. E., Ibinson, J. W., Hillier, A., Schmalbrock, P., & Beversdorf, D. Q. (2006). An fMRI study of semantic priming: modulation of brain activity by varying semantic distances. Cognitive and Behavioral Neurology, 19(4), 194–201. doi:10.1097/01.wnn.0000213913.87642.74
Tremblay, A., & Newman, A. J. (2013). Modelling Non-linear Relationships in ERP Data Using Mixed-effects Regression with R Examples.
Tsang, V., & Stevenson, S. (2010). A Graph-Theoretic Framework for Semantic Distance, (December 2007).
Turney, P. D. (2001). Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL. In Proceedings of the 12th European Conference on Machine Learning (pp. 1–12).
Utsumi, A. (2010). Exploring the Relationship between Semantic Spaces and Semantic Relations. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (pp. 257–262).
Van Daalen-Kapteijns, M., & Elshout-Mohr, M. (1981). The Acquisition of Word Meanings as a Cognitive Learning Process. Journal of Verbal Learning and Verbal Behavior, 20, 386–399.
Van Daalen-Kapteijns, M., Elshout-Mohr, M., & de Glopper, K. (2001). Deriving the Meaning of Unknown Words From Multiple Contexts. Language Learning, 51(1), 145–181.
Velik, R. (2008). Discrete Fourier Transform Computation Using Neural Networks. In 2008 International Conference on Computational Intelligence and Security (pp. 120–123). doi:10.1109/CIS.2008.36
Wang, T., & Hirst, G. (2010). Near-synonym Lexical Choice in Latent Semantic Space. In Proceedings of the 23rd International Conference on Computational Linguistics (pp. 1182–1190).
Weeds, J., & Weir, D. (2005). Co-occurrence Retrieval: A Flexible Framework for Lexical Distributional Similarity. Computational Linguistics, 31(4), 439–475.
Weeds, J., Weir, D., & McCarthy, D. (2004). Characterising Measures of Lexical Distributional Similarity. In Proceedings of the 20th International Conference on Computational Linguistics.
Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2009). Distance Metric Learning for Large Margin Nearest Neighbor Classification. The Journal of Machine Learning Research, 10, 207–244.
Widdows, D., & Dorow, B. (2002). A Graph Model for Unsupervised Lexical Acquisition. In Proceedings of the 19th International Conference on Computational Linguistics.
Wiemer-Hastings, P. (2000). Adding syntactic information to LSA. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society.
Xu, P., & Jelinek, F. (2004). Random forests in language modeling. In Empirical Methods in Natural Language Processing.