ABSTRACT
Title of dissertation: THE ROLE OF STRUCTURAL INFORMATION IN THE RESOLUTION OF LONG-DISTANCE DEPENDENCIES
Anton Malko, Doctor of Philosophy, 2019
Dissertation directed by: Professor Colin Phillips, Department of Linguistics
The main question that this thesis addresses is: in what way does structural information enter into the processing of long-distance dependencies? Does it constrain the computations, and if so, to what degree? Available experimental evidence suggests that sometimes structurally illicit but otherwise suitable constituents are accessed during dependency resolution. Subject-verb agreement is a prime example (Wagers et al., 2009; Dillon et al., 2013), and similar effects were reported for negative polarity item (NPI) licensing (Vasishth et al., 2008) and reflexive pronoun resolution (Parker and Phillips, 2017; Sloggett, 2017). Prima facie this evidence suggests that structural information fails to perfectly constrain real-time language processing to be in line with grammatical constraints. This conclusion would fall neatly in line with an assumption that human sentence processing relies on cue-based memory (e.g. McElree et al., 2003; Lewis and Vasishth, 2005; Van Dyke and Johns, 2012; Wagers et al., 2009, among many others), the key property of which is the fragility of memory search, which can return irrelevant results if they look similar enough to the relevant ones. The attractiveness of such an approach lies in its parsimony: there is independent evidence that general-purpose working memory is cue-based (Jonides et al., 2008), so we do not need to postulate any language-specific mechanisms. Additionally, the processing of multiple linguistic dependencies can be analyzed within the same theoretical framework.
The cue-based approach has also been argued to be the best one in terms of its empirical coverage: some of the experimental evidence was assumed to be explainable only within it (the absence of ungrammaticality illusions in subject-verb agreement is the main example, to which we will return in more detail later). However, recently several other approaches have been suggested which would be able to account for these cases (Eberhard et al., 2005; Xiang et al., 2013; Sloggett, 2017; Hammerly et al., draft.april.2018). These approaches usually assume separate processing mechanisms for different linguistic dependencies, and thus lose the parsimony that makes cue-based memory models attractive. They also take a different stance on the role of structural information in real-time language processing, assuming that structural cues do accurately guide dependency resolution. A priori there is no reason why they could not turn out to be true. But given the theoretical attractiveness of cue-based models in which structural information does not categorically constrain processing, it is important to critically evaluate these recent claims. In this thesis, we focus on reflexive pronouns and on the novel pattern reported in Parker and Phillips (2017) and Sloggett (2017): the finding that reflexive pronouns are sensitive to the properties of structurally inaccessible antecedents in some specific conditions (an interference effect). The two works report consistent findings, but the accounts they give take opposite perspectives on the role of structural information in reflexive resolution. Our aim in this thesis is to assess the reliability of these findings and to experimentally investigate cases which would hopefully provide clearer evidence on how structure guides the processing of reflexives. To this end, we conduct two direct replications of Parker and Phillips (2017) and four novel experiments further investigating the properties of the interference effect.
None of the six experiments provided strong statistical support for the previous findings. After ruling out several possible confounds and analyzing numerical patterns (which go in the expected direction and are consistent with previous results), we conclude that the interference effect is likely real, but may be less strong than the previous studies would lead one to believe. These results can be used for setting more realistic expectations for future studies regarding the size of the effect and the statistical power necessary to detect it. With respect to our main goal of distinguishing between cue-based and alternative accounts of the interference effect, we tentatively conclude that cue-based approaches are preferred; however, one has to assume that some structural features are able to categorically rule out illicit antecedents. Further highly powered studies are necessary to verify and confirm these conclusions.
The role of structural information in the resolution of long-distance dependencies
by Anton Malko
Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2019
Advisory Committee: Professor Colin Phillips, Chair; Assistant Professor Ellen Lau; Professor Jeffrey Lidz; Associate Professor Omer Preminger; Associate Professor Robert Slevc
© Copyright by Anton Malko 2019
Acknowledgments
I have finally reached the stage of writing this part of my dissertation. The path to this point has been long and not always straightforward, and I would like to express my gratitude to the people who supported me during the process and who made this journey even remotely possible (the list is far from being exhaustive, of course!). My greatest thank you goes to my advisor, Colin Phillips.
Colin patiently guided me through my sometimes confused thoughts and continuously pushed me to re-think my assumptions, to shape my ideas better and to communicate them extremely clearly - all of this without explicitly holding my hand and giving me enough room for trying things out on my own. Without Colin’s guidance and support this thesis would truly have been impossible, and I owe him my deepest gratitude.
I greatly enjoyed our discussions and conversations with Ellen Lau. She was always able to provide encouraging advice and also point out to me some unexpected connections between my ideas and existing research. Ellen somehow always managed to make things brighter and clearer than they had appeared to me - it was such a comforting and stimulating experience.
The research I discuss in this thesis would have been much harder without the logistical support I received. First, I would like to thank those who helped by providing native speaker judgments, helping to prepare experimental stimuli and collecting the data: Julia Buffinton, Hanna Muller, Maggie Kandel, Lalitha Balachandran, Cassidy Wyatt. I express my gratitude to all of them. Thank you also to Kim Kwok, who has been tremendously helpful and supportive in all the administrative questions, especially considering the enormous amount of work she was (and is!) doing. Finally, I want to acknowledge the financial support I received from the Department of Linguistics, the Language Science Center and NSF (the research reported in this thesis is based upon work supported by the National Science Foundation under Grant No. 1449815).
Thank you to the people who sparked and supported my interest in statistics and modeling. To Michael Dougherty and Kevin O’Grady for teaching my first ever statistics courses in a fun way and making me excited about them - I don’t think I would have been able to write a lot of this thesis (especially Chapter 4!) without the foundations they gave me.
To Naomi Feldman, who introduced me to computational psycholinguistics, and to Philip Resnik for his exciting and extremely lucid discussion of NLP ideas in the computational linguistics class. If I am now interested in computational modeling and am not afraid of MCMC methods, it is because of Naomi and Philip. Also thank you to my fellow grad students, Phoebe Gaston, Hanna Muller and Adam Liter, for sharing my interest in statistics, for the frequent related discussions we had and for the little quest for understanding Bayesian approaches we pursued together.
I would never have learned the things above if I had not come to Maryland, and I would like to express my gratitude to several people who made this possible. First and foremost, to Natalia Slioussar, whom I was very lucky to have as my Master’s thesis advisor. My PhD dissertation would not have even been considered if I had not met her in my MA years. Natalia introduced me to (psycho)linguistics in general, being very generous with her time and advice. The current thesis would probably have been very different if not for our work on agreement attraction in Russian, which quite directly connected me to the topics I consider here. She was also the one who greatly influenced my decision to come to Maryland. I am deeply grateful for all the guidance, support and advice Natalia gave me in the years I have known her.
People from NYI 2012 in Saint-Petersburg shaped my idea of going to graduate school abroad at all - most prominently, Sabine Iatridou. And thank you to Nikos Angelopoulos and Christine Boucher for supporting me in this idea and for the great fun we had together in Saint-Petersburg.
My list of thanks would be incomplete without mentioning my friends from inside and outside of the department, who added a lot of fun and enjoyment to the graduate school experience.
Thank you to the Hyattsville gang - Allyson Ettinger, Lara Ehrenhofer, Kasia Hitczenko, Phoebe Gaston, Christian Brodbeck, Paulina Lyskawa, Miloš Nikolić, Suddhasattwa Das, Amit Nag - for all the great experiences we had, including, to name just a few, conversations about everything and anything, kayaking, movie nights, an Easter egg hunt, playing Settlers of Catan, and several kidnappings. I remember these fondly. Ilia Kurenkov and Natalia Lapinskaya were among the very first people I met after coming to Maryland, and they remained good friends throughout my time here. Thanks to them I have not forgotten my Russian completely! Ilia was also the person who introduced me to LaTeX and git, and thus indirectly contributed to the creation of this document. Thank you to Sol Lago, Shota Momma, William Matchin, Eric Pelzl and Nick Huang for many invariably enjoyable conversations throughout these years.
I cannot even start expressing my gratitude towards Lena. I would hardly have survived the final stages of writing without her continuous support; but so much more would have been impossible without her. Thank you for all the lessons you taught me and the love and joy you shared with me.
Finally, thank you to my family for giving me the foundations which allowed me to make it this far, and for supporting me and believing in me all the way through.
Table of Contents
Acknowledgements
List of Tables
List of Figures
1 Introduction
1.1 Empirical evidence
1.2 Evidence interpretation
1.2.1 Accounts of lure-match effects
1.2.2 Choosing an account
1.2.3 Evidence interpretation: summary
1.3 Lewis & Vasishth (2005) memory model
1.3.1 Explaining RT patterns
1.4 Recent arguments against structure-defeasible models
1.4.1 LV05 model fit to agreement data is worse than usually thought
1.4.2 Reflexives processing is qualitatively different from agreement
1.5 Parker & Phillips (2017) vs. Sloggett (2017)
1.5.1 Empirical observations
1.5.1.1 PP2017
1.5.1.2 S2017
1.5.2 Empirical coverage: summary
1.5.3 My experiments
2 Interference effects from non-c-commanding lures
2.1 Experiment 1
2.1.1 Participants
2.1.2 Materials
2.1.3 Procedure
2.1.4 Analysis
2.1.5 Results
2.1.6 Exploratory analysis
2.1.7 Discussion
3 Interference effects with quantificational lures
3.1 Representing c-command in cue-based models
3.2 Outline of the experiments
3.3 Experiment 2
3.3.1 Participants
3.3.2 Materials
3.3.3 Procedure
3.3.4 Analysis
3.3.5 Results
3.3.6 Discussion
3.4 Experiment 3
3.4.1 Participants
3.4.2 Materials
3.4.3 Procedure
3.4.4 Analysis
3.4.5 Results
3.4.6 Discussion
3.5 Experiment 4
3.5.1 Participants
3.5.2 Materials
3.5.3 Procedure
3.5.4 Analysis
3.5.5 Results
3.5.6 Discussion
3.6 General discussion
4 Replicability of lure-match effects in reflexive resolution
4.1 Experiment 5
4.1.1 Participants
4.1.2 Materials
4.1.3 Procedure
4.1.4 Analysis
4.1.5 Results
4.1.6 Sensitivity analysis
4.1.6.1 Qualitative analysis
4.1.6.2 Quantitative analysis
4.1.7 Discussion
4.2 Experiment 6
4.2.1 Participants
4.2.2 Materials
4.2.3 Procedure
4.2.4 Analysis
4.2.5 Results
4.2.6 Discussion
4.3 Interference magnitude across studies
4.4 Pooled data analysis
4.5 General discussion
5 Conclusions
5.1 Choosing between PP2017 and S2017
5.1.1 On sub-command binding
5.1.2 On the role of c-command
5.1.3 On the reliability of lure-match effect
5.1.3.1 Magnitude of the effect
5.1.3.2 Timing of the effect
5.1.4 Choosing between the two accounts
5.2 Future directions
5.3 Conclusion
Appendix A Mean reaction times in PP2017 and replication (PP2017 analysis)
Appendix B Mean RTs
Appendix C Model coefficients
Appendix D List of changes to the Parker and Phillips (2017) stimuli in the replication experiment
Appendix E Appendix: Experiment 2 fillers types
Bibliography
List of Tables
2.1 Experiment 1 materials example
2.2 Order of random effect structure simplification in Exp.1.
3.1 Experiment 2 materials example
3.2 Experiment 3 materials example
3.3 Experiment 4 materials example
4.1 Experiment 5 materials example
4.2 Order of random effect structure simplification, Exp.5. Effects were removed from the model, starting from the top.
A.1 Mean RTs in PP2017
A.2 Mean RTs in the replication of PP2017
B.1 Experiment 2 means for agreement stimuli.
B.2 Experiment 2 means for reflexives stimuli.
B.3 Experiment 3 means for agreement stimuli.
B.4 Experiment 3 means for reflexives stimuli.
B.5 Experiment 4 mean reaction times.
C.1 Model coefficients for Experiment 1.
C.2 Model coefficients for Experiment 2.
C.3 Model coefficients for Experiment 3.
C.4 Model coefficients for Experiment 4.
E.2 Experiment 2 stimuli types and counts.
List of Figures
2.1 Experiment 1. RT means for agreement sentences.
2.2 Experiment 1. RT means for reflexive sentences.
2.3 Sensitivity analysis: Model coefficients in Exp.1
2.4 Experiment 1. Comparisons of the interference effect magnitudes in reflexive conditions with Parker and Phillips (2017).
3.1 Experiment 2. RT means for agreement sentences.
3.2 Experiment 2. RT means for reflexive sentences.
3.3 Experiment 2. Comparisons of the interference effect magnitudes in reflexive conditions with Parker and Phillips (2017).
3.4 Experiment 2. Comparisons of the interference effect magnitudes in reflexive conditions with Parker and Phillips (2017).
3.5 Experiment 3. RT means for agreement sentences.
3.6 Experiment 3. RT means for reflexive sentences.
3.7 Experiment 3. Comparisons of the interference effect magnitudes in reflexive conditions with Parker and Phillips (2017).
3.8 Experiment 4. RT means for agreement sentences.
3.9 Experiment 4. Comparisons of the interference effect magnitudes in 2-feature target mismatch conditions with PP2017 and S2017.
4.1 Mean RTs in PP2017 (red) and my replication (blue).
4.2 Interference effect in target: two mismatch conditions (lure match - lure mismatch) in PP2017 (red) and my replication (blue).
4.3 Model estimates in PP2017 (red) and my replication (blue).
4.4 Predicted and observed RT values for PP2017 and my replication.
4.5 Sensitivity analysis: Intrusion effect (lure mismatch minus lure match) in 2-feature target mismatch conditions.
4.6 Sensitivity analysis: Model coefficients in PP2017 Exp.3 and replication.
4.7 Mean RTs in PP2017 (red), my previous replication (blue) and my current replication (green).
4.8 Interference effect in target: two mismatch conditions (lure match - lure mismatch) in PP2017 (red), my previous replication (blue) and my current replication (green).
4.9 Model estimates in PP2017 (red) and my replication (blue).
4.10 Mean RTs in PP2017 (red), my previous replication (blue) and my current replication (green).
4.11 Interference effect in target: two mismatch conditions (lure match - lure mismatch) in PP2017 (red), my previous replication (blue) and my current replication (green).
4.12 Model estimates in PP2017 (red) and my replication (blue). Exploratory analysis: removing missing values instead of replacing them with zeros.
4.13 Intrusion effect in five studies with most similar design (PP2017 Exp.3, its two replications (this thesis Exp. 5 and 6), S2017 Exp.1c, this thesis Exp.4).
4.14 Intrusion effect in all studies investigating 2-feature target mismatch configurations (PP2017, S2017, the current study).
4.15 Distribution of lure-match effect sizes in 2-feature target mismatch conditions obtained in a bootstrapping simulation.
4.16 Model estimates for the models fit to the pooled data from Exp.5 and 6.
Chapter 1: Introduction
The main question that this thesis addresses is: in what way does structural information enter into the processing of long-distance dependencies?
Does it constrain the computations, and if so, to what degree? Available experimental evidence suggests that sometimes structurally illicit but otherwise suitable constituents are accessed during dependency resolution. Subject-verb agreement is a prime example (Wagers et al., 2009; Dillon et al., 2013), and similar effects were reported for negative polarity item (NPI) licensing (Vasishth et al., 2008) and reflexive pronoun resolution (Parker and Phillips, 2017; Sloggett, 2017). Prima facie this evidence suggests that structural information fails to perfectly constrain real-time language processing to be in line with grammatical constraints. This conclusion would fall neatly in line with an assumption that human sentence processing relies on cue-based memory (e.g. McElree et al., 2003; Lewis and Vasishth, 2005; Van Dyke and Johns, 2012; Wagers et al., 2009, among many others), which has similarity-based interference as a key property. The attractiveness of such an approach lies in its parsimony: there is independent evidence that general-purpose working memory is cue-based (Jonides et al., 2008), so we do not need to postulate any language-specific mechanisms. Additionally, the processing of multiple linguistic dependencies can be analyzed within the same theoretical framework.
Empirically, some of the experimental evidence was assumed to be explainable only within a cue-based approach (the absence of ungrammaticality illusions in subject-verb agreement is the main example, to which I will return in more detail later). However, recently several other approaches have been suggested which would be able to account for these cases, even if we assume that structural information accurately guides parsing (Eberhard et al., 2005; Hammerly et al., 2018; Xiang et al., 2013; Sloggett, 2017). These approaches usually assume separate mechanisms for different linguistic dependencies, and thus lose the parsimony that makes cue-based memory models attractive.
A priori there is no reason why they could not turn out to be true. But given the theoretical attractiveness of cue-based models in which structural information does not categorically constrain processing, it is important to critically evaluate these recent claims. In this thesis, I focus on reflexive pronouns and on the novel pattern reported in Parker and Phillips (2017) and Sloggett (2017). The two works report consistent findings, but the accounts they give take opposite perspectives on the role of structural information in reflexive resolution. My aim in this thesis is to experimentally investigate cases which could provide clearer evidence on how structure guides the processing of reflexives.
The rest of this chapter is structured as follows. First, I survey available experimental evidence in subject-verb agreement and reflexive resolution. Then, I describe a prominent cue-based model used to account for these data and show how it can explain them. As we will see, adopting this explanation for the data will make us assume that structural information is not able to categorically rule out illicit dependency formation. I proceed to discuss the recent alternative approaches which assume that structure restricts the processing to be compliant with the grammar. Finally, I describe the cases which would distinguish between the two classes of approaches and outline the experiments I carry out to investigate these cases.
1.1 Empirical evidence
In this section I discuss the evidence for grammatically illicit constituents being considered during real-time processing. I start the discussion with the most popular experimental paradigm used in this line of research, and then I turn to empirical data on subject-verb agreement, NPI licensing and reflexive resolution.
The feature-mismatch paradigm is very widely used in investigations of long-distance dependency formation.
It is based on the observation that people quickly notice if the feature composition of the two dependency elements does not match (for instance, the subject is singular and the verb is plural) (e.g. Osterhout and Mobley, 1995; Osterhout et al., 1997). We can use this reaction to assess whether feature mismatch with a grammatically inaccessible, but otherwise plausible, dependency component elicits similar complications. Thus, the paradigm is usually structured as follows. People are presented with sentences containing the dependency of interest and two possible ways to resolve it - one grammatically licensed and the other one not. For example, people may be presented with a sentence like “The key to the cabinet is rusty”. This sentence contains two NPs, both of which could in principle control verbal agreement, but only one of them - “the key” - is in the correct structural position to do so, i.e. is a grammatically licensed agreement controller. In what follows, I will call the grammatically licit option the “target”, and the grammatically illicit one the “lure”. The features of the target and the lure are factorially manipulated, so that they either match or mismatch the other dependency element. In the example above, the number of both NPs could be manipulated so that they match or mismatch the number of the verb. We do expect to observe some indication of processing difficulty in target mismatch conditions as compared to target match. But of critical interest is whether we observe any reaction to feature mismatch with the lure. If we do, we can conclude that at some point in time feature information on the lure entered the processing system in a way in which it was able to affect the dependency resolution process. In what follows, I will refer to any lure-related effects as “lure-match effects” or “interference”.
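The factorial logic of the paradigm described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function and variable names are hypothetical, the verb is held fixed at singular “is” while number is varied on the two nouns, the reading times are invented numbers rather than data from any study, and the sign convention (lure mismatch minus lure match) is one of several possible choices.

```python
from itertools import product

LEVELS = ("match", "mismatch")  # does the noun's number match the verb's?

def sentence(target, lure):
    """Return the test sentence for one cell of the 2x2 design.

    The verb is fixed as singular "is", so a singular noun counts as a match.
    """
    head = "key" if target == "match" else "keys"          # licensed controller (target)
    local = "cabinet" if lure == "match" else "cabinets"   # structurally illicit noun (lure)
    return f"The {head} to the {local} is rusty."

# All four target x lure combinations.
CONDITIONS = list(product(LEVELS, LEVELS))

def lure_match_effect(mean_rt, target):
    """Lure-mismatch minus lure-match mean RT, within one target level."""
    return mean_rt[(target, "mismatch")] - mean_rt[(target, "match")]

# Invented mean reading times in ms, for illustration only (not data from any study).
hypothetical_rt = {
    ("match", "match"): 350.0,
    ("match", "mismatch"): 352.0,
    ("mismatch", "match"): 380.0,
    ("mismatch", "mismatch"): 410.0,
}

for target, lure in CONDITIONS:
    print(f"target {target:8} | lure {lure:8} | {sentence(target, lure)}")

for target in LEVELS:
    effect = lure_match_effect(hypothetical_rt, target)
    print(f"target {target}: lure-match effect = {effect:.1f} ms")
```

Under this convention, a positive value means that the lure-match condition was read faster than the lure-mismatch condition at that level of the target factor, and a negative value means the opposite.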
I will talk about “facilitation” (“facilitatory interference”) if the lure-match conditions are read faster than the lure-mismatch conditions, and about “inhibition” (“inhibitory interference”) otherwise.
It is important to keep in mind that the interpretation of the results obtained with this paradigm is not always straightforward. Its goal is to assess whether the presence of the lure affects processing in any measurable way: there may be differences in interpretations, reaction times, electrophysiological responses, and so on. However, the mere fact that a certain constituent affects processing does not mean that it is actively accessed by the parser; the differences we observe may well stem from other sources. Therefore, additional steps must be taken in order to ensure that the evidence we obtain is indeed evidence about access mechanisms. I will return to this discussion after we have surveyed the available experimental evidence.
Subject-verb agreement
Subject-verb agreement presents a prime example of a dependency for which lure-match effects are observed: people tend to mis-identify the agreement controller in sentences like (1a) - (1b), a phenomenon known as “agreement attraction”. Both of the sentences are ungrammatical, since the head of the subject noun phrase (“key”) mismatches the verb in number. However, this violation is perceived as milder in the second sentence. This effect is apparent in acceptability judgments and in the reaction time (RT) profile (with (1b) eliciting higher ratings and faster RTs than (1a)) (Wagers et al., 2009; Dillon et al., 2013; Hammerly et al., 2018), as well as in electrophysiological measures (Xiang et al., 2013; Tanner et al.). I will refer to this effect as a “grammaticality illusion”: an ungrammatical sentence may have been fleetingly considered grammatical by the readers. Grammaticality illusions have been observed in a number of languages and with multiple grammatical features (e.g., see Lago et al.
(2015) for Spanish, Tucker et al. (submitted) for Arabic and Slioussar and Malko (2016) for Russian).
(1) a. *The key to the cabinet are on the table.
b. *The key to the cabinets are on the table.
c. The key to the cabinets is on the table.
d. The key to the cabinet is on the table.
While the presence of a lure-match effect in ungrammatical sentences is uncontroversial, the patterns arising in grammatical sentences are less clear. There are three logical possibilities: (1c) is exactly as difficult as (1d); (1c) is easier than (1d); (1c) is harder than (1d). Each possibility would support a different class of models, as I will discuss in section 1.2.2.
The first possibility - that the processing difficulty of grammatical sentences does not depend on the number-marking of the lure - is commonly assumed to hold, based on a number of studies (e.g. Wagers et al., 2009; Dillon et al., 2013; Tanner et al.). This pattern is known as the “absence of ungrammaticality illusions” (i.e. there are no cases when people perceive grammatical sentences as degraded). However, this observation may be overly general. Two recent meta-analytic reviews suggest that the literature contains a fair amount of evidence for people being faster in (1c) as compared to (1d). Jäger et al. (2017) review reaction time (self-paced reading, eye-tracking) and electrophysiological studies and find that 12 out of 25 report statistically significant effects in this direction (12 studies report no significant effects and one reports the opposite pattern) [1]. These qualitative conclusions are supported by a quantitative Bayesian meta-analysis of reaction time data: it estimates the speed-up in (1c) at 6.6 ms, with 95% of the posterior probability mass in the interval [-16.2, 3.7] (91% of the probability mass below 0) [2]. Hammerly et al. (draft.april.2018) reach similar conclusions in an informal review of the literature with a somewhat less detailed comparison.
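As a brief aside, the figures quoted above from the Bayesian meta-analysis of Jäger et al. (2017) can be checked for internal consistency with a small calculation. The sketch below assumes that the posterior is approximately normal, which is my simplifying assumption and not something stated in the source; the function name is hypothetical. It converts the width of the 95% interval into a standard deviation and asks how much probability mass such a distribution places below zero.

```python
from statistics import NormalDist

# Values quoted in the text from Jäger et al. (2017), in ms (negative = speed-up).
POSTERIOR_MEAN = -6.6
CRI_LOW, CRI_HIGH = -16.2, 3.7

def approx_prob_below_zero(mean, low, high, mass=0.95):
    """Approximate P(effect < 0), ASSUMING a normal posterior (an assumption of this sketch)."""
    z = NormalDist().inv_cdf(0.5 + mass / 2)   # about 1.96 for a 95% interval
    sd = (high - low) / (2 * z)                # interval width -> standard deviation
    return NormalDist(mean, sd).cdf(0.0)

p = approx_prob_below_zero(POSTERIOR_MEAN, CRI_LOW, CRI_HIGH)
print(f"approximate P(effect < 0) = {p:.2f}")
```

Under the normal approximation the result comes out at roughly 0.90, in the same range as the 91% reported in the text. The small discrepancy is unsurprising, since the reported interval is not exactly symmetric around the mean and the actual posterior need not be normal.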
They review a partially overlapping subset of reaction time studies of agreement attraction; in addition, they also look at interpretation studies. Out of 26 tested instances of agreement attraction configurations3, in 11 cases the authors reported significant effects in both grammatical and ungrammatical sentences. This evidence will be of crucial importance when we choose between possible accounts of agreement attraction in section 1.2.2.

1 I only report the data for configurations with singular subjects, since there is almost no data for configurations with plural subjects.
2 One has to keep in mind that for eye-tracking, only first-pass reading times were entered into the analysis.
3 Most of them correspond to whole studies, but in a few cases, Hammerly et al. (draft.april.2018) count subsets of stimuli in the same study separately.

NPIs Negative Polarity Item (NPI) licensing is another dependency whose formation is arguably prone to interference (Vasishth et al., 2008; Xiang et al., 2008; Parker and Phillips, 2016). NPIs are items like “any”, “ever”, “lift a finger” and others which can only occur in a special linguistic environment. The exact nature of the environment is still debated, but one generalization is that NPIs have to occur in the scope of a downward-entailing element, e.g. negation. It is possible to conceptualize this relation as an item-to-item dependency4: when people encounter an NPI, they look for a downward-entailing element in a structurally higher position, and if they find it, the NPI is licensed. Interestingly for our purposes, sentences with an NPI licensor in an inappropriate structural position (2c) are perceived as better than sentences without a licensor altogether (2b) (Vasishth et al., 2008; Parker and Phillips, 2016).

(2) a. No diplomats have ever supported a drone strike.
    b. *The diplomats have ever supported a drone strike.
    c. *The diplomats that no congressman could trust have ever supported a drone strike.

4 Although this conceptualization is contested, and, as we will see in the discussion of the evidence, this is the reason why NPI data does not provide strong evidence for or against the use of structural information.

Reflexive pronouns I now turn to the evidence which will be most relevant for my thesis - the influence of structurally inappropriate antecedents on the resolution of reflexive pronouns. The evidence is rather mixed. On the one hand, a number of experiments in a variety of methodologies (cross-modal priming, eye-tracking, ERPs) failed to provide robust evidence that lures affect reflexive resolution (cross-modal priming: Nicol and Swinney (1989); self-paced reading: Badecker and Straub (2002) (Exp.5,6); eye-tracking: Sturt (2003); Dillon et al. (2013); Cunnings and Sturt (2014); ERPs: Xiang et al. (2008) among others). On the other hand, some studies do report interference effects (Badecker and Straub (2002) (Exp.4), King et al. (2012a); Cunnings and Felser (2013); Patil et al. (2016); Parker and Phillips (2017); Sloggett (2017)). I will discuss these groups of studies in order.

The first study to consider the influence of lures on reflexive resolution was Nicol and Swinney (1989). They report the results of an experiment relying on the cross-modal priming paradigm. The participants were auditorily presented with sentences like (3). At the moment when people heard the pronominal, a word appeared on the screen, and people were asked to decide whether it was a real English word. The word could be related to one of the potential antecedents or unrelated to any of them.
The logic underlying the experiment is that if, during the course of anaphora resolution, people reactivate an antecedent, words related to that antecedent would be primed, and thus people's reaction times would be faster as compared to the unrelated word conditions. If binding principles are accurately applied, we expect that only words related to “boxer” and “skier” would be primed when people encounter “him” and only words related to “doctor” will be primed when people encounter “himself”. This is exactly what Nicol and Swinney report.

(3) The boxer told the skier that the doctor for the team would blame him / himself for the injury.

These results suggest that people were successfully using structural information to restrict antecedent selection, and a number of subsequent studies fell in line with this conclusion. Sturt (2003) examined sentences like (4) (the examples come from Exp.1 and 2, correspondingly).

(4) a. Jonathan was pretty worried at the City Hospital. The surgeon who treated Jonathan / Jennifer had pricked himself / herself with a used syringe needle.
    b. Jonathan / Jennifer was pretty worried at the City Hospital. He / She remembered that the surgeon had pricked himself / herself with a used syringe needle.

Neither Experiment 1 nor Experiment 2 provided evidence for any detectable effects of the lure in early measures5. Dillon et al. (2013) considered sentences like (5). Agreement conditions (5a) showed a profile indicative of agreement attraction: lure match conditions being read faster than lure mismatch conditions within target mismatch sentences only6. However, no evidence for the impact of the lure was found in the reflexive conditions (5b). King et al. (2012b) report similar results7.

(5) a. The new executive who oversaw the middle manager / managers apparently was / were dishonest about the company's profits.
    b. The new executive who oversaw the middle manager / managers apparently doubted himself / themselves on most major decisions.

5 Experiment 1 did provide evidence for some late effects. First, in re-read times on the reflexive, the target match conditions were read faster than target mismatch conditions within lure mismatch conditions only. Or, to put it another way, target mismatch-lure match conditions were the slowest of all. Second, in re-read times on the pre-final region, lure match conditions were faster than lure mismatch conditions within target match conditions only. A follow-up sentence interpretation experiment suggested that in lure match conditions people provide more answers compatible with the resolution of the dependency to the lure. I have to note, however, that a replication of this experiment by Cunnings and Sturt (2014, Exp.1) failed to find support for the late effects of the lure. It is also worth noting that in both studies the number of participants was rather low, 24 for Sturt (2003) and 28 for Cunnings and Sturt (2014, Exp.1). Thus, the statistical power that these studies had was likely low. Given this, I do not consider this evidence against the absence of lure-match effects in Sturt (2003) as very strong.
6 In this case, “target” refers to the subject of the main clause, and “lure” - to the subject of the embedded clause.
7 This study did report lure match effects in a different configuration, but I will return to them later.

Similarly to Dillon et al. (2013), Xiang et al. (2008) compared NPI licensing to reflexive resolution in sentences like (6) using EEG. NPI conditions did show effects of a structurally inappropriate licensor: the amplitude of the P600 (see footnote 8) was reduced as compared to the sentences with no licensor at all. However, no similar evidence was found for the reflexive sentences: regardless of whether the lure matched or mismatched the reflexive in gender, target mismatch conditions elicited smaller P600.

(6) a. NPI, grammatical: No restaurants [ that the local newspapers have recommended in their dining reviews ] have ever gone out of business9.
    b. NPI, illicit licensor: The restaurants [ that no local newspapers have recommended in their dining reviews ] have ever gone out of business.
    c. NPI, no licensor: Most restaurants that the local newspapers have recommended in their dining reviews have ever gone out of business.
    d. Reflexive, target match: The tough soldier that Fred treated in the military hospital introduced himself to all the nurses.
    e. Reflexive, target mismatch, lure match: The tough soldier that Katie treated in the military hospital introduced herself to all the nurses.
    f. Reflexive, target mismatch, lure mismatch: The tough soldier that Fred treated in the military hospital introduced herself to all the nurses.

8 An event-related potential component often associated with the processing of syntactic violations. Bigger amplitude is taken to be associated with a more severe reaction to the violation.
9 I omit an additional manipulation of NPI licensor type, since it is not relevant for interpreting these results.

While the studies above report consistent evidence for reflexive resolution being accurately guided by structural information, other studies have reached different conclusions. Badecker and Straub (2002) used the feature-mismatch paradigm in a self-paced reading study with sentences like (7). They reported a slow-down in the second region after the reflexive when both the lure and the target matched the reflexive in features.

(7) Jane / John thought that Bill owed himself another opportunity to solve the problem.

King et al. (2012a) compare conditions with verb-adjacent and non-verb-adjacent reflexives to test the following conjecture. Verbs likely reactivate their arguments, including the subject, regardless of the presence of the reflexive.
Reflexive pronouns are often found in direct object position, so if the information about the subject persists in the focus of attention, it might be too prominent to allow any effects of the lure to be detected. If this conjecture is right, we will only observe the effects of the lure in the latter (non-verb-adjacent) case. This is what King et al. report10. Only for non-adjacent reflexives were lure match conditions read faster than lure mismatch conditions within target mismatch conditions.

(8) a. Verb adjacent: The mechanic who spoke to John / Mary sent himself / herself a package.
    b. Non verb adjacent: The mechanic who spoke to John / Mary sent a package to himself / herself.

10 A caveat is in order: as far as I know, these results have never been published as a research paper. The information is available as a poster and a conference abstract, but the amount of statistical and numerical detail is obviously much smaller than what would be possible in a paper. Thus, the interpretation presented here is correct to the degree that I interpret this limited information correctly.

Cunnings and Felser (2013) investigated whether reflexive resolution is affected by participants' working memory characteristics. They used eye-tracking with the feature mismatch paradigm, and additionally median-split their participants into two groups according to their working memory capacity (as assessed by the Daneman-Carpenter test (Daneman and Carpenter, 1980)). In two experiments they found limited evidence for the influence of the lure. In Experiment 1, it was observed at the pre-final region for both memory groups in first fixation duration; in Experiment 2, only participants in the low-working-memory group were affected by the lure (but the effect was noticeable earlier, at the reflexive itself). An eye-tracking study by Patil et al.
(2016) reports the influence of lure match on first-pass regression probability11, observing more regressions in lure match conditions as compared to lure mismatch conditions within target match only.

11 I.e. the probability that the eyes will move to earlier regions during the first pass on the current region.

Finally, Parker and Phillips (2017) (henceforth PP2017) and Sloggett (2017) (henceforth S2017) provide perhaps the clearest evidence of lure-match effects in reflexive resolution. PP2017 rely on the same feature mismatch manipulation as most of the previous studies, with an additional twist: they manipulate the degree to which the target mismatches the reflexive. In all of the previous studies, the target either fully matched the reflexive in morphological features, or mismatched it in one morphological feature. Parker and Phillips additionally considered conditions where the target mismatches the reflexive in two morphological features, as exemplified in (9). They found that people were much faster to read the reflexive if it matched the features of the lure, but only in the 2-feature target mismatch conditions. The authors considered target mismatch in all possible two-feature combinations of gender, number and animacy and found similar effects for all of them.

(9) The talented actor/actress mentioned that the attractive spokeswomen praised himself for a good job.

I will slightly delay the discussion of the interpretation that PP2017 give. I will only note that according to their account, a mismatch in two features between the reflexive and the target is crucial for observing lure-match effects. Therefore it is not surprising that a large proportion of previous studies failed to observe them: all of them used 1-feature target mismatch configurations.

S2017 replicates most of PP2017's findings.
Moreover, he demonstrates that the lure match effect in PP2017 configurations depends not only on the degree of feature match between the reflexive and the potential antecedents, but also on discourse factors. First, he shows that the identity of the embedding verb is important - in sentences like (10a) lure-match effects were only observed with report verbs like “say”, but not with perception verbs like “hear” (Experiment 1b). Second, he shows that the identity of the target is important - lure-match effects do not appear if the target is an indexical pronoun like “I” or “you” (10b) (Experiments 3b, 4b). Finally, in an interpretation study (Experiment 1c), he demonstrates that lure-match effects are potentially detectable even in grammatical sentences. In this experiment people were presented with sentences like (10c) in a self-paced reading manner, and at the end of each sentence had to answer a comprehension question. For grammatical sentences like those in (10c), the question probed the interpretation of the reflexive, e.g.: “Who was misrepresented at the meeting? The librarian / the schoolgirl”. People chose answers compatible with non-local resolution (e.g. “The librarian”) roughly 30% of the time, despite the sentence being fully grammatical12.

(10) a. The librarian / janitor said / heard that the schoolboys misrepresented herself at the meeting.
     b. The actor/actress said that Joanna / I horribly misrepresented herself in the article.
     c. The librarian / janitor said / heard that the schoolgirl misrepresented herself at the meeting.

12 While this may, indeed, indicate that target mismatch is not a necessary pre-condition for lure-match effects to arise, we have concerns about how the data from S2017 Experiment 1c should be interpreted. We discuss these concerns later in the thesis.
1.2 Evidence interpretation

The evidence reviewed above suggests that at least sometimes grammatically irrelevant constituents do affect real-time dependency formation. This information in itself, however, is not sufficient to claim that grammatical constraints fail to uniquely guide the parser. In this section we will review three types of accounts of how lure-match effects come to be. We will call an account “structure-strict” if it suggests that structural information can categorically rule out illicit dependency formation, and “structure-defeasible” otherwise. The evidence above could be used to support “structure-defeasible” accounts only if we manage to argue that “structure-strict” accounts fit the evidence less well or should be dispreferred on theoretical grounds. We will show that such an argument could be made. We will then discuss potential counter-arguments, and this will lead us to the main goal of the thesis - evaluating these counter-arguments in order to understand whether our original argument for “structure-defeasible” models should or should not be abandoned.

1.2.1 Accounts of lure-match effects

The first class of accounts can be called “representational”. They come in different flavors (e.g. Nicol et al., 1997; Vigliocco and Nicol, 1998; Franck et al., 2002; Eberhard et al., 2005), but they all share the assumption that lure-match effects arise because the syntactic representation of the target is distorted: e.g. the whole phrase receives its number marking not from its head but from the complement. Interference effects arise when a correctly functioning access mechanism contacts an invalid linguistic representation. Thus, representational accounts have to be “structure-strict” - they assume that structure perfectly guides the search for the target constituent.

A second type of accounts could be called “grammatical” (e.g. Xiang et al., 2013; Sloggett, 2017).
These accounts basically suggest that lure match effects are not a result of a processing error; rather, they are a manifestation of grammatical strategies licensed by the language. Xiang et al. (2013) suggest an account of this sort for the NPI data. They argue that constraints on the dependency licensor should not always be analyzed as “be in the c-command domain of a downward-entailing expression”. Rather, the grammar also provides an alternative licensing path, via pragmatic inferences from the context. In certain cases this strategy could lead to spurious inferences, which would lead to illusory NPI licensing. Sloggett (2017) suggests this type of account for lure-match effects in reflexives, arguing that they should be analyzed as instances of logophoric behavior of the anaphors. We will discuss this account in much more detail in a later section. These accounts are also “structure-strict”: if one wants to use grammatical principles as the explanation, one has to assume that they are applied accurately.

Finally, a third class of accounts could be called “memory” accounts: they assume that the errors arise at the level of memory operations underlying language comprehension (Lewis and Vasishth, 2005; McElree, 2006; Van Dyke and McElree, 2006; Engelmann et al., draft4; Jäger et al., 2017; Parker et al., 2017). The most common assumption is that memory access is faulty and sometimes the parser may access constituents not licensed by the grammar. This amounts to “structure-defeasibility” in my terms. However, memory models are not inherently “structure-defeasible” - one could modify model parameters so that the parser would be predicted to only access grammatically licit constituents. Such models would not predict lure-match effects arising from faulty memory access. One could hope that such models can be easily ruled out simply by the fact that lure-match effects are empirically attested.
Unfortunately, lure-match effects can arise even in “structure-strict” memory models, since memory access is not the only place where things could go wrong. Another such place is encoding. It might be the case that when one item needs to be entered into memory and another item with similar features is already present there, the encoding for one or both items may get degraded, perhaps because it is difficult to keep distinct bindings for similar features (Nairne, 1990; Vasishth et al., 2017). The degraded representation will be more difficult to access during retrieval, causing a slow-down. Thus, similarity between the target and the lure might affect RTs even if retrieval operates completely faithfully13. Originally, this argument was only applied to cases of inhibitory interference: a slow-down when the lure matches the target in features (Dillon et al., 2013; Chow et al., 2014). However, recently Patil et al. (2016, p.17) and Parker et al. (2017, p.129) pointed out that facilitatory interference (a speed-up when the lure matches the features on the other end of the dependency, but not on the target) does not have to stem from the retrieval stage either. For example, if representations can be degraded during the encoding stage, they will be degraded more in (1a) than in (1b), since in the former case the two nouns have a greater overlap in features. Thus, even if memory access is always accurate and the target is reliably retrieved in all cases, it will take longer in (1a), creating a profile of reaction times consistent with facilitatory interference.

13 Notice that in this case the dependency formation is completely licensed by the grammar, but is made more difficult by extra-linguistic factors.

(1) a. *The key to the cabinet are on the table.
    b. *The key to the cabinets are on the table.
    c. The key to the cabinets is on the table.
    d. The key to the cabinet is on the table.
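
The overlap argument above can be made concrete with a toy count. What follows is a minimal sketch, not an implementation of any of the cited models: the two-feature inventory (`category`, `number`) and the helper `feature_overlap` are assumptions introduced here purely for illustration, and the sketch takes no stand on how overlap translates into degradation or reading times.

```python
# Toy illustration only: the feature inventory and the overlap measure
# are assumptions made for this sketch, not part of any cited model.

def feature_overlap(a: dict, b: dict) -> int:
    """Count the features on which two encoded items carry the same value."""
    return sum(1 for name in a if name in b and a[name] == b[name])

# Hypothetical encodings of the nouns in example (1).
key = {"category": "N", "number": "sg"}
cabinet = {"category": "N", "number": "sg"}
cabinets = {"category": "N", "number": "pl"}

overlap_1a = feature_overlap(key, cabinet)    # head noun vs. lure in (1a)
overlap_1b = feature_overlap(key, cabinets)   # head noun vs. lure in (1b)

print(overlap_1a, overlap_1b)
```

Under these assumed encodings the two nouns share more features in (1a) than in (1b), which is the only property the encoding-based explanation needs.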
Another possible locus of lure-match effects in memory models is in post-retrieval operations, e.g. repair. Different types of models assume different memory dynamics. The Lewis and Vasishth (2005) model, which we will discuss in detail later, assumes that an item's prominence in memory determines both how likely it is to be faithfully recovered and how quickly this item can be made available for processing. Thus this model can account for reaction time differences directly; if needed, repair processes could be assumed, but without having a good reason for doing so, a parsimonious Lewis and Vasishth-type model could make do without them. On the other hand, in the McElree (2006) framework access times are always constant: an item's prominence only determines the probability of successfully accessing the item. If this model is closer to the truth, RT differences would have to be produced by some other mechanism. One possibility would be that longer reaction times stem from re-analysis/repair processes. E.g. if retrieval fails, a second retrieval could be initiated to compensate for that. The slow-down due to additional retrievals could potentially account for lure-match effects14. Notice that in this case, RTs again will not necessarily tell us anything useful about how structural information is used to constrain real-time processing: repair processes might involve a different set of processing routines/constraints15. Another possibility is that the strength of the recovered representation might matter: even if any item can be retrieved equally fast, some will be recovered more faithfully, and the processes downstream from memory retrieval may be sensitive to these differences.

14 Consider the following possibility. As we discuss in section 1.3, cue-based models assume that when a memory item has to be accessed, a set of retrieval cues is specified; all items are matched in parallel to these cues, and receive a boost in prominence for each matching feature; the most prominent item is then accessed with some probability of success. Now, the lure would be more prominent in (1b) than in (1a) because it would match the number of the verb, which presumably would be a part of the set of retrieval cues. Fewer re-analysis operations would be initiated in (1b), potentially leading to a speed-up (but see Nicenboim and Vasishth (2018), who suggest it is not clear how repair processes would even work in ungrammatical sentences and how the RT profiles in those would be accounted for by the McElree model).
15 Of course, we might argue that repair is a normal part of everyday language use and thus identifying the constraints involved is useful. Still, one needs to be careful: if we are talking about “first-pass” processing that just directly leads to interpretation, the set of memory mechanisms we can “blame” for errors is narrower and better defined. If we are talking about repair, the spectrum is much wider: it may be that repair relies on the “usual” memory access procedures with some minor tweaks. Or maybe the “usual” memory access is used but with a completely different set of parameters. Or maybe qualitatively different mechanisms are involved. Without clearly specifying our theory of repair, it would be hard to figure out what exact conclusions we should draw from our RT data.

1.2.2 Choosing an account

We have looked at three groups of accounts: representational, grammatical, and memory. As we have seen, the first two groups of accounts have to assume that structural information can successfully constrain dependency resolution.
Memory accounts can be either “structure-strict” or “structure-defeasible”, depending on the assumptions one makes about the access properties and the source of behavioral evidence: encoding, retrieval, repair. Only if we can unambiguously attribute the observed lure-match effects to retrieval processes will we be able to make a claim about the “structure-defeasibility” of normal real-time sentence processing. In this section we will discuss why one might, indeed, decide that memory retrieval processes stand behind lure-match effects, and why, therefore, one would want to believe the “structure-defeasibility” thesis.

Ruling out representational accounts The biggest argument against representational models is empirical: it is the absence of illusions of ungrammaticality in agreement attraction (representational accounts were historically developed to deal with agreement data) (Wagers et al., 2009). In these accounts, the distortion of the target NP happens phrase-internally due to morphosyntactic mechanisms (e.g. erroneous feature percolation from the head level to the phrase level) and independently of other factors such as sentence well-formedness. Thus, it is as likely to happen in grammatical as in ungrammatical sentences, and we should commonly observe cases in which a perfectly well-formed sentence is perceived as ungrammatical. However, such cases are rarely reported.

Another potential argument against representational accounts is that they are not easily extendable to cover other dependencies which exhibit similar empirical behavior (NPI licensing or reflexive resolution). In contrast, memory models can explain multiple phenomena in terms of the same underlying mechanisms, and thus may be preferred, if they are comparable in terms of empirical coverage.
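
To make “the same underlying mechanisms” concrete, here is a minimal sketch of how a single cue-matching routine could be applied to two different dependencies. It is an illustration under simplifying assumptions, not an implementation of any cited model: the function `match_score`, the feature names (`subject`, `local`, `c_command` and so on), the feature values assigned to each noun, and the equal weighting of cues are all hypothetical, and retrieval is reduced to counting matching cues.

```python
from typing import Dict

Features = Dict[str, str]

def match_score(item: Features, cues: Features) -> int:
    """Number of retrieval cues the memory item matches (equal weights assumed)."""
    return sum(1 for cue, value in cues.items() if item.get(cue) == value)

# Subject-verb agreement, as in (1a)/(1b): hypothetical cues set by the verb "are".
verb_cues = {"subject": "+", "number": "pl"}
key = {"subject": "+", "number": "sg"}        # target (head noun)
cabinet = {"subject": "-", "number": "sg"}    # lure in (1a)
cabinets = {"subject": "-", "number": "pl"}   # lure in (1b)

# Reflexive resolution, as in (9): hypothetical cues set by "himself".
# The structural feature values below are assumptions made for this sketch.
reflexive_cues = {"local": "+", "c_command": "+", "gender": "masc", "number": "sg"}
actor = {"local": "-", "c_command": "+", "gender": "masc", "number": "sg"}
actress = {"local": "-", "c_command": "+", "gender": "fem", "number": "sg"}

agreement_scores = {
    "key": match_score(key, verb_cues),
    "cabinet": match_score(cabinet, verb_cues),
    "cabinets": match_score(cabinets, verb_cues),
}
reflexive_scores = {
    "actor": match_score(actor, reflexive_cues),
    "actress": match_score(actress, reflexive_cues),
}

print(agreement_scores)
print(reflexive_scores)
```

In both cases the lure's score rises when it matches the features carried by the retrieval cues, which is the sense in which one routine can cover both dependencies. How such scores map onto retrieval probabilities or latencies is deliberately left open here.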
A priori this may seem counter-intuitive: since we know that different linguistic dependencies have qualitatively different constraints on them, wouldn't we expect them to behave qualitatively differently during processing as well? I argue that while this is possible, it is not a logical necessity.

It is inevitable that at some point during language processing linguistic constraints have to be translated into processing instructions. I argue that it may be advantageous to assume that this translation process is very straightforward at the interface: every linguistic constraint is turned into a retrieval cue16. However, from this point onwards what happens with these cues is determined only by the properties of the memory system: i.e. it is agnostic about where the cues came from and just knows how to perform operations with them. Consider anaphora as an example. Under Chomskyan Binding theory, an anaphor should be bound by a local (roughly, a clause-mate) and c-commanding antecedent. According to our hypothesis, these specifications would be converted to retrieval cues like “+ local” and “+ c-commands current NP”17. Another set of constraints - semantic or perhaps pragmatic - requires that the anaphor match its antecedent in φ-features; correspondingly, the φ-features of the anaphor would be translated to retrieval cues as well (e.g. + sg, + masculine etc.). The memory system would then use these cues like any others in order to determine which memory chunk should be retrieved during reflexive resolution. Notice that in this case the parser's failures to be uniquely guided by linguistic constraints are due to either the translational procedure used at the interface, or the properties of the general purpose memory system. Correspondingly, we would expect to observe reasonably similar effects across multiple dependencies, which arguably corresponds to actual empirical observations. But even if we question the degree of this alignment, such a unified system is theoretically useful: it allows us to easily generate precise predictions for a range of linguistic phenomena and suggests what types of empirical observations we should be looking for. This could make it easier to discover new things.

In addition, a model with a unified treatment of different dependencies may be preferred because it assumes that linguistic constraints straightforwardly map onto general purpose mechanisms. As soon as we start making additional assumptions connected to the linguistic nature of the representations - e.g. that different levels of representation (syntax, pragmatics etc.) may have different access procedures or that different dependencies rely on qualitatively different mechanisms - we introduce an additional burden of explaining why such a division exists18.

16 See section 1.3.
17 Real constraints will have to be more complicated. The actual definition of locality domain is actually more involved than the simplification we used, and as we discuss later, encoding c-command information in terms of cues is notoriously difficult.
18 Notice that PP2017's account of lure-match effects in reflexives (discussed in Section 1.3.1), while largely keeping to the unificational spirit of memory models, already violates the “pure mapping” hypothesis by assuming different weighting schemas for structural cues in subject-verb agreement vs. reflexive resolution.

Ruling out grammatical accounts Grammatical accounts are somewhat harder to rule out. We can think of two arguments against them. First, the argument for preferring memory models we made above is even more relevant for grammatical models: they explicitly state that linguistic constraints affect processing in a non-trivial way (perhaps invoking qualitatively different resolution mechanisms and/or memory access routines) - as we argued, such an approach may be undesirable as a working hypothesis. Second, the empirical coverage of grammatical models is narrower than that of memory models. As far as we are aware, there are no grammatical accounts of agreement attraction - it is quite unequivocally considered a processing error, and not a manifestation of some not-so-well-described grammatical strategy. Now, if we have accepted the first argument, the fact that at least one phenomenon is a result of a processing error would force us to assume that other lure-match effects stem from the processing side of things as well. These arguments are certainly not rock-solid, but for the moment we will rely on them.

Ruling out non-retrieval memory accounts Ruling out the possibility that lure-match effects stem from the encoding stage is hard to do once and for all, unless we have very fine-grained methods of investigating the contents of memory. However, Jäger et al. (2015) show that at least in some cases the “encoding degradation” account is not supported. The study looked at the German reflexive pronoun “sich” in configurations like (11). Since “sich” is unmarked for gender, a reasonable assumption would be that this feature is not included in the set of retrieval cues during the search for the antecedent. If inhibitory interference results from encoding processes, this will not matter and we will observe inhibition in (11) when the two nouns match each other in gender. If, on the other hand, inhibitory interference arises due to retrieval dynamics, we should not observe any effect of gender match between the two potential antecedents. This is indeed what Jäger et al. report. This lack of the effect was observed in both self-paced and eye-tracking reading times. The study used unusually large sample sizes (around 150 participants in each of the experiments), which gave them roughly 90% power to detect an effect of 20 ms with a standard deviation of 75 ms. Thus, the lack of evidence for inhibitory interference is quite reliable.

(11) a.
Der Diebi/Die Diebini, dem/der der Hehlerj/die the thief-MASC/the thief-FEM whom the dealer-MASC/the Hehlerinj befohlen hat zu stehlen, hat überraschenderweise sichi/∗j dealer-FEM obliged has to steal has surprisingly self und die Kollegen angezeigt, berichtete das Hochglanzmagazin. and the colleagues denounced reported the magazine. The thief whom the dealer obliged to steal surprisingly denounced him- self/herself and the colleagues, reported the magazine. Jäger et al. (2015) results suggest that interference found in the reading times from feature mismatch paradigm is likely due to retrieval processes. Still, the study only shows that in some cases encoding processes are an unlikely culprit, and completely ruling out this possibility or determining the range of phenomena for which it holds may be quite difficult. 25 Whether repair explanations should or should not be considered is not clear at this point. Choosing between Lewis and Vasishth (2005) model, which does not crucially involve repair, and McElree (2006) model, which does, is complicated. On the one hand, constant speed of memory access, which is only captured by McElree (2006) framework has quite extensive empirical support (McElree et al., 2003; McElree, 2006; Martin and McElree, 2009, 2011). On the other hand, Lewis and Vasishth (2005) model is straightforwardly predicting facilitatory interference RT profile (as observed in agreement attraction), and McElree (2006) model has problems with it. Nicenboim and Vasishth (2018, p.21) suggest that McElree model would not be able to predict it at all, since it is unclear how exactly repair processes would operate in ungrammatical sentences 19. For the predictions in which the models can be compared directly (inhibitory interference in target match configurations), they appear to be tied. 
Nicenboim and Vasishth (2018) directly compare the fit of (particular implementations of) the Lewis and Vasishth and McElree models to a dataset coming from a study on agreement attraction in target match configurations (Nicenboim et al., 2017). They found that the McElree model fit the data somewhat better than a simple variant of the Lewis and Vasishth model. However, only a slight complication to the Lewis and Vasishth model allows it to fit the data on par with the McElree model.

[19] We think that the following suggestion might work, although it does rely on certain assumptions about repair, and unless we are able to independently validate those assumptions, this suggestion remains purely speculative. Suppose that in target mismatch configurations the system initiates repair quite often, since neither the target nor the lure matches all of the retrieval cues. Suppose also that fewer repairs will be attempted in lure match conditions, since the error-detection mechanism will be less capable of noticing the violations. Then, in the long run, more time will be spent trying to repair the representation in lure mismatch conditions, giving rise to facilitatory time patterns. For this schema to work, we need to assume that the ease or difficulty of detecting a mis-retrieval is contingent on the number of matching features - i.e. repair detection works in the same way as search result detection. This assumption raises a question: why is feature mismatch not detected during the original search, but only at a later stage? One possible answer could be that the retrieval system only “cares” about finding an item without further knowledge of what has to be done to it, so it may not incorporate on-the-fly sensibility checks. On the other hand, the repair detection mechanism may specifically depend on how smoothly the retrieved item can be integrated in the representation, so it may be more sensitive to such mismatches.
Thus, at least for target match configurations, neither model is clearly preferred.

Finally, a principled way of distinguishing repair from “first-pass” processes regardless of the model is to look at the time course of lure-match effects. One might argue that an effect appearing early is unlikely to stem from repair processes, unless one assumes they can be really quick. From this point of view, eye-tracking might be the only widely used method allowing for such distinctions: self-paced reading only gives data about total reading time per region; EEG data, while having very high temporal resolution, may be hard to interpret as stemming from a particular time point in the processing. If one accepts this argument, only the studies which find lure-match effects in early eye-tracking measures (such as first fixation or first-pass time) should be taken as evidence on memory search processes.

To sum up my discussion of memory models, lure-match effects can be explained in at least three ways: as stemming from encoding, search or repair mechanisms. Choosing between these possibilities is not always straightforward. We have some evidence against the encoding account (Jäger et al., 2015), but it is rather limited. Distinguishing between accounts with and without repair as their necessary component has so far been complicated, with no clear winner available. In what follows, we will simplistically assume that any RT differences we observe stem from memory access processes, but one should keep the issues we have just discussed in mind in order not to become overconfident in the conclusions we draw.

1.2.3 Evidence interpretation: summary

In the last several sections we have been discussing whether we have reasons to choose “structure-defeasible” accounts over competing solutions. We presented the following argument for doing so. Memory accounts arguably provide the best explanation for subject-verb agreement data [20].
There are no grammatical accounts of agreement attraction, and representational accounts are ruled out by the absence of ungrammaticality illusions in subject-verb agreement studies. If we settle on memory accounts, “structure-defeasible” flavors have to be chosen, because otherwise memory accounts would not be able to explain agreement data either. Now, due to the universality of these accounts, a priori it would be tempting to extend them to other linguistic phenomena such as anaphora resolution and NPI licensing. To keep the models’ universal coverage, one would also be tempted to preserve the “structure-defeasible” assumption we made for agreement. This is essentially the line of explanation pursued in Vasishth et al. (2008) and Parker and Phillips (2017).

[20] At least in comprehension.

This line of argumentation is undercut in two places by several recent studies. First, it has been argued that the generalization about the absence of ungrammaticality illusions is not empirically accurate. Jäger et al. (2017) provide evidence for the existence of ungrammaticality illusions based on the results of a quantitative meta-analysis. Hammerly et al. (draft.april.2018) provide corroborating findings and suggest that the existing generalization is just an artifact of the experimental method. If true, this claim would take away our reasons to prefer memory models to representational ones for the agreement data (and, relevant to the main topic of my thesis, “structure-defeasible” to “structure-strict” models). This will have repercussions for our accounts of other dependencies as well: if we do not have to pick “structure-defeasible” models for agreement, we have less ground to a priori assume that other dependencies should be accounted for by “structure-defeasible” models, regardless of the particular architecture we use. Indeed, Sloggett (2017) explains lure-match effects in reflexives with a memory model of the “structure-strict” type.
It is also the case that these models lose the universality of cue-based models and are very specific. It is not clear whether the Hammerly et al. (draft.april.2018) model can be extended to other dependencies apart from agreement, and it is clear that the Sloggett (2017) model cannot - by design, it is reflexive-specific.

Assessing (some of) these recent counter-suggestions is the main goal of this thesis. To start addressing this problem, in the following section I will discuss a prominent model of cue-based parsing (Lewis and Vasishth, 2005) and show how it can be used to accommodate (a big part of) the empirical evidence. Then I turn to the recent problematic findings and show in more detail how they might undermine the argument for preferring memory models.

1.3 Lewis & Vasishth (2005) memory model

The central assumption behind cue-based memory models used in psycholinguistic research is that human working memory access is content-addressable and parallel. That is, the items in memory are located by their contents (and not, say, their location) and can be investigated in parallel without the need to go through them one by one. In such an architecture, it is hard or impossible to avoid interference - when multiple items are similar enough in terms of their content, it is harder to select the correct one. Therefore, if long-distance dependency resolution does rely on such a system, we can expect processing mistakes similar to those I have reviewed above.

As I have already mentioned, cue-based parsing is theoretically attractive. First, there is independent evidence that domain-general working memory is content-addressable (see Jonides et al., 2008, for a review of evidence and arguments). Thus, a theory of long-distance dependencies relying on content-based memory would not postulate any language-specific mechanisms. Second, in cue-based models the dynamics of memory access remain the same regardless of the specific cues one is using.
Thus, it might be possible to analyze the processing of different linguistic dependencies in terms of the same underlying mechanisms.

In what follows, I will introduce a popular model of cue-based parsing by Lewis and Vasishth (2005) (henceforth, LV05), which I will rely upon in this thesis. I will start with discussing the model itself; then, I will turn to showing how it can account for the data presented above; finally, I will briefly discuss its shortcomings and a few caveats.

In the parsimonious spirit of the discussion above, the LV05 model is couched within ACT-R, a general purpose computational model of cognition (Anderson, 2005). According to the model, memory items are represented as bundles of features (for linguistic purposes, it could be “+ NP”, “+ subj”, “+ animate” etc). Items in memory are not directly available for operations on them - they first have to be “retrieved”, i.e. put in a special prominent state. I will call such a state “focus of attention”. The capacity of the focus of attention is assumed to be extremely limited [21]; thus, in order to be able to carry out any remotely complicated procedure, e.g. parsing, memory items have to be constantly shunted in and out of the focus of attention.

[21] The exact capacity of the focus of attention is debated. Some proposals suggest that it may only be able to hold four chunks (e.g. Cowan, 2001), while others claim that the capacity may be even smaller, one or two chunks at most (McElree et al., 2003). ACT-R assumes that the system possesses a number of specialized “buffers”, each with a capacity of one memory chunk. Of interest to us are two buffers: the goal buffer, in which the set of retrieval cues is placed (see the discussion), and the retrieval buffer, which holds the memory access result. Only the item in the retrieval buffer is assumed to be accessible for further operations.

In order to put an item in the focus of attention, the system needs to find the item which is most prominent for the current needs. In the model, an item’s prominence is represented as a quantity called “activation”, which is different for each item. This quantity is determined by the history of the item’s use, the current search needs and stochastic noise, and is formalized in Eq.1.1. During memory access, the system attempts to boost the activation of the relevant item above a certain threshold. In order to do this, it forms a set of retrieval cues [22], which is compared to the features of each memory item in parallel. Each item gets an activation boost proportional to the number of features overlapping with the set of retrieval cues. Formally, $S_{ji}$ is the strength of association between item $i$ and cue $j$, calculated as shown in Eq.1.2. Each cue has only a limited amount of activation it can “share”, denoted by $S$. Thus, if multiple items match a given cue, the activation boost for each item will be reduced. This corresponds to the intuition that the more items have a certain feature, the less useful this feature is to distinguish between them (“fan effect”, Anderson and Reder (1999)). The activation conferred by each matching cue is additionally weighted by $w_j$. Usually, all retrieval cues are assigned equal weights, but it does not have to be the case.

[22] Which is put in the goal buffer.

$$A_i = B_i + \sum_j w_j S_{ji} \qquad (1.1)$$

$$S_{ji} = S - \ln(fan_{ji}), \qquad fan_{ji} = 1 + items_j \qquad (1.2)$$

The boost from retrieval cues determines the item’s activation only partially. Items are assumed to have a fluctuating baseline activation ($B_i$ in Eq.1.1), which is influenced both by how frequently the item was retrieved in the past and by how recent the last retrieval was. Moreover, the process is assumed to be stochastic, so that activation values are not perfectly predictive of whether the item will be retrieved.
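The way Eq.1.1 and Eq.1.2 produce the fan effect can be illustrated with a minimal sketch. It is not the LV05 implementation: the value of $S$, the zero baseline, the equal cue weights, the feature labels, and the choice to let a non-matching cue contribute nothing are all assumptions made for illustration, and $items_j$ is taken to be the number of memory items matching cue $j$.

```python
import math

def cue_strength(max_strength, n_matching_items):
    """Eq. 1.2: S_ji = S - ln(fan_ji), with fan_ji = 1 + items_j."""
    fan = 1 + n_matching_items
    return max_strength - math.log(fan)

def activation(item, memory, cues, max_strength=1.5, weights=None, baseline=0.0):
    """Eq. 1.1: A_i = B_i + sum_j w_j * S_ji, summed over the cues the item matches.

    Assumption: a cue that the item does not match contributes nothing.
    """
    total = baseline
    for cue in cues:
        if cue not in memory[item]:
            continue
        # items_j: how many items in memory carry the feature named by this cue
        n_matching = sum(1 for features in memory.values() if cue in features)
        weight = 1.0 if weights is None else weights[cue]
        total += weight * cue_strength(max_strength, n_matching)
    return total

# Retrieval cues set by a singular verb (hypothetical feature labels).
cues = ("subj", "sg")

# "The key to the cabinet is ...": the lure shares the number feature with the target.
lure_shares_number = {"key": {"subj", "sg"}, "cabinet": {"sg"}}
# "The key to the cabinets is ...": the lure does not share it.
lure_differs = {"key": {"subj", "sg"}, "cabinets": {"pl"}}

a_shared = activation("key", lure_shares_number, cues)
a_distinct = activation("key", lure_differs, cues)
print(f"target activation, lure shares number: {a_shared:.3f}")
print(f"target activation, lure differs:       {a_distinct:.3f}")
```

Under these assumptions the target receives less activation when the lure also matches the number cue, because the activation available from that cue is divided over more items.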
The probability that the item will be retrieved on a given access attempt is determined by the assumed distribution of the noise and is described by Eq.1.3, where $A_i$ is the activation value for item $i$, $\tau$ is the retrieval threshold, and $s$ is a parameter controlling noise. If the activation for the item does cross the threshold, it is retrieved (i.e. made available for further operations on it) with latency described by Eq.1.4. $F$ is a scaling constant, which varies from model to model.

$$P_i = \frac{1}{1 + e^{-(A_i - \tau)/s}} \qquad (1.3)$$

$$T_i = F e^{-A_i} \qquad (1.4)$$

1.3.1 Explaining RT patterns

The LV05 model can successfully explain grammaticality illusions across multiple dependencies. Essentially, a variant of the following (simplified) explanation can be used: when a perfectly matching target is not present in the sentence, an imperfectly matching lure can occasionally be retrieved, leading to faster reading times. Notice that this explanation is “structure-defeasible”, since the lures are by design located in a structurally inappropriate position; so the fact that they are retrieved at all would mean that the structure fails to constrain the memory access properly.

I will use subject-verb agreement to explain the mechanism in detail and then will show how it could be applied to other dependencies. Consider the ungrammatical sentences from (1), repeated below for convenience. In (1a) the target is a better match to the retrieval cues: it matches the “+subj” (or “+ Nominative”) cue, while the lure matches none. On the other hand, in (1b) the target and the lure are each an equally good match [23]. In this situation, response times will be on average faster in (1b) due to statistical facilitation (Raab, 1962).
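The retrieval equations and the race logic behind statistical facilitation can be sketched as follows. This is a simplified illustration, not the LV05 implementation: Gaussian activation noise stands in for the noise distribution assumed by ACT-R, the retrieval threshold is ignored in the race, and all activation values and parameters are hypothetical.

```python
import math
import random

def retrieval_probability(activation, threshold, noise):
    """Eq. 1.3: P_i = 1 / (1 + exp(-(A_i - tau) / s))."""
    return 1.0 / (1.0 + math.exp(-(activation - threshold) / noise))

def retrieval_latency(activation, scale):
    """Eq. 1.4: T_i = F * exp(-A_i); higher activation gives faster retrieval."""
    return scale * math.exp(-activation)

def mean_race_time(target_act, lure_act, noise_sd=0.3, scale=0.4,
                   n_trials=20000, seed=1):
    """Mean latency when target and lure race and the faster one is retrieved.

    Assumption: Gaussian noise is added to each activation on every trial.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        t_target = retrieval_latency(target_act + rng.gauss(0, noise_sd), scale)
        t_lure = retrieval_latency(lure_act + rng.gauss(0, noise_sd), scale)
        total += min(t_target, t_lure)
    return total / n_trials

# (1a)-like case: only the target matches a cue, so the lure rarely wins the race.
mean_1a = mean_race_time(target_act=0.8, lure_act=0.0)
# (1b)-like case: target and lure match equally well, so either can finish first.
mean_1b = mean_race_time(target_act=0.8, lure_act=0.8)
print(f"mean latency, lure mismatch (1a-like): {mean_1a:.4f}")
print(f"mean latency, lure match    (1b-like): {mean_1b:.4f}")
```

With these illustrative values the mean latency is lower in the (1b)-like case, even though neither item is retrieved faster on its own than the target in the (1a)-like case.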
Roughly speaking, if we assume that there are two reaction time distributions, one for the target and the other one for the lure, and we generate the observed reaction time by drawing a value from each distribution and choosing the smallest one [24], the mean of the distribution of observed RT values will be smaller than the mean of either of the sampling distributions [25].

[23] Assuming that the cues are treated by the system as equally important, i.e. assigned equal weights.
[24] I.e. in this context - the fastest RT.
[25] Unless the two sampling distributions do not overlap at all. In the context of the model, we assume that the overlap is bigger in (1b) than in (1a), because the activation values for the lure and the target are closer to each other in (1b), and closer activation values map to more similar retrieval times in ACT-R.

(1) a. The key to the cabinet are on the table.
b. The key to the cabinets are on the table.

A big selling point for the LV05 model is that it is assumed to be able to explain the absence of ungrammaticality illusions. Wagers et al. (2009) suggested that it could be captured in the following way: since in grammatical sentences the target always matches the retrieval cues better than the lure, it will always be reliably retrieved.

NPI cases work similarly, with the difference that in the lure-match sentences there is no licit target at all. For example, the following explanation is advanced by Vasishth et al. (2008). When people encounter an NPI, they initiate retrieval with “c-commander” [26] and “negative” as cues. In (2c) “no congressman” is indeed a negative element, thus it provides a partial match to the retrieval cues and may be retrieved on a proportion of trials.

[26] As we discuss in section 3.1, implementing the “c-commander” cue may not be straightforward. Vasishth et al. choose to assume that such a cue can be implemented and abstract away from the question of how exactly.

Finally, lure-match effects in reflexive resolution can be explained in very much the same way as they are for subject-verb agreement (Parker et al., 2017). The only difference is that for reflexives structural cues like “+ local” or “+ c-commander” are weighted higher than morphological cues like “+ sg”. During memory access, the mismatch in a single morphological feature is not enough to make the activation for the target sufficiently low to approach the activation for the lure. But the mismatch in two morphological features suffices to do this, and the system finds itself in the situation I described earlier for subject-verb agreement, with the target and the lure having comparable activation values. Such a mechanism would explain why the majority of the previous studies did not find evidence for lure-match effects: all of them used designs in which the target mismatched the reflexive in at most one morphological feature.

Prima facie, lure-match effects reported in some of the previous studies (Badecker and Straub, 2002; Patil et al., 2016; Cunnings and Felser, 2013; King et al., 2012a) would be problematic for the PP2017 account, since they were observed in configurations other than 2-feature mismatch. However, most of these findings can be explained away.

Inhibitory interference findings (i.e. lure match leading to slow-down) reported by Badecker and Straub (2002) (Exp.4), Patil et al. (2016) and Cunnings and Felser (2013) (Exp.1) could be discounted on multiple grounds. First, one could argue that these findings reflect the fan effect (cf. the discussion above for agreement). Second, they could be dismissed as irrelevant: as I discussed in section 1.2, inhibitory interference can be attributed to other processes apart from memory retrieval, e.g. memory encoding [27]. Finally, this evidence could be dismissed altogether as a fluke: as Jäger et al. (2017, p.326) notice, the meta-review suggests that interference is more likely to be observed in ungrammatical sentences, and even high-powered studies like Jäger et al. (2015) fail to find reliable evidence for it.

[27] This argument is rather weak, especially since Jäger et al. (2017) provide evidence that at least in some cases inhibitory interference is informative about retrieval.

Facilitatory interference observed by Cunnings and Felser (2013) in their Experiment 2 (low-working-memory participants only) could be accounted for if we assume that lures were more prominent than targets in the experimental materials [28]. This assumption would not be too far-fetched: the lures were linearly closer to the reflexives and were prominent in the discourse (pronouns were used as lures, while full NPs were used as targets). Within the context of the PP2017 model one would have to make an additional assumption that the activation boost from being prominent was functionally equivalent to a single morphological feature mismatch: i.e. combined with an actual single feature mismatch, it was big enough to counteract the higher influence from syntactic cues.

[28] See the discussion in Section 1.5.1.1 about modifications to the LV05 model that allow it to take linguistic prominence into account.

The findings by King et al. (2012a) could be dismissed on different grounds. PP2017 argue that the lure-match effects reported by King et al. (2012a) are of a different nature, since they were observed in sentences where the reflexive pronoun was situated in a non-argument position. It is known from the linguistic descriptions that such pronouns may obey a different set of constraints (e.g. Reinhart and Reuland, 1993), therefore they may rely on different processing routines compared to reflexives in argument positions (e.g. it could be the case that the parser weighs syntactic cues lower when resolving reflexives in a non-argument position; this would be compatible with the observation that the resolution of such reflexives is often influenced by discourse factors).
1.4 Recent arguments against structure-defeasible models

While the LV05 model can successfully explain empirical evidence across multiple dependencies, a number of empirical results are problematic. In this section I focus on two objections to the wholesale adoption of “structure-defeasible” memory models, both of which come from very recent studies. The first objection says that the main argument for choosing memory models over representational ones is invalid. The second objection suggests that “structure-defeasible” memory models should not be used across the board, because different dependencies rely on qualitatively different mechanisms. I discuss them in turn.

1.4.1 LV05 model fit to agreement data is worse than usually thought

The big reason for choosing memory models over representational ones was their ability to explain the existence of lure-match effects in ungrammatical sentences and the absence of lure-match effects in grammatical sentences. However, this argument is problematic for two reasons. First, as Jäger et al. (2017) discuss, the model’s predictions for the grammatical sentences are different from what is usually assumed. In fact, (1c) is not predicted to be as fast as (1d); it is predicted to be slower, since in (1c) the target and the lure share the number feature. As I have discussed in section 1.3, the amount of activation each cue can contribute is limited. If more than one item matches a given cue, this limited amount of activation will be spread evenly between the matching items. In my example, this means that the target in (1c) will on average have lower activation than the target in (1d). And since in the LV05 model activation directly determines retrieval times, people should be slower in (1c).

(1) a. *The key to the cabinet are on the table.
b. *The key to the cabinets are on the table.
c. The key to the cabinet is on the table.
d. The key to the cabinets is on the table.

Second, empirical evidence appears to be incompatible with either the Wagers et al.
(2009) account or with the predictions of a basic LV05 model. It appears to be the case that the “absence” of ungrammaticality illusions may be better characterized as “rarer observations” of ungrammaticality illusions. A Bayesian meta-analysis of the available evidence conducted by Jäger et al. (2017) reveals a pattern indicative of ungrammaticality illusions: people read grammatical sentences faster, not slower, when the lure matches the target in features. Interestingly, Hammerly et al. (draft.april.2018) report similar observations regarding reaction times; moreover, they suggest a way of capturing the data from both grammatical and ungrammatical sentences using a representational model. This turns the argument for memory models and against representational models upside down: now it may be the case that it is representational accounts which can capture the patterns in both grammatical and ungrammatical sentences, and memory models can do so only in the latter case. We discuss the Hammerly et al. (draft.april.2018) data below and argue that the amount of evidence is not yet sufficient to completely overturn the preference for memory models, but the direction of the argument and the challenges it presents to cue-based accounts are worth noting.

Hammerly et al. (2018)

The main claim of Hammerly et al. (draft.april.2018) (henceforth HSD2018) is that the reported absence of ungrammaticality illusions is largely an artifact of people’s bias to treat their linguistic input as grammatical. In a series of three grammaticality-judgment experiments and a computational simulation, they show that if this grammaticality bias is canceled, lure-match effects appear equally frequently in grammatical and ungrammatical sentences. This result undermines the most important empirical argument for cue-based models and has the potential of bringing the representational models back into the limelight.
This will mean, of course, returning to the view that structural information categorically guides real-time dependency formation.

HSD2018 argue that to say that ungrammaticality illusions are absent would be too strong, and a more accurate statement would be that both grammaticality and ungrammaticality illusions are observed, although the latter are observed more rarely. They suggest that this asymmetry is an artifact of the experimental procedure, and not a reflection of real cognitive differences. Namely, most experimental sentences are usually grammatical, which biases participants. In a judgment task that would result in answering “grammatical” more often than “ungrammatical”, and this in turn will result in a pattern of responses where ungrammatical sentences are deemed grammatical more often than grammatical sentences are deemed ungrammatical. This explanation suggests that if we manage to eliminate this response bias, the number of grammaticality and ungrammaticality illusions should be roughly the same. This is, indeed, what HSD2018 show.

They discuss the results of three acceptability judgment experiments, in which they manipulated participants’ bias. The first experiment was a replication of previous results, showing the absence of ungrammaticality illusions in a judgment task (Wagers et al., 2009). The sentences followed a standard agreement attraction design, with the grammaticality of the sentence ((12a)-(12b) vs. (12d)-(12c)) and the match between the lure and the verb ((12a)-(12d) vs. (12b)-(12c)) manipulated. The results were consistent with the previous findings: an interaction of grammaticality and lure match was observed. Grammatical sentences were accurately judged as grammatical regardless of lure match. Ungrammatical sentences were much more likely to be judged as grammatical if the lure matched the verb. That is, the results showed the absence of ungrammaticality illusions.

(12) a. The friend of the nurse frequently visits...
b. The friend of the nurses frequently visits...
c. The friend of the nurse frequently visit...
d. The friend of the nurses frequently visit...

The next two experiments attempted to manipulate people’s response bias by changing the proportion of ungrammatical items [29] and drawing participants’ attention to this fact in the instructions. In Experiment 2, participants were specifically told that two thirds of the items were ungrammatical. This manipulation led to a slight reduction of the asymmetry observed in Experiment 1: now people were a little bit more likely to respond “ungrammatical” to a grammatical sentence with a lure which mismatched the verb. This effect was statistically significant; however, the critical interaction was still present in the data. Thus, even if the asymmetry between the effect of the lure in grammatical and ungrammatical sentences was weakened, it did not disappear completely.

[29] Two thirds of the items were ungrammatical vs. a half in Experiment 1.

In Experiment 3 the instructions were changed so that instead of receiving the exact proportion of ungrammatical stimuli, the participants were simply told that the majority of the items were ungrammatical. This manipulation was more effective: people became less accurate in grammatical sentences and more accurate in ungrammatical sentences when the lure mismatched the verb. This made the critical interaction disappear: only the effect of lure match was statistically significant. This pattern suggests that both illusions of grammaticality and ungrammaticality were present.

HSD2018 suggest that a representational account by Eberhard et al. (2005) (“Marking and Morphing”) could be used to explain their results. This approach assumes that (nominal) number is represented as a continuous value between negative and positive infinity, where negative values correspond to “more singular” and positive values correspond to “more plural”.
The number of an entire noun phrase, $S(r)$, is determined as in Eq.1.5. $S(n)$ is the notional component of the number valuation, coming from the conceptual message. The sum term represents the syntactic contribution to the number valuation: it is a linear combination of the number information on the head and the embedded phrases, weighted by the distance from the root node. $S(r)$ can be conceptualized as the probability of the plural verb, if we put it through a logistic transformation, thus bounding its values between 0 and 1 [30].

$$S(r) = S(n) + \sum_j \left( w_j \times S(m)_j \right) \qquad (1.5)$$

[30] As far as we understand, S(r) is always greater than 0. Thus, the actual logistic transformation includes a negative bias term in order to bias the representation towards singular in the absence of other sources of number information.

Consider how this model would predict agreement attraction. For a subject phrase like “The key to the cabinet” both nouns inside it are singular, thus, the number representation of the whole phrase will correspond to “singular” by default. On the other hand, in a phrase like “The key to the cabinets” the embedded noun is plural. This plurality will increase $S(r)$, making the plural verb more probable.

Results of Experiments 2 and 3 would be captured by this account rather easily. Experiment 1 would be more problematic: as we have already discussed, the changes to the number marking of the subject NP will happen regardless of what the number on the verb actually is, so both illusions of ungrammaticality and grammaticality are predicted. It is also not immediately clear how to incorporate the bias manipulation in the Marking and Morphing model. HSD2018 propose that both goals will be achieved if we map Marking and Morphing parameters to the parameters of a drift diffusion model (Ratcliff, 1978). Drift diffusion models model
a binary decision process as a diffusion process (a continuous version of random walk) starting at a random point between two decision boundaries. The amount of evidence available to the decision maker determines how quickly the model walks toward the correct decision boundary; however, due to the stochastic nature of the process it is not guaranteed that it will reach it first: it may happen that enough steps occur in the “wrong” direction so that the incorrect decision boundary is hit. Now, we can map the S(r) value from the Marking and Morphing account onto the strength of evidence parameter of the diffusion model: in this way, a more extreme number value on the subject would lead to (on average) faster and more accurate grammaticality decisions [31]. The crucial assumption is that participants’ response bias can be mapped to the starting point of the diffusion process. That is, if they are biased to treat sentences as grammatical, the diffusion process will start closer to the “grammatical” decision boundary.

[31] Notice that S(r) corresponds to the number information; however, the decision boundaries are labeled “grammatical” and “ungrammatical”. It seems that the parser would have to decide ahead of time, e.g. based on the information from the verb, which number value should correspond to the “grammatical” decision boundary.

Here is how this model would capture the results of the experiments. In Exp.1, where by hypothesis participants have the strongest grammaticality bias, the diffusion process would start close to the grammatical boundary. This would lead to a situation where for grammatical sentences people would almost always hit the grammatical boundary, even if the number marking on the target NP is less unambiguously singular due to the influence of the lure: the model will have little time to accumulate enough drift in the wrong direction.
In the ungrammatical sentences, the ambiguous number marking on the target NP as in (12d) would make the drift towards the “ungrammatical” decision boundary slower than in (12c), giving the process enough time to accumulate evidence in the wrong direction and to occasionally hit the “grammatical” boundary. Overall, we would expect to see the asymmetry between grammatical and ungrammatical sentences: only for the latter would the number information on the lure lead to an increased proportion of incorrect responses. Exp. 2 and 3, by hypothesis, neutralized the grammaticality bias, shifting the starting point of the diffusion model to a position where it would be roughly equidistant from both boundaries. In this situation, we would expect that more ambiguous number marking on the target (as in (12b) and (12d)) would lead to an identical slowdown of the diffusion process in both grammatical and ungrammatical sentences. This would make it roughly equally likely for the wrong decision boundary to be hit, leading to a symmetrical distribution of incorrect responses. People would produce as many “grammatical” responses in (12d) as they would produce “ungrammatical” responses in (12b), that is, we would observe both illusions of grammaticality and ungrammaticality.

If the reasoning above is correct, it undercuts the argument for choosing “structure-defeasible” memory models at several points. To begin with, it eliminates a crucial argument against representational accounts: if they can capture the absence of ungrammaticality illusions in certain conditions (which, arguably, correspond to the conditions in which no ungrammaticality illusions have been observed), we lose our strongest reason to prefer memory models.

HSD2018 findings also provide evidence against memory models. First, the fact that manipulating participants’ bias affects the rate of ungrammaticality illusions is prima facie unexpected in a cue-based model.
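The mapping from Marking and Morphing onto a diffusion process, as summarized above, can be made concrete with a small sketch. It is not the HSD2018 simulation. It uses the standard closed-form probability that a Wiener process with constant drift reaches the upper of two absorbing boundaries first, a fixed rather than random starting point, and hypothetical number values, weights and gain. The sign convention that turns S(r) into drift toward the “grammatical” boundary depending on the verb is also an assumption, since the source leaves this step open.

```python
import math

def number_value(notional, morphemes):
    """Eq. 1.5: S(r) = S(n) + sum_j w_j * S(m)_j, morphemes given as (w_j, S(m)_j)."""
    return notional + sum(weight * value for weight, value in morphemes)

def p_grammatical(drift, start, boundary=1.0, noise=1.0):
    """Probability of hitting the upper ('grammatical') boundary before 0.

    Closed-form result for a diffusion with constant drift starting at `start`
    between absorbing boundaries at 0 and `boundary`.
    """
    if drift == 0:
        return start / boundary
    k = -2.0 * drift / noise ** 2
    return (math.exp(k * start) - 1.0) / (math.exp(k * boundary) - 1.0)

def drift_for(subject_value, verb_is_plural, gain=1.0):
    """Assumed mapping: evidence for 'grammatical' grows as subject and verb number agree."""
    return gain * subject_value if verb_is_plural else -gain * subject_value

SG, PL = -1.0, 1.0  # hypothetical number values: negative = singular, positive = plural

def subject(lure_number):
    # Notionally singular subject with a singular head and a lower-weighted embedded noun.
    return number_value(-0.5, [(1.0, SG), (0.3, lure_number)])

def lure_effects(start):
    """Size of the lure effect on 'grammatical' responses for each sentence type."""
    # Grammatical sentences, singular verb: (12a) singular lure vs. (12b) plural lure.
    gram = (p_grammatical(drift_for(subject(SG), False), start)
            - p_grammatical(drift_for(subject(PL), False), start))
    # Ungrammatical sentences, plural verb: (12d) plural lure vs. (12c) singular lure.
    ungram = (p_grammatical(drift_for(subject(PL), True), start)
              - p_grammatical(drift_for(subject(SG), True), start))
    return gram, ungram

biased = lure_effects(start=0.8)    # start near the 'grammatical' boundary
unbiased = lure_effects(start=0.5)  # start equidistant from both boundaries
print("biased start   (grammatical, ungrammatical):", biased)
print("unbiased start (grammatical, ungrammatical):", unbiased)
```

With the biased starting point the lure changes responses much more in ungrammatical than in grammatical sentences, while with the unbiased starting point the two effects are equal, which is the qualitative contrast between Exp.1 and Exp.3 described in the text.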
Indeed, if we think that a memory item that perfectly matches the retrieval cues will always outcompete other memory chunks, in grammatical sentences the correct agreement controller should always be retrieved[32]. Second, the observed pattern of reaction times is hard for a cue-based model to fit. The critical observation is that if the lure matches the target in features, the judgments are given faster, while the LV05 model would predict a slow-down in such configurations. This evidence is particularly interesting, since it aligns with the results of the quantitative meta-analysis by Jäger et al. (2017), which we have discussed earlier. Again, it is not the case that cue-based models cannot account for these RT patterns: if we adopt Engelmann et al.’s (draft4) suggestion that linguistic prominence can affect memory chunks’ activation (see section 3.1) AND if we assume that the lure is more prominent than the target, we would be able to capture the HSD2018 RT patterns. But we do not see any arguments for why the lure should be considered more prominent than the target, so practically speaking these RT patterns still remain problematic for the LV05 model.

[32] In principle, bias can be incorporated in cue-based models, but it is less clear to what degree we would be able to capture all the patterns reported by HSD2018. E.g. one could hypothesize that the weights for retrieval cues are adjusted dynamically depending on the current linguistic (and potentially extra-linguistic) context. The size of the adjustment would correspond to grammatical bias. In a context where most of the sentences are grammatical, the parser can be reasonably sure about its parses, and thus may be willing to rely on structural cues. On the other hand, if most of the sentences are ungrammatical, the parser may lessen its reliance on structural cues. This will result in a situation in which the target has less of an advantage over the lure, and due to stochastic noise, lures will have more chances to be retrieved. In sentences like (12a) this should not affect participants’ responses: both the target and the lure match the verb, so whatever noun is retrieved, the result can be judged as grammatical. On the other hand, in sentences like (12b) retrieving the lure would lead to the “ungrammatical” response. Overall, we would observe fewer correct judgments in (12b) than in (12a) - i.e. we would observe ungrammaticality illusions. Unfortunately, this explanation would fail to capture the pattern of results in ungrammatical sentences observed in HSD2018 Exp.3: when the response bias decreased, the proportion of correct responses in the ungrammatical sentences increased. This is the opposite of what my ad hoc account would predict: if the parser does reduce its reliance on structural cues, lures should become more likely to be retrieved regardless of sentence grammaticality. Thus, we would expect that there will be more incorrect judgments in both grammatical and ungrammatical sentences.

Thus, HSD2018 account and evidence pose interesting challenges for memory accounts. We think, however, that in their current state they are still too weak to completely rule memory models out. First, HSD2018 explanation relies on and covers only the acceptability judgment task. However, the absence of ungrammaticality illusions has been reported in studies relying on other experimental techniques, and it may not be straightforward to extend HSD2018 account to cover these (as the authors themselves discuss). HSD2018 argue that self-paced reading might be accommodated relatively easily: instead of deciding on whether the sentence is grammatical or ungrammatical, people could decide on whether to press or not to press the button to uncover the next word.
Eye-tracking might be harder to accommodate, since it is less clear what are the decisions that people are making during reading. Potentially it could be something along the lines of “stay on the current word / make a progressive saccade / make a regressive saccade” etc. But it is clear that the decisions are unlikely to be binary and the model would have to be further complicated to allow for this. Finally, EEG results (e.g. Xiang et al., 2008; Tanner et al.) might be the trickiest to cover, since it is not immediately clear how to talk about observed ERP patterns in terms of underlying decision making processes.

More generally, HSD2018 model only accounts for agreement data. One could claim that it should do so, since lure-match effects in subject-verb agreement and reflexive resolution are separate phenomena; this is the route Sloggett (2017) takes. From the perspective we take in this thesis this is rather a disadvantage. It is not clear how the model would apply to reflexives resolution. First of all, as Jäger et al. (2017) suggest, there is very little evidence for illusions of ungrammaticality in reflexive resolution (in comparison, the same analysis does provide evidence for illusions of ungrammaticality in agreement attraction). Second, it is not clear whether Marking & Morphing account could explain lure-match effects in configurations used to study reflexives resolution. Often the lure is structurally higher than the target (e.g. it is located in the matrix clause and the target - in the embedded complement clause): thus, in order for the features of the lure to influence the target, we would need to say that the features information from the lure can percolate downwards. As Wagers et al. (2009) point out, Eberhard et al. (2005) do indeed make this assumption. But even in this case, the lure is more structurally distant from the target than in the agreement cases, where the lure is typically embedded in a PP inside the subject NP.
Given that Marking & Morphing account assumes that the influence of the percolating information reduces with structural distance, in the reflexives cases the lure may be too far from the target to noticeably affect its representation. Finally, it is not clear how Marking & Morphing would explain the fact that lure-match effects are not observed in 1-feature target mismatch configurations, but are observed when the target mismatches the reflexive in two features. It would be easier for the information from the lure to bring the mismatching target closer to the match with the reflexive in 1-feature mismatch conditions, and thus, the opposite pattern of the results would be predicted.

To sum up, HSD2018 proposal shifts the focus from “structure-defeasible” cue-based models back to “structure-strict” representational ones, and poses interesting problems for memory accounts. On the other hand, it also gives less universal coverage than the framework offered by cue-based models. I do not feel that the proposal is developed enough to strongly sway the odds against cue-based models, but it may be an important first step towards alternative accounts. As we will see now, the same tendencies are evident in Sloggett’s (2017) account of the reflexive data.

1.4.2 Reflexives processing is qualitatively different from agreement

The second argument against uniform application of “structure-defeasible” memory models suggests that different dependencies (and in particular - subject-verb agreement and reflexive resolution) rely on qualitatively different mechanisms. This argument has been present in the literature for some time. Originally, it was based on the fact that lure-match effects are rarely observed in reflexives resolution, as well as on some data indicating that the search for the antecedent may not be parallel, as cue-based models assume, but rather serial (Dillon, 2011; Dillon et al., 2013; Dillon, 2014).
We will not discuss the previous arguments in detail, focusing instead on two recent studies.

Jäger et al.’s meta-analysis provides evidence for qualitative differences between dependencies. First, it suggests that for reflexives there is no evidence in the available data for lure-match effects in grammatical sentences[33]. Second, in ungrammatical sentences, the evidence suggests that people are slower when the lure matches the reflexive. Both empirical patterns contradict the predictions of the LV05 model. Moreover, both patterns differ from the results of the analysis on the subject-verb agreement data. This discrepancy might be indicative of the underlying differences between dependencies.

[33] In a way, reflexives are being more catholic than the Pope (subject-verb agreement) - it appears that for reflexives, illusions of ungrammaticality are truly absent.

We consider the evidence provided by Jäger et al. (2017) as only tentative due to several caveats of their analysis. First, as the authors point out themselves, it is essentially observational, therefore any conclusions should be confirmed using experimental studies specifically designed with this goal in mind. Second, the quantitative meta-analysis relied only on first-pass reading times for the eye-tracking studies, because “it is the most commonly reported eye-tracking measure in the psycholinguistic literature and arguably reflects early cognitive stages of dependency formation” (Jäger et al., 2017, p.323). However, lure-match effects in reflexive resolution appear to be quite variable in timing. E.g. Parker et al. (2017) only observed effects in first-pass reading times in two out of three experiments.
Similarly, in Sloggett (2017) the time course of lure-match effects varied across experiments: in two of them, the effects were observed relatively early (first-pass or regression path times at the critical region), in the other two, the effects became evident later (total times at the critical region or regression path at the spillover region). It is not quite clear whether such differences are due to random variability across participants or experimental items, or whether they reflect meaningful differences in the reflexive resolution process. Whatever the case, an analysis based only on first-pass RTs will miss some of the effects reported in the literature. Finally, Jäger et al. (2017) review does not include Sloggett (2017) data, presumably because it was not available at the time of writing. Sloggett (2017) showed that lure-match effects observed by PP2017 can indeed be reliably observed (modulo the discussion about the exact configurations in which they arise). It is possible that with these data the quantitative analysis would have given different results.

Sloggett (2017)

The evidence for qualitative differences between subject-verb agreement and reflexives presented by Sloggett (2017) is more extensive. He builds on and expands empirical observations made by Parker and Phillips (2017). In contrast to the retrieval error explanation for agreement attraction, Sloggett argues that the observed lure-match effects in reflexives processing are a manifestation of a completely grammatical anaphora resolution strategy, namely, logophoric interpretation of reflexive pronouns. Logophors are pronouns which refer to the person “whose speech, thoughts, feelings, or general state of consciousness are reported” (Clements, 1975)[34]. Sloggett’s hypothesis is motivated by three crucial observations:

1. Lure-match effects are not observed when the lures are subjects of perception verbs, even if the target mismatches the anaphor in two morphological features. However, the identity of the matrix verb does not matter if the lure refers to the only center of consciousness in the sentence.

2. Lure-match effects are eliminated or at least weakened if the target is an indexical pronoun like “I” or “you”.

3. As often as in 30% of the cases, when asked a comprehension question probing the reflexive resolution, people choose the answer compatible with the lure being the intended antecedent, even when the target fully matches the reflexive in morphological features.

[34] The term “logophor” was initially introduced by Hagège (1974) to describe morphologically distinct pronouns in a number of languages spoken in Africa (e.g. Ewe, Yoruba, Igbo; see Culy (1994) for an extensive list of relevant languages). However, the use later was extended to denote non-clause-bound reflexive pronouns in a number of languages (e.g. Japanese zibun, Chinese ziji, Icelandic sig). In this second sense, the term has also been applied to English, to describe the pronouns in configurations like: “Max said that the queen invited both Lucie and himself for tea.” (e.g. Reinhart and Reuland, 1993). Notice that a priori it is not guaranteed that “logophors” in African languages and non-clause-bound reflexives rely on the same set of constraints (or on the same processing routines), thus one must be careful not to conflate the facts pertaining to the two phenomena.

Here is how logophoric hypothesis can explain these patterns. First, lure match sensitivity to the type of the embedding verb appears to conform to a pattern reported for other languages. Culy (1994) suggested the following implicational hierarchy for a number of languages spoken in West Africa: Speech < Thought < Knowledge < Direct perception. That is, if a language allows subjects of verbs on the right end of the scale to be taken as antecedents for logophors, it will allow the same for the subjects of verbs lower on the hierarchy.
The first pattern is in line with the Culy hierarchy; and the fact that the type of the verb does not matter if the lure is the only animate entity in the sentence is also naturally accommodated: if there is a single conscious antecedent in the discourse, the logophor “has no choice”. The second pattern closely resembles “person blocking”, a syntactic phenomenon observed in Chinese long-distance binding of the reflexive “ziji”. As Huang and Liu (2001) discuss, “ziji” allows antecedent outside of its local domain, as in (13a). However, an intervening first or second person pronoun may block this resolution option (13b)[35]. Huang and Liu argue that these effects arise when “ziji” is used as a logophor, and not as a syntactic reflexive. Finally, the third piece of evidence is explained almost for free: if resolving the reflexive to the lure is a representation of a grammatical strategy, we should expect lure-match effect to appear even in grammatical sentences.

(13) a. Wo/ni_i danxin Lisi_j hui piping ziji_i/j.
        I/you_i worry Lisi_j will criticize SELF_i/j.
        I/you worry that Lisi might criticize me/you/herself.
     b. Zhangsan_i danxin wo/ni_j hui piping ziji_*i/j.
        Zhangsan_i worry I/you_j will criticize self_*i/j.
        Zhangsan worries that I/you might criticize my/yourself.

[35] This is not the only environment where blocking effects appear, but it is the most relevant one for the current discussion.

To formalize this hypothesis, S2017 uses a modified version of PP2017 cue-based retrieval account. On the memory retrieval side, the two accounts are virtually identical, with one crucial difference: S2017 makes his model “structure-strict”, by assuming that structural features act as gating features and only grammatical antecedents are ever retrieved. As far as I can tell, he does not discuss how this difference is implemented in the model (higher weights on structural cues? different cue combinatorics scheme?)
However, the main explanatory power of S2017 comes from a suggested modification to the grammar of English. Sloggett suggests that a phonologically null operator, OPlog, is located in the left periphery of complement clauses (see Charnavel and Sportiche, 2016, for a related proposal). By hypothesis, it tracks perspective centers of the utterance. Each appropriate[36] antecedent is mapped onto one of the three perspective “roles” from Sells (1987): source, self and pivot. The first role corresponds to the participant acting as the source of information, the second - to the participant whose mental state is reported and the third one - to the participant whose (physical) location is assumed as the reference point. The roles form a hierarchy, such that a participant takes on all the roles below the role that is assigned to him/her. E.g. if somebody acts as self, then he also takes the role of pivot, but not that of source. OPlog obligatorily refers to the highest specified role on this hierarchy. I.e. OPlog will refer to whatever participant bears the role of source; if there are no such participants, it will look for self and, failing that, for pivot. Finally, S2017 suggests that these roles are assigned probabilistically. E.g. speakers are most likely to be mapped onto source, but may sometimes get mapped onto self; the reverse could hold for indexical pronouns.

[36] Minimally, an animate entity.

Notice that binding by OPlog[37] is syntactically local with respect to the reflexive. Thus, the non-local binding is just an appearance, created by the special referential properties of the operator. But if OPlog is a licit binder, why are non-local antecedents so inaccessible to native speakers? S2017 explains this in terms of the feature composition of the operator: it is assumed to be underspecified for φ-features. This ensures that it is a worse match to the retrieval cues compared to the overt target antecedent.
As such, it is less likely to be retrieved; importantly, it does not mean that its retrieval is impossible, just that it will happen only rarely, perhaps when stochastic noise happens to bias memory activations in the right direction.

[37] Or at least “more local” than for the lure: OPlog is situated in the same clause with the pronoun. However, it does not occupy a typical position for an antecedent, since it is not located in an argument position.

Here is how this formalization would derive the three crucial patterns I mentioned in the beginning of the section. The effect of the embedding verb stems from the way participants are mapped onto Sells’s roles. Speech verb subjects are likely to be sources, so OPlog will most often take them as antecedents. On the other hand, perception verb subjects are more likely to be pivots. Under the assumption that only one participant can be mapped on any given role, by the time the reflexive is encountered, the control of the pivot will be taken by the embedded subject. Since there will be no higher roles specified, this is what OPlog will refer to, and non-local interpretation will not be available. Person blocking effects are explained in a similar way. S2017 suggests that although speech verb subjects tend to be mapped onto source and indexical pronouns - onto self, sometimes the mapping gets reversed due to its stochastic nature. It is exactly in these cases that we will observe person blocking effects. Finally, since the resolution process is assumed to follow grammatical constraints, it should come as no surprise that people are able to access non-local interpretations even when the (overt) local target completely matches the features of the reflexive.

1.5 Parker & Phillips (2017) vs. Sloggett (2017)

1.5.1 Empirical observations

We have seen that the recent work has suggested reasons to shift in the direction of “structure-strict” models applicable to isolated linguistic phenomena.
They rely on two types of arguments: a) agreement attraction data does not constitute evidence for (and potentially is evidence against) memory models; b) reflexive resolution relies on different mechanisms and thus there is no reason to capture it and subject-verb agreement with the same kind of “structure-defeasible” model.

While this shift may turn out to be in the right direction, I consider the universality of “structure-defeasible” memory models too attractive to easily give it up. Thus, my goal in this thesis is to look for further evidence supporting or contradicting the recent claims. In order to do so, I will focus on the two accounts of lure-match effects in reflexive resolution, suggested by Parker and Phillips (2017) and Sloggett (2017), which I have already discussed. In this last section of the introduction, I discuss whether any of the empirical facts present a fundamental problem for the accounts and whether we have strong reasons to prefer one over the other.

Altogether, I take the following 7 empirical observations from the previous research to be important (I mark in parentheses the studies they are coming from):

E1 In configurations like The boy said [that the girls like himself] people read sentences with feature matching lures faster if the target mismatches the reflexive in two features. This does not happen when the target fully matches the reflexive. (PP2017, S2017)

E2 The speed up also does not occur when the target mismatches the reflexive in a single feature. (PP2017)

E3 Lure-match effects also appear in configurations like The boys [that the girl liked] praised herself in the cases when the target mismatches the reflexive in two features. It is not known whether the degree of target match modulates lure-match effect in these cases. (PP2017)

E4 Lure-match effects are not observed when the lures are subjects of perception verbs, even if the target mismatches the anaphor in two morphological features. (S2017)

E5 However, the identity of the matrix verb does not matter if the lure refers to the only center of consciousness in the sentence. (S2017)

E6 Lure-match effects are eliminated or at least weakened if the target is an indexical pronoun like “I” or “you”. (S2017)

E7 In an interpretation task, people answer interpretation questions in a way which suggests that they treat the lure as the antecedent as often as in 30% of the cases, even when the target fully matches the reflexive in morphological features. (S2017)

PP2017 interpret findings E1, E2 and E3 as evidence for faulty memory access in a cue-based memory. S2017 agrees that E2 should be explained in terms of the properties of the memory access. However, he assumes that retrieval operations only ever return structurally licit antecedents, i.e., his account is “structure-strict”. S2017 suggests that the majority of the facts (except for E3) should be seen as evidence for a grammatical strategy of reflexives resolution. In his explanation, a phonologically null operator in the left periphery of complement clauses tracks discourse prominent entities and can create appearances of non-local reflexive binding.

1.5.1.1 PP2017

PP2017 account naturally accommodates observations E1, E2 and E3. Prima facie, it would appear that it is unable to explain E4, E5 and E6, since they suggest that lure-match effects may depend on things other than the representation of the lure, the target and the reflexive. However, I will argue that in fact PP2017 account could explain these observations. I will start with discussing whether the symbolic component[38] of the ACT-R model can capture S2017 effects.

To capture E4, we would have to ensure that the information about the verb enters the cue-matching process in some way. Presumably, it does not do it directly: the parser is likely not looking for verbs when it tries to resolve a reflexive. But we could encode the information about the verb on its subject, e.g.
by using a feature like “+ subject of a speech verb”. The reflexive would include this feature in the set of retrieval cues, biasing retrieval against subjects of other types of verbs. While possible, I consider this approach too clumsy to be our first choice for a few reasons. First, it requires some additional processing: during the encoding stage, the parser has to retrieve the subject at the verb and update the subject’s representation with the verb type information[39]. Second, the information about whether the verb is a speech verb or not may only be useful for logophoric pronouns. But at the point where it has to be encoded, the parser has no idea of whether a reflexive pronoun will come up at all, even less about whether it will receive logophoric interpretation. Thus, often the information will be encoded, but not used, and if we think that memory space available to the parser is limited, using such redundant encodings would seem like a suboptimal strategy to adopt. Third, the feature I suggest is essentially privative. It is fine if we just want to distinguish between speech and perception verbs. But the Culy (1994) hierarchy that S2017 experiments are inspired by includes other verb types, e.g. verbs of thought and knowledge. Experiments would be required to see whether these verbs support or block lure-match effects, but if at least one more class of verbs does support them, a privative feature of the sort I suggested would not work anymore. Instead, a more general feature like “+ a subject of logophoric verb”[40] would be needed. This would not be too far fetched in languages with grammaticized logophoric pronouns, but would be more questionable in English. Alternatively, we could use multiple features, one per verb type (e.g.

[38] I.e. operations defined in terms of features, procedural rules etc.

[39] Although the overhead may be minimal, if we assume that the subject is retrieved at the verb for integration purposes anyway.
“+ subject of perception verb”, “+ subject of knowledge verb” etc.). But in this case the set of retrieval cues would have to include several verb-type related features, leading to a situation where no antecedent would be a perfect match to the set of retrieval cues. One may also wonder whether it is only the types of verbs on the Culy (1994) hierarchy that even put a feature on their subject. If it is the case, the parser should essentially have access to cross-linguistic generalizations (“these verbs are logophoric in other languages, so I need to pay attention to them in English”). If all verbs put some information about their semantic class on the subject, this again raises the question of whether the encodings are optimal from the point of view of memory space usage.

[40] Somewhat sloppily, I use “logophoric verbs” to mean “verbs allowing their subjects to serve as antecedents for logophoric pronouns”.

A second approach would be to avoid using features on the verb’s subject, instead accessing the verb’s information directly. The following antecedent retrieval algorithm could work: first, retrieve a suitable NP (e.g. matching features of the reflexive). Then, using the information on this NP, retrieve the verb it is the subject of, check whether the verb is the verb of speech or perception. Finally, depending on the result, either return or do not return the NP as the reflexive antecedent. It cannot be the algorithm which is attempted first, otherwise it would exclude grammatical local and c-commanding antecedents which happen not to be subjects of speech verbs. Therefore, it is more plausible as a repair algorithm, for the cases when the parser fails to find a licit local antecedent. Interestingly enough, as we discuss in section 5.1.3, some patterns in the data may indeed indicate that lure-match effects are due to repair processes. Overall, the fact E4 could be captured even at the symbolic level of ACT-R.
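The two-stage procedure described above (an ordinary search for a local antecedent, followed by a verb-checking repair step) can be sketched as follows. This is only an illustrative, fully deterministic rendering under my own assumptions: the class names, the feature dictionaries, the `local` flag standing in for structural licitness, and the set of licensing verb classes are all hypothetical and are not part of PP2017 or of ACT-R itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verb:
    lemma: str
    verb_class: str  # hypothetical labels, e.g. "speech" or "perception"


@dataclass
class NP:
    text: str
    features: dict                     # morphological features, e.g. gender, number
    local: bool                        # stand-in for "structurally licit antecedent"
    subject_of: Optional[Verb] = None  # the verb this NP is the subject of, if any


def matches(np, cues):
    """True if the NP carries every feature value requested by the cues."""
    return all(np.features.get(name) == value for name, value in cues.items())


def resolve_reflexive(memory, cues, licensing_classes=frozenset({"speech"})):
    """Return an antecedent for the reflexive, or None if none is found."""
    # First pass: a feature-matching, structurally licit antecedent wins outright.
    for np in memory:
        if np.local and matches(np, cues):
            return np
    # Repair pass: accept a feature-matching NP only if the verb it is the
    # subject of belongs to a licensing class.
    for np in memory:
        if not matches(np, cues):
            continue
        verb = np.subject_of
        if verb is not None and verb.verb_class in licensing_classes:
            return np
    return None


if __name__ == "__main__":
    cues = {"gender": "masc", "number": "sg"}  # cues projected by "himself"
    girls = NP("the girls", {"gender": "fem", "number": "pl"}, local=True)
    boy = NP("the boy", {"gender": "masc", "number": "sg"}, local=False,
             subject_of=Verb("say", "speech"))
    print(resolve_reflexive([boy, girls], cues))
```

In this sketch a lure that is the subject of a speech verb is returned only when the local target fails to match, which is the sense in which the procedure acts as a repair rather than as the default search.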
It may be harder to capture the fact E5 (the type of embedding verb is irrelevant if the target antecedent is inanimate) while staying on the same level. I discuss some possibilities below but conclude that they may be too convoluted. Basically, we would have to manipulate the representation of the lure (in my first approach) or the processing procedures (in my second approach) based on whether the lure is the only consciousness center in the sentence. This could be done, but would require introducing yet another feature, something like “+ conscious”. During anaphora resolution, the parser would retrieve all antecedents with such a feature, and then, if there is more than one, either change the representation of the lure[41] or change the processing strategies by removing the check for verb type. While this is all possible, capturing E5 in symbolic terms appears too ad hoc. As I discuss slightly later, though, these effects could be captured rather naturally by the sub-symbolic level of the model (i.e. in terms of the continuous activation values).

[41] Which would require retrieving it accurately, which may not be trivial in a “structure-defeasible” model.

Capturing the fact E6 at the symbolic level is also problematic. We could assume that indexical pronouns are underspecified for gender and mismatch the reflexive in only one feature, person. However, as S2017 discusses, this may not go through. Sloggett argues that this contradicts the observation that in his experiments sentences with indexical pronouns were judged as slightly less acceptable and were read slightly slower than sentences with it as the target (which presumably mismatches himself/herself in two features, animacy and gender). Alternatively, we could assume that the parser is aware of the presence of an indexical pronoun in a structurally intervening position with respect to the lure.
But that would either mean that the parser tracks indexicals AND their position relative to other elements of the sentences, which is a priori implausible, or that it first accurately retrieves the target (again, not unproblematic given that it mismatches the set of retrieval cues), determines that it is an indexical and for some reason stops looking further, even though the retrieved target mismatches the reflexive.

Finally, capturing E7 may be most problematic. In grammatical sentences the target always matches the morphological features on the reflexive at least as well as the lure does, and in addition only the target matches the structural cues, even if they are down-weighted. We do not expect to see any considerable amount of lure retrievals in such configurations even when the lure fully matches the morphological features on the reflexive, even less so when it does not. However, S2017 reports roughly 30% of the answers compatible with non-local reflexive resolution in the first case, and roughly 25% in the second. This pattern is hard to explain solely by the dynamics of the memory system, unless one assumes that the amount of noise in the system is so high that it ignores a perfectly matching item 30% of the time. It seems odd to assume that memory access is so inefficient: if it performs that poorly in perfect conditions, what would its performance be in a real-life situation with potentially less certainty about what the correct outcome is?

Summing up, explaining the empirical effects observed by S2017 purely in terms of memory processes is not impossible, but is rather challenging. But notice that these approaches stayed on the symbolic level of the ACT-R model. Now I would like to argue that it is potentially easier to explain all of the empirical facts if we take a look at the subsymbolic level, i.e. activation dynamics. In a basic ACT-R model, the only way to select an item is to make it more active than all the other competitors.
Usual ways of doing so include raising baseline activation due to frequent and/or recent retrievals and getting the boost from matching retrieval cues during memory access. As I have discussed, none of these mechanisms would be able to differentially change the lure’s activation based on context (e.g. the type of the embedding verb). However, it is possible that activation can be directly affected by other factors as well, e.g. linguistic prominence (Engelmann et al., draft4). One could hypothesize that subjects of speech verbs are more prominent in the discourse, being sources of information. If the boost in activation from this source is sufficient, only the subjects of speech verbs may act as efficient lures. One could also argue that animate entities are more prominent than inanimate - that would explain why the type of the embedding verb does not matter if its subject (the lure) is the only animate entity in the sentence. Finally, it is quite plausible that first and second person indexical pronouns are more prominent than third-person entities, being more central to the communicative situation - this could explain person blocking effects. Thus, discourse (or other related kind of) prominence might in principle explain effects for which S2017 postulates OPlog. I have to admit that these speculations remain only that - speculations - unless we are a) able to quantify the hypothesized changes in prominence and map them to specific values within an ACT-R model and b) show by simulations that the tentative predictions I make above indeed hold.

To conclude, empirical facts that S2017 uses as strong empirical evidence for his model could potentially be explained even without resorting to OPlog and the claim that lure-match effects in reflexive resolution have grammatical nature. Explicit simulations and more theoretical work are required to test whether my way of accounting for these effects within a “structure-defeasible” cue-based model is viable.
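As a purely illustrative starting point for such simulations, the sketch below shows how an additive prominence term could change the probability of retrieving the lure. It is a toy simplification rather than the LV05 or Engelmann et al. model: activation is reduced to a base level plus a cue-match score plus a hypothetical prominence bonus, retrieval probabilities follow a softmax (Boltzmann) choice rule with temperature sqrt(2)·s, and every numeric value is an assumption made up for the example.

```python
import math


def activation(base, cue_match, prominence=0.0):
    """Toy activation: base level + cue-match boost + assumed prominence bonus."""
    return base + cue_match + prominence


def retrieval_probabilities(activations, noise_s=0.3):
    """Softmax over activations with temperature sqrt(2) * noise_s."""
    temperature = math.sqrt(2) * noise_s
    highest = max(activations.values())  # subtracted for numerical stability
    weights = {name: math.exp((a - highest) / temperature)
               for name, a in activations.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def lure_retrieval_probability(lure_prominence, noise_s=0.3):
    """Probability of retrieving the lure in a two-item memory.

    Hypothetical cue-match scores: the target is assumed to match the retrieval
    cues somewhat better overall (1.5) than the lure (1.0); only the lure's
    prominence is varied.
    """
    activations = {
        "target": activation(base=0.0, cue_match=1.5),
        "lure": activation(base=0.0, cue_match=1.0, prominence=lure_prominence),
    }
    return retrieval_probabilities(activations, noise_s)["lure"]


if __name__ == "__main__":
    # Assumed bonus for a lure that is the subject of a speech verb vs. none
    # for a lure that is the subject of a perception verb.
    print("speech-verb subject:    ", lure_retrieval_probability(0.5))
    print("perception-verb subject:", lure_retrieval_probability(0.0))
```

Whether any empirically motivated prominence values would produce lure retrieval rates of the right size is exactly the open question raised above; the sketch only shows that the direction of the effect follows from the assumed additive term.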
1.5.1.2 S2017

The S2017 account can accommodate all of the empirical facts listed above, except for E3. S2017 crucially relies on a null operator tracking prominent discourse entities in the left periphery of complement clauses. However, in Parker and Phillips (2017) Exp.2 the lure is embedded in a relative clause. Without OPlog present, only the local target should be returned during retrieval, and no lure-match effects should be observed. Sloggett has to fall back on an ad hoc explanation. He suggests that these effects represent a case of sub-command binding, a phenomenon occurring in Chinese. In sub-command configurations, an animate NP embedded inside an inanimate subject may still bind a reflexive. If this explanation turns out to be false, we will have to conclude that even if some lure-match effects reflect logophoric interpretations of the reflexives, some others still arise as a result of processing errors. That alone would be enough to argue for “structure-defeasible” models (following PP2017 and contra S2017).

Could the OPlog account be extended to cover the lure-match effects from lures inside relative clauses? In principle, nothing prevents us from saying that OPlog is located in the left periphery of all clauses, not just complement ones. This possibility would be supported by Charnavel and Huang (2018), who argue that sub-command binding is actually an instance of logophoric binding. If that is the case, and if one assumes that all instances of logophoric interpretation are mediated by something like OPlog, the S2017 account would be able to explain the PP2017 findings in a way analogous to the explanation of fact E5: since in their stimuli the embedded lure was the only animate entity, it would also be the only possible logophoric antecedent.
But on the other hand, this account would fail to explain the contrast in (14) (Huang and Liu, 2001): since, by hypothesis, OPlog is able to refer to Zhangsan in the first case, it is not clear why it would not be able to do so in the second. It would be hard to argue that OPlog is only present in the first case: by the time the parser reaches the embedding NP, the semantics of which licenses the logophoric interpretation[42], OPlog would have already been introduced to the parse. Thus, while postulating OPlog in all clauses could improve the empirical coverage of the S2017 account, it would also introduce new problems. In addition, as S2017 himself notes, it would go against the tendency for logophoric antecedents to be subjects of attitude predicates. I will discuss these issues in more detail in Chapter 2.

(14) a. Zhangsan_i de baogao biaoshi tamen dui ziji_i mei xinxin.
        Zhangsan DE report indicate they to self no confidence
        Zhangsan’s report indicates that they had no confidence in self.
     b. *Zhangsan_i de shibai biaoshi tamen dui ziji_i mei xinxin.
        Zhangsan DE failure indicate they to self no confidence
        Zhangsan’s failure indicates that they have no confidence in him.

With respect to fact E7, although S2017 uses it to support his account, this empirical observation may be even more problematic for him than for PP2017. According to Sloggett’s hypothesis, the silent logophoric operator does not carry any φ-features. Thus, while in the PP2017 scenario the target had a single-feature advantage over the lure (only the target being the reflexive’s clausemate), in the S2017 scenario the target is a better match than OPlog in at least two features - gender and number (and potentially person and/or animacy). Thus, one would have to assume an even noisier retrieval system than in the PP2017 scenario to account for these patterns.

[42] As Huang and Liu (2001) put it: “This is because [the first sentence] implies that Zhangsan himself indicates that they had no confidence in him.
(If his report indicates P then he indicates P.) No similar implication holds of the unacceptable [second sentence].”

1.5.2 Empirical coverage: summary

To sum up, neither account covers all of the available data, nor is either a clear winner. The original PP2017 account fails to explain the evidence that factors other than the number of matching features affect lure-match effects. However, I have suggested a way in which this account could be extended to cover most of the problematic evidence. The S2017 account covers most of the data to begin with, but extending it to cover the data from the non-c-commanding lures is not straightforward. It either requires introducing an additional mechanism, not tightly related to the main proposal, or else complicating the interpretation of other empirical facts. In addition, I have argued that the interpretation data reported by S2017 may be problematic for both accounts.

1.5.3 My experiments

The main body of the thesis is an attempt to tease the two accounts apart using additional evidence. I report the following experimental work.

First, I investigate one of the weak points of the S2017 account. Chapter 2 reports the results of an experiment investigating configurations with the lures inside relative clauses, which Sloggett’s account does not cover directly. S2017 argues that in this instance another grammatical option is used (sub-command binding, which I discuss in more detail in the corresponding experimental chapter). This grammatical option crucially requires the head of the relative clause to be inanimate, and indeed, this is the configuration PP2017 had. Therefore, I am going to test the same configurations but with animate RC heads. To preview, the results will be taken to support the PP2017 account.

Second, given this support, I investigate further predictions of the PP2017 account to validate the conclusions from Chapter 2.
The experiments by PP2017 do not provide reliable evidence regarding the use of c-command information. Their results are compatible both with a model which faithfully encodes c-command but does not manage to use this information to rule out illicit antecedents, and with a model which encodes c-command information only approximately (thus not covering the relevant cases) but follows this approximation faithfully. I attempt to distinguish between these two possibilities by considering the case of quantificational (QP) lures. As I discuss in detail in Chapter 3, experimental evidence suggests that QPs in the wrong position (non-scoping/non-c-commanding) cannot bind pronouns like “him” (Kush, 2013). From the syntactic point of view, binding works in the same way for pronouns and reflexives. If the c-command information fails to uniquely direct the parser’s decisions, as PP2017 suggest, we should observe lure-match effects even from non-c-commanding QP lures. If, on the other hand, some approximation, such as the “accessible” feature of Kush et al. (2015), is used, non-c-commanding QP lures will not elicit lure-match effects.

Third, and finally, I perform a series of two experiments addressing an issue I ran into: I failed to find strong evidence for lure-match effects even from NP lures. Thus, I perform a direct replication of PP2017 Experiment 3 to determine the robustness of the effects they report. These replications also help to verify whether extra-syntactic factors, such as the (non-)nativeness of the experimenter, might have affected the results. If such factors do affect lure-match effects, this would be more readily compatible with the PP2017 account. The two experiments addressing replication issues are reported in Chapter 4.
Chapter 2: Interference effects from non-c-commanding lures

I am going to start the thesis by investigating the only strong evidence against the S2017 account: lure-match effects for lures embedded inside a relative clause (we will call them “RC-lures” for short), e.g. “the students” in “The tea that the students drank calmed themselves” (Parker and Phillips, 2017, Exp.2). This evidence is problematic for S2017: his account crucially relies on a null operator in the left periphery of complement clauses to account for lure-match effects and says nothing about relative clauses. To save the situation, S2017 suggests an ad hoc explanation: perhaps the lure-match effects observed with RC-lures reflect a grammatical strategy of sub-command binding (see below). While different in substance, this explanation is similar in spirit to S2017’s main account: it attributes lure-match effects to a grammatical mechanism, and not to faulty processing routines. It is important to empirically test this suggestion. If it turns out that S2017 is incorrect and we have to invoke processing errors to cover that evidence, the support for his account weakens. At the very least, it would have to be modified to explain why in some cases the structure does accurately constrain the search and in other, very similar cases, it does not. In what follows, I briefly explain the phenomenon of sub-command binding and discuss my experiment testing S2017’s hypothesis.

Sub-command binding is a grammatical phenomenon observed in Chinese. Several empirical facts are relevant for this discussion. First, the anaphor ziji requires animate antecedents (Tang, 1989)[1], as demonstrated by (1).

(1) a. Wo taoyan ziji.
       I dislike ANA
       I dislike myself.
    b. Xiaomao zai tian ziji de lian.
       little cat is lick ANA DE face
       The kitten is licking its own face.
    c. *Men guanshang le ziji.
       door close PER ANA
       The door closed itself.
Second, ziji can have a non-c-commanding NP (say, NP1) as its antecedent, as long as this NP1 is embedded within another NP2 which does c-command ziji - this phenomenon is referred to as “sub-command binding” (2). Third, sub-command binding is only possible if the embedding NP2 is inanimate (3) (Tang, 1989).

(2) a. [[Zhangsan_i de] jiaoao]_j hai le ziji_i/*j
       Zhangsan_i DE pride hurt PER ANA_i
       Zhangsan_i’s arrogance harmed him_i
    b. [[Zhangsan_i zuoshi xiaoxin de] taidu]_j jiu le ziji_i/*j yiming
       Zhangsan do thing careful DE attitude save PER ANA one life
       Zhangsan’s cautious attitude saved him.

(3) a. [[Zhangsan_i de] baba]_j dui ziji_*i/j mei xinxin.
       Zhangsan DE father to ANA no confidence
       Zhangsan_i’s father_j has no confidence in himself_*i/j
    b. [[Zhangsan_i pengdao de] nage ren]_j dui ziji_*i/j mei xinxin
       Zhangsan meet DE that-CL man to ANA no confidence
       The man_j that Zhangsan_i met had no confidence in himself_*i/j

[1] At least, this is the accepted generalization. But see the discussion of Charnavel and Huang (2018) later in this section.

To capture these patterns, Tang (1989) argues that in Chinese the structural requirements on the anaphor’s binders should be slightly relaxed. She suggests the following binding principles for Chinese:

A. β sub-commands α iff
   1. β c-commands α or
   2. β is an NP contained in an NP that c-commands α or that sub-commands α, and any argument containing β is in subject position.
B. A potential binder for α is any NP which satisfies all conditions of being a binder of α except that it is not yet coindexed with α.
C. A reflexive α can be bound by β iff
   1. β is coindexed with α, and
   2. β sub-commands α, and
   3. β is not contained in a potential binder of α.

The sub-command stipulation captures the structural requirement. Condition C3 together with definition B helps to explain why sub-command binding is only possible in (2), but not in (3): only in the first case is the embedded antecedent NOT contained in a potential binder of ziji.
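To see how principles A-C interact, it may help to run them on schematic trees for (2a) and (3a). The sketch below is only an illustration under simplifying assumptions that are not Tang’s: c-command is computed from the immediate mother node, the animacy requirement of ziji is treated as one of the “conditions of being a binder”, coindexation is ignored, and the `Node` class, its flags and the `sentence` helper are hypothetical.

```python
class Node:
    """Tree node; NP nodes carry animacy and grammatical-function flags."""
    def __init__(self, label, np=False, animate=False,
                 argument=False, subject=False):
        self.label, self.np, self.animate = label, np, animate
        self.argument, self.subject = argument, subject
        self.parent, self.children = None, []

    def add(self, *kids):
        for kid in kids:
            kid.parent = self
            self.children.append(kid)
        return self

def ancestors(node):
    while node.parent is not None:
        node = node.parent
        yield node

def dominates(a, b):
    return any(anc is a for anc in ancestors(b))

def c_commands(a, b):
    # Simplified: a's mother dominates b and neither dominates the other.
    if a is b or dominates(a, b) or dominates(b, a):
        return False
    return a.parent is not None and dominates(a.parent, b)

def sub_commands(b, a):
    # Principle A: c-command, or containment in a (sub-)commanding NP,
    # with every argument that contains b sitting in subject position.
    if c_commands(b, a):
        return True
    containers = list(ancestors(b))
    if not b.np or any(n.argument and not n.subject for n in containers):
        return False
    return any(n.np and (c_commands(n, a) or sub_commands(n, a))
               for n in containers)

def can_bind(b, a):
    # Principles B and C minus coindexation: b must be an animate NP that
    # sub-commands a and is not contained in another potential binder.
    if not (b.np and b.animate and sub_commands(b, a)):
        return False
    return not any(can_bind(n, a) for n in ancestors(b))

def sentence(outer_animate):
    """[S [NP2 [NP1 Zhangsan] N] [VP V ziji]], as in (2a) and (3a)."""
    inner = Node("Zhangsan", np=True, animate=True)
    outer = Node("NP2", np=True, animate=outer_animate,
                 argument=True, subject=True).add(inner, Node("N"))
    ziji = Node("ziji", np=True, animate=True, argument=True)
    Node("S").add(outer, Node("VP").add(Node("V"), ziji))
    return inner, outer, ziji

inner_2a, outer_2a, ziji_2a = sentence(outer_animate=False)  # "pride"
inner_3a, outer_3a, ziji_3a = sentence(outer_animate=True)   # "father"
```

With an inanimate embedding NP, as in (2a), `can_bind(inner_2a, ziji_2a)` holds for the embedded NP, while with an animate embedding NP, as in (3a), only the embedding NP itself qualifies. This mirrors the role of condition C3 described above.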
Parker and Phillips (2017) only used inanimate matrix subjects when investigating RC-lures. Given this, Sloggett suggests that sub-command binding may underlie the lure-match effects from RC-lures observed by Parker and Phillips (2017). As S2017 himself points out, if this hypothesis is correct, the lure-match effects should not be observed (or at least should be observed to a smaller degree) with animate matrix subjects. Investigating such configurations is the primary goal of the current chapter.

However, before turning to the presentation of the experiment, I discuss a potential theoretical counter-argument to S2017’s suggestion. Charnavel and Huang (2018) argue that the cases of sub-command binding in fact represent instances of logophoric interpretation of the pronoun. They observe (following Huang and Liu (2001)) the following contrast:

(4) a. Zhangsan_i de baogao biaoshi tamen dui ziji_i mei xinxin.
       Zhangsan DE report indicate they to ziji no confidence
       Zhangsan’s report indicates that they had no confidence in self.
    b. *Zhangsan_i de shibai biaoshi tamen dui ziji_i mei xinxin.
       Zhangsan DE failure indicate they to self no confidence
       Zhangsan’s failure indicates that they have no confidence in him.

They argue that the contrast arises because in (4a) the matrix subject creates conditions for a logophoric interpretation: namely, it allows for an implication that Zhangsan’s perspective is conveyed by the report, while no such implication is possible in (4b). So, only in the first case is Zhangsan the perspective holder in the sentence, which makes it a possible antecedent for a logophoric pronoun.

Charnavel and Huang (2018) support their claim with the results of an acceptability judgment study. They first show that ziji is acceptable with inanimate antecedents, in contrast to the generally accepted claims in Tang (1989): sentences like (5a) received high ratings (4.69 on average, with the judgments being made on a scale from 1 (worst) to 6 (best)).
Then they show that sentences with inanimate sub-commanding antecedents (5b) receive (significantly) lower ratings as compared to (5a) (3.35 on average). From this they conclude that apparent sub-command binding results from the logophoric interpretation of the pronoun: logophoric pronouns require their antecedents to be animate, while ziji does not. Thus we only expect the inanimate sub-commanding antecedent to cause judgments to degrade if ziji is being interpreted logophorically.

(5) a. [Zhe ke shu de shuguan]_i tai chen, ya wan le ziji_i.
       this CL tree DE tree crown too heavy, burden bent ASP REFL
       [The crown of this tree]_i is too heavy. It_i bent itself_i.
    b. *[Zhe ke shu]_i de guoshi ya wan le ziji_i.
       this CL tree DE fruit press bent ASP REFL
       The fruits of [this tree]_i bent it_i.

If Charnavel and Huang’s (2018) conclusions are correct, the PP2017 results have to reflect an extra-grammatical strategy: RC-lures should only be accessible when the matrix subject creates an appropriate logophoric context. However, Parker and Phillips (2017) used stimuli like “The broken zipper that the skilled tailors tried to fix pinched themselves” or “The safety net that the brave policeman used saved themselves”. Arguably, in such sentences the embedding NPs do not make the embedded NPs prominent perspective holders (in any case, no more than “failure” in (4b)). Thus, the lure-match effect observed by Parker and Phillips (2017) cannot be reduced to the properties of the syntactic configuration.

While Charnavel and Huang’s (2018) arguments are interesting, we think that they are insufficient to reduce all cases of sub-command binding to instances of logophoric interpretation. First, Tang (1989) gives the following example:

(6) a. [[[Zhangsan_i de] baba_j de] qian]_k bei ziji_*i/j/*k de pengyou touzou le.
       Zhangsan DE father DE money BEI ANA DE friend steal PER
       Zhangsan’s father’s money was stolen by his friend.
In this example it is again not clear how “money” would encourage treating “Zhangsan’s father” as a prominent perspective center, in the way that “report” in “Zhangsan’s report” does. Second, as Huang and Liu (2001) discuss, the logophoric interpretation of ziji is subject to person-blocking effects (i.e. if a first or second person pronoun intervenes between ziji and another antecedent, ziji obligatorily refers to the first/second person pronoun) (7). However, such effects are not present in cases of sub-commanding antecedents (8).

(7) a. Zhangsan_i dui wo shuo ziji_i piping-le Lisi.
       Zhangsan to me say self criticize-Perf Lisi
       Zhangsan_i said to me that he_i criticized Lisi.
    b. *Zhangsan_i shuo wo piping-le ziji_i
       Zhangsan say I criticize-Perf self
       Zhangsan_i said that I criticized him_i

(8) Zhangsan_i de biaoqing gaosu wo_j [ziji_i/*j shi wugude].
    Zhangsan DE expression tell me self is innocent
    Zhangsan_i’s [facial] expression tells me that he_i is innocent.

These counter-examples to Charnavel and Huang’s (2018) conclusions leave S2017’s hypothesis viable. Therefore, I now turn to its experimental evaluation.

2.1 Experiment 1

2.1.1 Participants

53 members of the University of Maryland community participated in the experiment for class credit or a payment of $12 (11 M, 42 F; age range: 18-27; mean age: 20.1, SD: 1.76). The experimental session took around 45 minutes, including setup and calibration. I excluded data from 3 participants: two failed to finish the experiment due to problems with calibration, and for one the recorded reading patterns were too erratic to be used. The remaining 50 participants entered the analysis[2].

2.1.2 Materials

The study used a 2x2 factorial design with target animacy (whether the matrix subject was animate or inanimate) and lure match (whether the lure matched or mismatched the reflexive in features) as factors.
Overall, I constructed 24 sets of 4 items, exemplified in Table 2.1. The target antecedent always mismatched the reflexive in features (animacy and number for sentences with inanimate matrix subjects, and gender and number for sentences with animate matrix subjects); thus, all critical sentences were ungrammatical. The items with inanimate matrix subjects were borrowed directly from the Parker and Phillips (2017) Exp.2 materials; I made sure to select items which did not contain pronouns or gaps in the spillover regions, in order not to force additional memory retrievals. The items with animate matrix subjects were constructed from scratch. Thus, the two subsets of items were not lexically aligned; moreover, the critical pronoun was “themselves” in the first group and “himself/herself” in the second[3]. While this design choice potentially introduces more noise to the comparisons, I felt it was justified: it gives us both the opportunity to test the effect of interest (lure-match effects from lures in relative clauses with animate RC heads) and to replicate previous findings (which is important to establish the reliability of the PP2017 findings, given that no other study has investigated similar configurations).

[2] I aimed for 60, to have double the number of participants in Parker and Phillips (2017) Exp.2, which I model this study after; however, I couldn’t reach that goal due to time limitations.

[3] Changing the pronouns from “themselves” to “himself/herself” is inevitable for the sentences with animate matrix subjects. Since the target is animate, we need a reflexive which can differ from the target in gender and number to be able to create a 2-feature mismatch configuration. But “themselves” does not carry gender, so if we used “themselves”, we would only be able to induce a 1-feature mismatch (in number) with the target.

In addition to the sentences with reflexives, which were of primary interest, I used a control set of 24 agreement items borrowed from Wagers et al. (2009). An example is given in (9). A standard 2x2 design was used, with the factors being lure number and verb number. Thus, I had 6 items per condition. Head nouns were always singular. The lure and the verb were always separated by an adverb to avoid potential confounds[4].

(9) The key to the cell(s) unsurprisingly was/were rusty from many years of disuse.

[4] Briefly, people may experience a slowdown on “cells” just because it’s plural, and thus arguably more complex morphologically and semantically. If the critical verb immediately follows the noun, this slowdown can affect the reading times on the verb and potentially mask intrusion effects. See Wagers et al. (2009) for more details.

Finally, I had 60 fillers, falling into the following categories. 12 fillers had the same structure as experimental items with animate matrix subjects, so that people would not only see this construction in ungrammatical sentences. In addition, 6 fillers with the same structure used “themselves” as the reflexive, so that “themselves” was not uniquely associated with ungrammatical items. 6 fillers were of the structure I used in the previous experiments: “NP said that [NP reflexive]”; these sentences were introduced to make sure that the target was not always the linearly farthest NP. 12 fillers started with an inanimate noun heading a relative clause; these were introduced to make sure that inanimate subjects were not uniquely associated with ungrammatical sentences. 12 fillers were sentences containing a reflexive pronoun, with no constraints on the structure. Finally, 12 fillers were just sentences with no special properties. All of the fillers were grammatical.

Overall, I had 24 critical sentences with reflexives (all ungrammatical), 24 control agreement attraction sentences (12 grammatical) and 60 fillers (all grammatical), for a total of 108 sentences. Thus the grammatical-to-ungrammatical ratio was 2:1. All sentences were accompanied by a forced choice Yes/No question.

Table 2.1: Experiment 1 materials example. Targets are underlined, lures are bolded.

Animate targets
  Lure match:    The skilled businesswomen that the inexperienced manager criticized at the meeting blamed himself for the inaccurate report.
  Lure mismatch: The skilled businesswomen that the inexperienced secretary criticized at the meeting blamed himself for the inaccurate report.
Inanimate targets
  Lure match:    The thank you letter that the helpful secretaries received praised themselves for a great job.
  Lure mismatch: The thank you letter that the helpful secretary received praised themselves for a great job.

2.1.3 Procedure

For the purposes of randomization, reflexive and agreement sentences were treated as the same type. The 48 stimuli were distributed into 4 lists in a Latin Square design, for 6 agreement and 6 reflexive sentences per condition. Fillers were added to these lists, which were then pseudo-randomized with the constraint that no more than 2 experimental items occur in a row. Each experimental list started with 7 practice items.

The items were presented on an LCD screen with a resolution of 1280x720 in a 12-point fixed-width font (Courier). The maximum length of a line fitting on the screen was 139 symbols, so all items fit on a single line. Eye movements were recorded using an EyeLink 1000 tower-mounted eye-tracker. The distance between the tower and the screen was 37 inches. The sampling rate was 1000 Hz. Participants had binocular vision during the experiment, but only the samples from one eye[5] were recorded. Participants’ heads were immobilized using a chin rest and a forehead restraint. At the beginning of the session, the eye-tracker was calibrated using a 9-dot calibration display. Calibration was repeated throughout the experiment as necessary.

Each trial started with a fixation mark on the left of the screen.
Participants had to fixate it in order for the stimuli to appear on the screen. After having read the sentence, participants pressed a button in order to display a question related to the sentence. Participants answered the question by pressing one of two associated buttons. After that, the next trial started.

[5] Normally, the right one, unless tracking the left eye was the only option giving a stable calibration.

2.1.4 Analysis

I analyzed two regions of interest in each set of conditions: critical and spillover. For agreement sentences, the critical region included the verb and the following word (two following words, if the second word was a determiner). For reflexive sentences, the critical region included the reflexive and the last three letters of the preceding word. This was done to increase the chances of having at least one fixation in the critical region, following Kush et al. (2015) and S2017. The spillover region was defined for both types of stimuli as the two words following the critical region (three, if the second word was a determiner). Spaces were included in the region to the right.

In each of the regions, I looked at the following eye-tracking measures: first pass, the sum of all fixations on the region before it is exited to the right or to the left; regression path, the sum of all fixations from the moment the region is first entered from the left to the moment it is exited to the right, including fixations in the previous regions; total time, the sum of all fixations on the region. I chose these measures following S2017. PP2017 analyze first pass, regression path, right bound (the sum of all fixations on the region from the moment it is first entered from the left to the moment it is exited to the right, NOT including fixations in the previous regions) and re-read (the sum of all fixations on the region after it has been exited once).
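To make the definitions of first pass, regression path and total time concrete, here is a minimal sketch that computes them for one region from a toy fixation sequence. It is not the preprocessing pipeline used in this study: the function name, the input format (a time-ordered list of region index and duration pairs, with regions numbered left to right) and the example data are assumptions made for illustration.

```python
def reading_measures(fixations, region):
    """fixations: list of (region_index, duration_ms) in temporal order.
    Returns first pass, regression path and total time for `region`;
    first pass and regression path are None if the region was skipped
    during first-pass reading."""
    total_time = sum(d for r, d in fixations if r == region)
    first_pass = regression_path = 0
    entered = False          # region entered from the left at least once
    first_pass_open = False  # still inside the first visit to the region
    for r, d in fixations:
        if r > region:
            break            # region exited (or skipped) to the right
        if r == region and not entered:
            entered = first_pass_open = True
        if not entered:
            continue         # still reading material to the left
        regression_path += d  # includes regressions to earlier regions
        if r == region and first_pass_open:
            first_pass += d
        else:
            first_pass_open = False
    if not entered:
        first_pass = regression_path = None
    return {"first_pass": first_pass,
            "regression_path": regression_path,
            "total_time": total_time}

# Hypothetical trial: region 2 is read, the eyes regress to region 1,
# return to region 2, move on to region 3 and later re-read region 2.
example = [(1, 200), (2, 250), (1, 180), (2, 220), (3, 300), (2, 150)]
measures = reading_measures(example, region=2)
```

In this invented trial the first pass on region 2 is 250 ms, the regression path is 650 ms (250 + 180 + 220) and the total time is 620 ms (250 + 220 + 150). A region skipped on first pass yields missing first-pass and regression-path values rather than zeros.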
However, I decided not to include right bound and re-read in the analysis to reduce the number of comparisons I was making.

Prior to statistical analysis, the data were manually preprocessed using EyeDoctor[6] to remove blinks and ensure that fixations align with the text. Fixations shorter than 80 ms or longer than 1000 ms were automatically rejected before calculating eye-tracking measures using custom scripts[7]. Missing values were discarded from the analyses. See section 4.1.6 for a detailed discussion of how analysis choices might affect the final conclusions. In particular, rejecting missing values (instead of replacing them with zeros, as PP2017 do) might lead to more precise model estimates.

[6] https://blogs.umass.edu/eyelab/software/
[7] https://github.com/UMDLinguistics/EyePy

I chose to log-transform reading times to bring the models’ residuals closer to normality. The transformed values were analyzed with linear mixed effects models in R (R Core Team, 2014) using the lme4 package (Bates et al., 2015b). Separate models were fit to the data from reflexive and agreement conditions. The following fixed effects were specified (numerical values in parentheses indicate contrast coding coefficients; sum coding was used for all fixed effects). Agreement conditions: target match (match = -0.5 vs. mismatch = 0.5), lure match (match = -0.5 vs. mismatch = 0.5) and their interaction. Reflexive conditions: target animacy (animate = -0.5 vs. inanimate = 0.5), lure match (match = -0.5 vs. mismatch = 0.5 in features with the reflexive) and their interaction. A fixed effect was taken to be significant if the absolute value of the associated t statistic was ≥ 2 (Gelman and Hill, 2007). The models’ random effects structure was fully specified, including random intercepts and slopes for each fixed effect per subjects and items, following the recommendations of Barr et al. (2013).
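The ±0.5 coding determines how the fixed-effect coefficients should be read. As a rough illustration (ignoring random effects and assuming a perfectly balanced design), the sketch below recovers the intercept, the two main effects and the interaction directly from four cell means. The cell means are hypothetical log reading times, not values from this experiment, and the function names are invented.

```python
# Hypothetical cell means of log reading times, keyed by the contrast
# codes (target_animacy, lure_match) with levels -0.5 and +0.5.
cell_means = {(-0.5, -0.5): 6.0, (-0.5, 0.5): 6.2,
              (0.5, -0.5): 6.1, (0.5, 0.5): 6.5}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def coefficients(m):
    """Coefficients implied by +/-0.5 coding in a balanced 2x2 design."""
    intercept = mean(m.values())  # grand mean
    # Main effects: difference between the two marginal means.
    effect_a = (mean(v for (a, _), v in m.items() if a > 0)
                - mean(v for (a, _), v in m.items() if a < 0))
    effect_b = (mean(v for (_, b), v in m.items() if b > 0)
                - mean(v for (_, b), v in m.items() if b < 0))
    # Interaction: difference between the two simple effects of b.
    interaction = ((m[(0.5, 0.5)] - m[(0.5, -0.5)])
                   - (m[(-0.5, 0.5)] - m[(-0.5, -0.5)]))
    return intercept, effect_a, effect_b, interaction

def predict(coefs, a, b):
    intercept, effect_a, effect_b, interaction = coefs
    return intercept + effect_a * a + effect_b * b + interaction * a * b

coefs = coefficients(cell_means)
```

With this coding a main-effect coefficient equals the difference between the two levels averaged over the other factor, and the interaction coefficient equals the difference between the two simple effects. Since the estimates are on the log scale, they correspond to ratios rather than millisecond differences.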
If the model with the maximal random effect structure did not converge[8], I simplified it by removing random slopes. To be internally consistent, I always dropped the random effects in the order shown in Table 2.2. I start with removing the interactions as the most complex term (see, e.g., Bates et al. (2015a), p.6, for a claim that this can be considered a standard approach[9]). Then, I remove random effects by item, making an assumption that the variability among items is smaller than among participants, and that random slopes by items would make a smaller contribution to the model.

[8] As determined by diagnostics reported by lmer. One has to access the internal structure of the model fit and look at the information contained in m@optinfo$conv$lme4, where m is the model object.

[9] The following StackExchange discussion may also be of interest: https://stats.stackexchange.com/questions/323273/what-to-do-with-random-effects-correlation-that-equals-1-or-1

Table 2.2: Order of random effect structure simplification in Exp.1. lme4 syntax is used: “:” indicates interactions and “|” separates grouping factors. Effects were removed from the model, starting from the top.

Agreement                            Reflexives
verb.num : lure.num | item           target.animacy : lure.match | item
verb.num : lure.num | participant    target.animacy : lure.match | participant
verb.num | item                      lure.match | item
lure.num | item                      lure.match | item
verb.num | participant               target.animacy | participant
lure.num | participant               target.animacy | participant

As a final step, all models were automatically assessed to check whether any random effects had correlations of 1 or -1, suggesting that a given random effects structure may be too complicated to estimate given the data (Bates et al., 2015a). Such random effects were dropped from the model specification, starting with the interactions[10].

[10] Admittedly, this is a somewhat simplistic approach to dealing with overly complex random effect structures. See Bates et al. (2015a) for a more principled way to identify redundant random effects in lmer models.

2.1.5 Results

Mean question answering accuracy was 91%. Fig.2.1 and 2.2 show the observed reading time means for agreement and reflexive sentences respectively. The full set of model coefficients is reported in Appendix C.

I will start the discussion with the results for the control agreement sentences. I found evidence for the effect of grammaticality (ungrammatical sentences being read slower): the main effect of target match was significant in all three reading measures at the critical region and in regression path at spillover. The effect of lure match was significant in regression path and total times in the critical region: when the lure matched the verb in number, the sentences were read faster. Interestingly, these effects were not moderated by a significant interaction: sentences with matching lures were read faster regardless of whether the sentence was grammatical or not. This means that I did not find enough evidence for grammatical asymmetry, and the data are consistent with the presence of both illusions of grammaticality and illusions of ungrammaticality.

The target match x lure match interaction did reach significance, but only in regression path at the spillover region. Pairwise comparisons confirm that the interaction is driven by the simple main effect of lure match within ungrammatical sentences: sentences with the lure matching the verb were read faster. This pattern of results is interesting: the interaction in the spillover region is indicative of the classical attraction pattern (only illusions of grammaticality arise). On the other hand, the main effect of lure match and the absence of an interaction at the critical region are in line with the conclusions of Hammerly et al. (draft.april.2018): both illusions of grammaticality and illusions of ungrammaticality can be observed for subject-verb agreement. I return to this pattern in the discussion section.
Now turning to the reflexive conditions, three effects reached statistical significance. First, two main effects of target animacy in total times, at the critical and spillover regions: sentences with animate targets were read slower. Second, the main effect of lure match in total times at spillover: sentences with matching lures were read faster. This last effect provides some evidence that a) lures within RCs can provoke interference and b) this effect does not depend on the animacy of the matrix subjects, which does not support Sloggett’s suggestion. I will argue that these conclusions are tentatively correct, although I will provide a more fine-grained discussion below.

Figure 2.1: Experiment 1. RT means for agreement sentences. Columns correspond to eye-tracking measures: fp - first-pass, rp - regression path, tt - total times. Rows correspond to ROIs. Error bars represent standard error of the mean, adjusted for participant variability (Cousineau, 2005; Morey, 2008).

Figure 2.2: Experiment 1. RT means for reflexive sentences. Columns correspond to eye-tracking measures: fp - first-pass, rp - regression path, tt - total times. Rows correspond to ROIs. Error bars represent standard error of the mean, adjusted for participant variability (Cousineau, 2005; Morey, 2008).

Multiple comparison corrections

Von der Malsburg and Angele (2017) suggest that corrections for multiple comparisons should be applied in eye-tracking studies when several models are fit to different regions of interest and eye-tracking measures. Their simulations suggest that the Bonferroni correction is an appropriate one and does not result in losing too much power. In the current experiment I fit 2 (reflexive and agreement) x 2 (ROIs) x 3 (measures) = 12 models. The Bonferroni correction would result in the following corrected α: 0.05/12 ≈ 0.004. The corresponding t-value for a two-sided test is ±2.87[11].
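The arithmetic behind the corrected threshold can be checked with a few lines of code. The sketch below is a Python counterpart of the normal approximation, not the script used in the thesis. It computes the two-sided critical value both for the unrounded α (0.05/12) and for the rounded α of 0.004, and the variable names are chosen for illustration.

```python
from statistics import NormalDist

n_models = 2 * 2 * 3             # sentence types x ROIs x measures
alpha_corrected = 0.05 / n_models

std_normal = NormalDist()        # mean 0, standard deviation 1
# Two-sided critical value: upper quantile at 1 - alpha/2.
critical_unrounded = std_normal.inv_cdf(1 - alpha_corrected / 2)
# Same computation with alpha rounded to 0.004 before dividing by two.
critical_rounded = std_normal.inv_cdf(1 - 0.004 / 2)
```

Both values fall close to the ±2.87 threshold used here; rounding α to 0.004 first gives a marginally stricter cutoff than using 0.05/12 directly. The difference is too small to change which effects survive the correction unless an effect has a t-value within a few hundredths of the threshold.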
Almost all significant effects survive the correction, with two exceptions: in agreement sentences, the effect of lure match in regression path at the critical region; in reflexive sentences, the main effect of lure match in total times at the spillover region.

¹¹ Approximated from the normal distribution following Jäger et al. (2015). The corresponding R command is qnorm(0.004/2).

2.1.6 Exploratory analysis

In my main analysis I followed the procedure from S2017, since I consider it better along several dimensions (see Chapter 4 for discussion). However, I want to check whether the choice of analysis procedure matters. Thus, in addition to the main analysis, I also perform a set of exploratory analyses to better understand the alignment between my results and those of Parker and Phillips. First, I calculated averages for right-bound and re-read times: these measures were not included in the main analyses to reduce the number of statistical comparisons, but they allow for a closer comparison of average RT patterns with PP2017. Second, I analyzed the extended set of ROIs with the same procedure that PP2017 used: the critical region was defined as including the reflexive alone, without any material to the left; missing values were replaced with zeros; no trimming of exceedingly long reading times was performed.

I have also performed additional exploratory statistical analyses. In addition to the main model, I look at simple main effects of lure match by levels of target animacy: in the Parker and Phillips (2017) design the lure-match effect was a simple main effect, while in my case I am looking at main effects and interactions.

The coefficients for the statistical models are presented in Fig. 2.3. Each dot in the plot corresponds to a β̂ value from some model. We consider models fit to 4 different datasets, resulting from the variation of pre-processing choices. These 4 datasets are plotted along the Y axis of each plot.
Each line in the plot corresponds to a variant of analysis, with the variable treatments delimited by underscores. The first component corresponds to the study; the second component corresponds to the critical region markup (“nonext” - critical region includes only the reflexive itself; “ext” - critical region includes the reflexive and three characters to the left); the third component corresponds to missing values treatment (“narm” - remove, “nazero” - replace with zero). Each column corresponds to a fixed effect in a given region: TA - target animacy; LM - lure match; TA x LM - their interaction; prws A/IA - pairwise comparisons, corresponding to simple main effects of lure match within target animacy conditions. The first five columns display estimates obtained for the critical region, the last five columns - estimates for the spillover region.

Notice that the coefficients are not easily interpretable on their own, since they are on the log scale. Thus, I will only discuss what inferences one would make if one were simply looking at the coefficients and checking whether they are significantly different from zero. I take coefficients with |t| > 2 to be significantly different from zero, and mark them in red.

I discuss first what conclusions Parker and Phillips (2017) would have made based on my data. We need to look at the “nonext na.zero” version of analysis (i.e., critical region comprising only the reflexive itself with missing values replaced with zeros), the “Prws IA” column (since PP2017 only had that single condition) and all eye-tracking measures except for total times. In their analyses, PP2017 found a significant lure-match effect in first-pass, right-bound and re-read times at the critical region and in re-read times at spillover. As can be seen from Fig. 2.3, in my data only the last effect is replicated.
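The grid of analysis variants described above can be sketched in a few lines. This is only an illustration under stated assumptions: the study label "exp1", the helper names, and the toy reading times are all invented here and are not taken from the experiment; only the component names (“nonext”, “ext”, “narm”, “nazero”) and the |t| > 2 rule come from the text.

```python
from itertools import product

# Reflexive only vs. reflexive plus three characters to the left
REGION_MARKUPS = ("nonext", "ext")
# Remove missing values vs. replace them with zero
MISSING_TREATMENTS = ("narm", "nazero")


def variant_labels(study):
    """Build underscore-delimited labels for all pre-processing variants."""
    return [f"{study}_{region}_{missing}"
            for region, missing in product(REGION_MARKUPS, MISSING_TREATMENTS)]


def treat_missing(reading_times, treatment):
    """Apply a missing-value treatment; None marks a missing reading time."""
    if treatment == "narm":
        return [rt for rt in reading_times if rt is not None]
    if treatment == "nazero":
        return [0 if rt is None else rt for rt in reading_times]
    raise ValueError(f"unknown treatment: {treatment}")


def flagged(estimate, std_error, threshold=2.0):
    """Mirror the |t| > 2 rule of thumb used to mark coefficients."""
    return abs(estimate / std_error) > threshold


labels = variant_labels("exp1")        # "exp1" is a made-up study label
toy_rts = [250, None, 310, None, 280]  # invented values, not experimental data

kept = treat_missing(toy_rts, "narm")
zeroed = treat_missing(toy_rts, "nazero")
mean_narm = sum(kept) / len(kept)
mean_nazero = sum(zeroed) / len(zeroed)

print(labels)
print(mean_narm, mean_nazero)
```

In this toy case, replacing missing values with zeros pulls the mean down considerably, which is one plausible reason why the variants can disagree about which effects cross the threshold.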
Turning to other possible analysis variants, we can see that the Target animacy x Lure match interaction virtually never reaches significance, except in regression path at spillover for analysis variants where missing values are replaced with zeros. The main effect of lure match is mostly detectable when one chooses a critical region consisting of the reflexive alone (all eye-tracking measures except total times if one removes the missing values, and right-bound and regression path if one replaces them with zeros). If one chooses an extended critical region, the only significant effect of lure match is observed in right-bound times. At spillover, the effect is observable in total times for the variants of analysis removing missing values - this is what I reported in the main analysis.

Figure 2.3: Sensitivity analysis: model coefficients in Exp.1. Error bars represent standard deviation of the estimates. See text for further description of columns and row labels.

The simple main effect of lure match for sentences with animate matrix subjects is only significant in regression path if one chooses to look at the reflexive only and to remove missing values. Simple main effects of lure match for sentences with inanimate matrix subjects show up as significant somewhat more often: at the critical region, they appear in total times if one looks at the reflexive only and removes the missing values. At spillover, they appear in regression path for the variants of analysis replacing missing values with zeros and in total times for all variants of analysis.

2.1.7 Discussion

I start the discussion with the control agreement attraction sentences. I have found reliable evidence for agreement attraction, which suggests that the experiment worked as intended and that interference effects can be detected in the data I collected. Interestingly, I did not always observe the classical pattern: lure-match effects only within ungrammatical sentences.
I did observe it in regression path at the spillover; however, at the critical region there was evidence for lure-match effects within both grammatical and ungrammatical conditions. That is, I observed both illusions of grammaticality and ungrammaticality, supporting the claims by Hammerly et al. (draft.april.2018) that grammatical asymmetry is observed only in a proportion of the experiments. Also in line with their conclusions is the fact that the current experiment had the lowest grammatical-to-ungrammatical ratio among all my experiments which included agreement conditions (2:1 vs. 3:1 in Exp.3 and 5:1 in Exp.4). Hammerly et al. (draft.april.2018) observed that as the proportion of ungrammatical items increases, the illusions of ungrammaticality are more likely to be observed. Notice though that in their study, illusions of ungrammaticality were not observed even when the grammatical-to-ungrammatical ratio was 1:1 (even lower than in the current experiment), and only appeared when the ratio was lowered to 1:2. Thus, while the direction of the influence appears to be similar in my experiment and in Hammerly et al. (draft.april.2018), the cut-off points for the emergence of ungrammaticality illusions do not agree. Therefore, I prefer not to make theoretical claims about the cause of the symmetrical lure-match effect I observe, only noting that the effect is, indeed, symmetrical, contrary to the accepted generalization.

This symmetry bears on the main question I address in my thesis. As I discussed in Chapter 1, the cue-based model by Lewis and Vasishth (2005) is not able to accommodate such patterns. While the addition of the prominence correction suggested by Engelmann et al. (draft4) would allow the model to capture those patterns, it is not at all clear whether there are linguistic reasons for applying the prominence correction (i.e., why would we think that “cabinets” is more prominent than “key” in “The key to the cabinets are...”?).
Thus, symmetrical agreement attraction patterns are a point against the Lewis and Vasishth (2005) model. Notice that this model underlies both the PP2017 and the S2017 explanations for the reflexive data, so if we end up doubting it, we will have to doubt these accounts as well. This is especially so for PP2017, who explain the lure-match effects in terms of faulty memory search. It is less critical for S2017: while he uses the cue-based architecture to capture the fact that lure-match effects are only observed in sentences with targets mismatching the reflexive in two features, the lure-match effect itself does not rely on the properties of cue-based memory search. In principle, any search method which only searches the local domain of the reflexive and mostly returns the overt target, but occasionally returns the logophoric operator, could work for the S2017 model. (Although it would still have to explain somehow why this other search method is more prone to return the logophoric operator when the target mismatches in features.) Thus, symmetrical agreement attraction may be slightly more problematic for PP2017.

Now I turn to reflexive conditions, which are of main interest. The central goal of the experiment was to further investigate the lure-match effects from lures inside a relative clause (“RC-lures”). This configuration is interesting, since this is the only instance of lure-match effects not covered by the S2017 account. As such, it may be the only evidence for lure-match effects emerging from ungrammatical dependency resolution and for sentence processing not respecting structural constraints. I asked two questions. First, how strong is the evidence for the existence of lure-match effects from RC-lures? Unlike the effects with c-commanding lures, which have been replicated in several experiments, the RC-lures configuration has only been tested once, by Parker and Phillips (2017).
If it turns out that the effects were artifactual, they cannot be used as an argument against the S2017 account. Second, does the animacy of the matrix subject affect the lure-match effect? If the lure-match effect truly reflects ungrammatical resolution of the dependency, it should not. On the other hand, if, as S2017 suggested, the effect reflects a grammatical phenomenon of sub-command binding, we should only observe it when the matrix subject is inanimate. I discuss these questions in turn.

The first question I ask is: have I managed to replicate PP2017, and do we have reasons to believe that lure-match effects from RC-lures in “inanimate” conditions are real? I argue that the experiment does provide evidence for this, albeit limited. On the one hand, statistical analysis does confirm the presence of lure-match effects in total times at spillover. An exploratory analysis also shows that if I had followed the Parker and Phillips (2017) procedure, I would have found a simple main effect of lure match within sentences with inanimate matrix subjects in regression path at the spillover - this is in line with the results of PP2017, who also found such an effect. It also appears that regardless of the pre-processing decisions, at least some effect indicative of interference would come out significant.

I have to note, though, that these conclusions should be taken with a grain of salt. First, the main effect of lure match in total times I observed in my primary analysis would not survive the Bonferroni correction for multiple comparisons. If I were being conservative, I would not claim that I found evidence for the interference. Furthermore, it may be worrying that prima facie inconsequential choices one makes during pre-processing (e.g., whether the critical region does or does not include the last three characters of the word preceding the reflexive) affect when and which exact effects indicative of the interference become significant.
To us, this indicates that the amount of evidence available in the data is not overwhelming, and more highly powered experiments are needed to draw firmer conclusions. One final worry is that the patterns I observe do not completely align with what Parker and Phillips (2017) report. They observed significant lure-match effects in four eye-tracking measures in the critical region, while I did not. Also, as Fig. 2.4 shows, the numerical magnitudes of the lure-match effects are about half the size of what PP2017 report. I will defer the discussion of these numerical differences until later chapters, where such differences emerge again and are investigated in more detail.

Figure 2.4: Experiment 1. Comparisons of the interference effect magnitudes in reflexive conditions with Parker and Phillips (2017). The interference effect is calculated as the difference between lure match and lure mismatch conditions. Error bars represent standard error of the difference of the means (calculated under the assumption that RTs in lure match and lure mismatch conditions are not correlated; this is likely false, since they come from the same subjects, and thus the SEs are overestimates).

Now that I have tentatively concluded that my results are roughly in line with the PP2017 conclusions, the next question is: do I have any evidence that the lure-match effect from RC-lures differs within “animate” conditions? This is the critical question for testing the S2017 subcommand hypothesis: if it is correct, I should not find any lure-match effects in “animate” conditions. The evidence appears to speak against the S2017 hypothesis, although again, in a rather limited way. The only statistically significant effect pointing in this direction is the main effect of lure match in total times at spillover, not accompanied by a significant target animacy x lure match interaction.
This result suggests that there is not enough evidence to conclude that the lure-match effect behaves differently depending on the animacy of the matrix subject¹². Numerical values suggest similar conclusions. Fig. 2.4 shows the magnitude of the lure-match effect in “animate” and “inanimate” conditions along with the effects from the original PP2017 study. We can see that regardless of the pre-processing procedure, lure-match effects are present in the “animate” conditions. The lure-match effect in conditions with inanimate matrix subjects is small, as I have already discussed. In contrast, in the sentences with animate matrix subjects the effect is around 50 ms in magnitude, which, as we will see, brings it within the range of interference effects I observe in the rest of my experiments. In the early eye-tracking measures (first-pass, right-bound, spillover) it aligns rather closely with the original lure-match effect from PP2017. In the late measures (re-read, total times) it instead patterns together with the lure-match effect in the sentences with inanimate heads. I take these patterns to suggest that lures within an RC do provoke lure-match effects even when the matrix subject is animate. This conclusion holds if I adopt the PP2017 pre-processing routine, although in this case the magnitude of the effect in animate conditions goes down.

Just the presence of a lure-match effect in “animate” conditions does not necessarily contradict S2017’s hypothesis: the effect could be merely reduced, not eliminated, when the matrix subject is animate. However, the numerical patterns do not seem to support this possibility. In almost all measures in both pre-processing variants (except for late measures with PP2017 pre-processing) the lure-match effect in “animate” conditions is equal to or bigger than the effect in “inanimate” conditions. Overall, I conclude that the experiment provides some evidence against S2017’s sub-command explanation, although the evidence is merely suggestive and should be verified in a highly powered confirmatory experiment.

¹² If we look at the patterns of the interference effects in Fig. 2.4, we will notice that this effect is likely driven uniquely by the lure-match effect within “inanimate” conditions. Exploratory pairwise comparisons confirm this impression. This would support the S2017 subcommand hypothesis. However, these impressions come from exploratory analysis, and as I discuss below, other patterns in these analyses still suggest that lure-match effects do arise in “animate” conditions.

Chapter 3: Interference effects with quantificational lures

In the previous chapter we have shown that the S2017 sub-command explanation for relative clause lures is likely wrong, and at least some lure-match effects might be better explained as instances of fallible memory access. This serves as support for the PP2017 account, which claims that structural cues do not have special status in the system: they do constrain memory access to some degree, but can be outweighed by other sources of information. In this chapter I am going to further test the PP2017 account by looking at quantificational (QP) lures. The main question I address here is: are all structural cues treated equally by the parser? In particular, I will argue that available evidence tells us little about how c-command information is used, and I will report three experiments designed to fill this gap.

In this chapter, I am going to adopt the view of Chomskyan Binding Theory, in which possible antecedents should c-command the anaphor and be local to it in some relevant sense. For the purposes of my experiments, locality can be reduced to clause-mateness.
This simplification has an additional advantage: it makes it easy to represent locality as a cue in a content-addressable architecture; theoretical descriptions of locality conditions, however, are more involved and can be trickier for cue-based models to accommodate (see, e.g., Cunnings et al. (2015) for an exploration of less canonical configurations).

Having clarified these assumptions, what evidence does PP2017 provide for the nature of these structural constraints? Their evidence, undoubtedly, suggests that locality constraints are being used but do not uniquely guide memory access: without making this assumption, we would have a hard time explaining the contrast between the absence of lure-match effects in 1-feature target mismatch conditions and their presence in 2-feature target mismatch sentences.

The situation with c-command is more complicated. Prima facie, Exp.2 (where lures were located inside a subject relative clause) suggests that c-command fails to uniquely guide retrieval as well. However, this pattern of results does not tell us anything about whether or how c-command information was used.

First, the same pattern of results could hold if the parser never even included c-command information in the set of retrieval cues and instead only relied on locality. In Exp.2 PP2017 only looked at 2-feature target mismatch configurations. There, locality would fail to rule out a non-local antecedent, and if no c-command information were used to further restrain memory access, both c-commanding and non-c-commanding lures could potentially be retrieved. Although this is theoretically possible, we consider it unlikely, based on empirical evidence from other dependencies. Antecedents c-commanded by the pronoun appear to be excluded from consideration, in accordance with Binding Theory Principle C (Kazanina et al., 2007; Aoshima et al., 2009; Kazanina and Phillips, 2010). Similarly, Kush et al.
(2015) show that non-c-commanding quantified NPs do not appear to induce lure-match effects in pronoun resolution. This evidence suggests that at least some way of representing c-command constraints should be available to the parser. Since reflexive resolution is also constrained by c-command, a priori it would be odd to assume that the parser can represent such information but prefers not to use it specifically with reflexives.

However, the fact that c-command is used by the parser in some manner does not necessarily imply that it is represented in such a way that it faithfully captures all and only licit c-command relations in the grammar. As we will discuss, faithfully representing c-command relations in cue-based memory models is not straightforward. Accordingly, various approximate encodings have been suggested (Alcocer and Phillips, 2012; Kush et al., 2015). Some of these approximations (e.g., the accessible feature of Kush et al. (2015)) are formulated in such a way that in the configurations used by Parker and Phillips (2017) they would not render the lures illicit. If this is the case, it would be hard to talk about the feature failing to categorically constrain processing: it would do so, only the constraints it would follow would not perfectly map to c-command constraints defined in the grammar. This conclusion would not align with the PP2017 account of lure-match effects, which assumes that any structural cue can fail to categorically restrict retrieval. Thus, it is important to empirically evaluate the sensitivity of lure-match effects to the c-command status of the antecedent. In what follows, I will first discuss the approaches one might use to represent c-command information, and then show how we could experimentally differentiate them.
3.1 Representing c-command in cue-based models

C-command is defined as follows (Reinhart, 1976): a node A c-commands a node B if neither A nor B dominates the other, and the first branching node which dominates A also dominates B. From the point of view of Chomskyan Binding Theory, reflexives are bound by their antecedents, and for that, a c-command relation between the binder and the bindee is necessary. Thus, a priori we might expect that the parser will be able to encode and use c-command information.

On the other hand, c-command is inherently relational: one has to know the position of two nodes in a tree in order to figure out whether one of them c-commands the other. And as Alcocer and Phillips (2012) discuss, representing relational information in an efficient way in cue-based architectures may not be straightforward: one cannot just use a feature like “+ c-commanded” or “+ c-commander”; it would have to be much more specific, something like “+ c-commanded by NP17”. Given the number of potential c-commanders a given node can have, if one attempts to encode these relations explicitly, the size of the representation and the processing effort required to update the information on the older nodes as new nodes come in may quickly get out of hand. Additionally, as Alcocer and Phillips note, exhaustive encoding might be inefficient in restricting the range of constituents being considered during memory retrieval.

To demonstrate this, they discuss a scenario in which each node in the tree carries a complete list of its c-commanders. When a c-commander needs to be retrieved, this list (or its individual components) is included in the set of retrieval cues. However, this could lead to several problems. In the ACT-R model, the activation for the items which do not match some of the retrieval cues is decreased, in proportion to the number of the mismatching cues.
With this encoding approach, any single item would only match one of these cues, receiving a sizeable mismatch penalty for all the others. This activation drop could lead to longer retrieval times or even retrieval time-out. In addition, retrieving an incorrect item would be easier: a constituent might have many c-commanders utterly irrelevant to the dependency at hand, such as complementizers or functional projections. Yet they will receive an activation bump and might end up being misretrieved.

A related approach (not discussed by Alcocer and Phillips (2012)) would be to annotate each chunk with a list of phrases it c-commands. When the parser reaches the reflexive, it would use the reflexive’s ID to create a single retrieval cue, e.g. “+ c-commands NP17”. This approach would escape some of the problems of the previous one, but would also bring new issues. The mismatch penalty problem of the previous approach would be circumvented, since only a single c-command-related retrieval cue would be used. However, the efficiency of this cue will be very low, making it all but useless¹: due to the fan effect among all of the items matching the c-command cue, the activation boost to any single one will be rather small. The problems of contacting completely irrelevant items and of spreading redundant information across the representation are shared with the previous approach.

¹ That is, if we accept the standard assumption that cues are combined additively. If instead we assume that they are combined multiplicatively, so that a mismatch on any single cue drops the activation significantly, we will avoid the fan effect problem. But we will also make the structural cues rather powerful, potentially removing any way of ever retrieving a structurally inappropriate constituent. This may or may not be desired: e.g., this change would fit rather well in the S2017 model, but not the PP2017 one.
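A toy calculation can make the contrast between the two exhaustive encoding schemes concrete. The sketch below assumes a simplified ACT-R-style additive activation: each matching cue contributes its share of a fixed attention budget multiplied by an associative strength that shrinks with the log of the cue's fan, and each mismatching cue subtracts a fixed penalty. The parameter values (`MAX_ASSOC`, `PENALTY`, `TOTAL_WEIGHT`, zero base activation) and all function names are illustrative assumptions, not values or code from Lewis and Vasishth (2005) or Alcocer and Phillips (2012).

```python
import math

MAX_ASSOC = 1.5     # assumed maximum associative strength
PENALTY = 1.0       # assumed cost per mismatching cue
TOTAL_WEIGHT = 1.0  # assumed attention budget shared by all cues


def activation(cue_matches, cue_fans, base=0.0):
    """Additive activation of one candidate item.

    cue_matches: one bool per retrieval cue (does the item match it?)
    cue_fans: number of items in memory that match each cue
    """
    weight = TOTAL_WEIGHT / len(cue_matches)
    total = base
    for matches, fan in zip(cue_matches, cue_fans):
        if matches:
            total += weight * (MAX_ASSOC - math.log(fan))
        else:
            total -= PENALTY
    return total


def exhaustive_list(n_commanders):
    """Scheme 1: every c-commander of the reflexive is its own retrieval cue.

    A candidate matches exactly one of those cues and mismatches the rest.
    """
    matches = [True] + [False] * (n_commanders - 1)
    fans = [1] * n_commanders
    return activation(matches, fans)


def single_cue(n_commanders):
    """Scheme 2: one cue such as '+ c-commands NP17', matched by every c-commander."""
    return activation([True], [n_commanders])


for n in (1, 2, 5, 10):
    print(n, round(exhaustive_list(n), 2), round(single_cue(n), 2))
```

Under these assumptions, the first scheme loses activation through accumulating mismatch penalties, while the second loses it through the shrinking boost as the fan grows, which is the qualitative pattern described above.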
The above challenges to exhaustive encoding of c-command information suggest that if one wants to efficiently represent relational information in a cue-based system, one has to rely on approximations. Alcocer and Phillips (2012) suggest two such approximations. The first one relies on a binary “command-path” feature, which every node carries. It is switched to “on” just in case a node c-commands the node which is currently being introduced to the parse; otherwise it is set to “off”. This encoding schema has two disadvantages. First, it does not allow the parser to use c-command information to make attachment decisions. Second, and more importantly, it only allows the system to know which nodes c-command the node which is just being introduced. It does not let the parser know which nodes the current node c-commands (which may be necessary in head-final languages), nor does it allow the parser to figure out c-command relations for nodes which are already part of the parse (which might be needed in special configurations, e.g. those involving phonologically null pronouns, whose existence is only recognized when the following node is being processed).

A second heuristic by Alcocer and Phillips (2012) relies on “dominance spines”. A dominance spine is a chain of nodes dominating one another along a right branch in the tree. Every time a left branch is created, it is assigned a new dominance spine index. In order to figure out whether two nodes stand in a c-command relation, one just needs to know whether their parents share the same dominance spine. In our case, when the parser reaches the dependency tail, it would look up the dominance spine index of the tail’s parent and look for other nodes whose parents also have this index². The limitations of this approach are as follows.
It does not capture certain kinds of c-command relations; specifically, when a node is embedded too deeply in a left branch, the dominance spine heuristic will fail to capture its c-command relations with the nodes which c-command the top of the branch. In addition, this approach still has (arguably minor) problems with potentially leading to retrievals of irrelevant c-commanding nodes.

² In order for this schema to work, each node would also have to encode some information about its parent, including the parent’s dominance spine.

All of the approaches above, while differing in details, have one thing in common: they are trying to capture c-command relations proper, and the encoding schemas they suggest could be used by any kind of mechanism requiring c-command information. This generality is an advantage of these approaches and it brings them into closer alignment with grammatical theories. It also leads to a situation where c-command information fails to be a strong filter, making only minor contributions to constraining memory access due to the fan effect (at least as long as one assumes that retrieval cues are combined additively). This may be seen as another advantage, if we take empirical evidence from subject-verb agreement, reflexives and NPIs to indicate that, indeed, c-command information may fail to strictly constrain the range of retrieved constituents. It would be more of a disadvantage for hypotheses which postulate that structural information helps to categorically filter out illicit dependency elements (e.g., Sloggett, 2017).

A different kind of approach is advanced by Kush et al. (2015). They acknowledge the complications with exhaustive encoding of relational information in cue-based models and also use an approximation. However, the feature they suggest, accessible, is designed to capture a very specific use case of c-command: binding by quantifiers. In a series of experiments, Kush et al.
(2015) show that the parser is sensitive to the binding constraints on quantificational antecedents. First, they compare sentences like (1b), where the QP c-commands the pronoun, to sentences like (1a), where it does not. They show that people experience more processing difficulties in (1a) as compared to (1b), as indexed by a slow-down in first-pass and right-bound times in the spillover region. This is interpreted as evidence for people not considering the grammatically inaccessible QP as the antecedent and either trying to bind the pronoun to a gender-mismatching “Kathi” or trying to coerce a sentence-external referent. These results show that the parser can distinguish between grammatically appropriate and grammatically inappropriate QP antecedents.

(1) a. Kathi didn’t think any janitor liked performing his custodial duties, but he had to clean up messes left after prom anyway.
    b. Kathi didn’t think any janitor liked performing his custodial duties when he had to clean up messes left after prom anyway.

Second, Kush et al. also show that the parser appears to rule out inappropriate QPs categorically: people appeared to experience similar processing difficulties in both (2a) and (2b), despite the fact that in the second case the QP antecedent fully matches the pronoun in features. That is, QP lures do not appear to induce lure-match effects. Similar findings have been reported by Cunnings et al. (2015).

(2) a. The troop leaders that no boy scout had no respect for had scolded her after the incident at scout camp.
    b. The troop leaders that no girl scout had no respect for had scolded her after the incident at scout camp.

To capture these patterns, Kush et al. suggest that a single feature (“±accessible”) is used to mark the status of potential referents. NPs are always +accessible for retrieval. QPs, however, start out as +accessible, but are switched to -accessible as soon as the parser detects the edge of their c-command (or scope) domain.
This implementation avoids several problems from Alcocer and Phillips (2012): the fan effect would be reduced, since only NPs and QPs carry the accessible feature, and feature update requires minimal computation (unlike in some cases of “command-path”). This comes at the cost of generality: this approach would not be able to rule out binding by non-c-commanding NPs.

Before we turn to our experiments, we briefly address one more issue. As we have discussed, the findings of Kush et al. (2015) do not allow us to decide whether accessible tracks c-command or scope, because these two things are perfectly confounded in their stimuli. However, recent experiments by Moulton and Han (toappear) may disambiguate between these two possibilities. They suggest that at least in some cases (including configurations examined by Kush et al. (2015)) c-command is the relation tracked by the parser. They use a feature mismatch paradigm to see whether scoping but not c-commanding QPs would be accessed during pronoun resolution. They show that in configurations like (3a), where the QP “each boy” does c-command the pronoun, people are faster to read the pronoun if it matches the QP in gender. However, in (3b), where the QP scopes over but does not c-command the pronoun, no such effects were observed. This may mean that only c-commanding QPs are considered by the parser.

(3) a. It seems each boy brought fresh water from the kitchen quickly right before he/she went on an early break.
    b. After each boy brought fresh water from the kitchen quickly it seems that he/she went on an early break.
    c. After the boy brought fresh water from the kitchen quickly it seems that he/she went on an early break.

A follow-up experiment ruled out the possibility that the lack of the gender mismatch effect from scoping but not c-commanding QPs is related to the reduced prominence of QPs within adjunct clauses: referential NPs in the same position (as in (3c)) did elicit gender mismatch effects.
Moulton and Han (toappear) interpret these results as meaning that scope alone cannot explain the constraints on QP antecedents, and that c-command is an important component of these constraints.[3]

[3] They also provide results which are hard to explain if we think that c-command alone is used. They show in an interpretation study that in both (3a) and (3b) people choose the co-varying interpretation equally frequently, about 60-65% of the time. Whether the QP does or does not c-command the pronoun does not seem to affect the judgments.

3.2 Outline of the experiments

The distinction between the general approaches discussed by Alcocer and Phillips (2012) and the specific one by Kush et al. (2015) brings us back to the interpretation of the data from Parker and Phillips. Recall that they observed lure-match effects from non-c-commanding lures in configurations like (4), which prima facie could be used as evidence for c-command acting as a violable constraint. However, this interpretation depends on the encoding schema we choose. If a general strategy of the sort discussed by Alcocer and Phillips (2012) is used, the conclusions of Parker and Phillips (2017) will be supported.[4] If, on the other hand, the parser relies on a task-specific approximation à la Kush et al.'s accessible, the lure-match effects from Parker and Phillips's Exp.2 would rather indicate that the parser faithfully follows whatever encoding is available to it. Since NPs are always +accessible, the parser would have a legitimate “right” to choose them as antecedents, given that locality constraints have already been overridden. That is, the choice is between a parser which follows a faithful representation less than perfectly, and a parser which faithfully follows a less than perfect representation.

[4] To the degree that we can also rule out Sloggett's (2017) explanation.

(4) The soothing tea [ that the nervous students drank ] calmed themselves down after the test.

I try to tease these two possibilities apart in three experiments. I do this by looking at QP lures in c-commanding and non-c-commanding positions. If the system is relying on domain-general c-command information which sometimes fails to perfectly guide retrieval, we will expect to observe lure-match effects from the QPs regardless of where they are located. If, on the other hand, the system is faithfully following a more domain-specific feature like accessible, the c-command status of the lures will determine the outcome: we would only expect to observe lure-match effects from c-commanding QP lures.

The presence or absence of lure-match effects from QP lures may also be informative for Sloggett's (2017) model. Empirically, QPs appear to be acceptable as logophoric antecedents across languages (we have been able to find examples from Icelandic (Sells, 1987, p.467), Chinese (Huang and Liu, 2001, p.165), and Yoruba (Adesola, 2006, p.2091)). As (5) shows,[5] QP antecedents are also acceptable in the sub-command configuration in Chinese. Given these facts, the S2017 account would predict lure-match effects from c-commanding lures and from sub-commanding lures embedded under an inanimate noun. For sub-commanding lures embedded under animate nouns (as in our experiments), no lure-match effects should be observed. (Notice that our Exp.1 has already suggested that the sub-command explanation is not on the right track. However, since the evidence there was mostly based on numerical patterns, we discuss the corresponding predictions of the S2017 account as well, to be able to determine whether further evidence is (in)compatible with it.)

[5] I would like to thank Nick Huang, Chia-Hsuan Liao and Yu'an Yang for these judgments.

(5) a. mei   yi-ming jizhe_i  xie   de  baodao dou      hai-le   ziji_i
       every one-CL  reporter write MOD report ADVQuant harm-PFV self
       ‘The report that every reporter_i wrote harmed him_i.’
    b. mei   yi-ming yuangong_i huode   de  huahong zuizhong   dou      hai-le   ziji_i
       every one-CL  employee   receive MOD bonus   ultimately ADVQuant hurt-PFV self
       ‘The bonus that every employee_i received ultimately hurt him_i.’

A potential counter-example to the cross-linguistic facts above comes from Postal (2006), who briefly discusses examples like (6).[6] These examples suggest that in English, long-distance antecedents of reflexives cannot be quantificational. If we think that English-specific evidence should receive priority over cross-linguistic evidence, the S2017 account would predict no lure-match effects from QP lures regardless of their position.

[6] Caps indicate strong stress.

(6) a. *Every/No woman_1 claimed that HERSELF_1, most people could never understand.
    b. *Every/No woman_1 claimed that they had praised no one but herself_1.
    c. *Every/No woman_1 claimed that they had defaced carvings of herself_1.
    d. *Every/No recent president_1 claimed that the Queen was inferior to himself_1.
    e. *Every/No woman_1 claimed that it was HERSELF_1 that people should vote for.
    f. *It was HERSELF_1 that every/no woman claimed that people should vote for.
    g. *Every/No woman_1 claimed that there was still HERSELF_1 for people to vote for.
    h. *Every/No woman_1 claimed that Bob wanted to interview Carl and herself_1.
    i. *Every/No woman_1 claimed that as for herself_1 she would prefer to drink beer.
    j. *Every/No waitress_1 claimed that workers like herself_1 deserved a raise.

The rest of the chapter reports three experiments. In the first one I consider configurations like (7a), to determine whether QP lures embedded within relative clauses, i.e. in a non-c-commanding position, provoke lure-match effects. To preview the results, I did not find any interference from QP lures, supporting the possibility that a process-specific feature like accessible is being used.
The interpretation of these results is complicated by the fact that NP lures did not provoke any lure-match effects either, contra the results reported in Parker and Phillips (2017). To determine possible reasons for this, in the next two experiments I look at c-commanding lures, as in (7b). If accessible is, indeed, in use, we should observe interference effects. Indeed, I did observe numerical patterns suggestive of lure-match effects in both QP and NP stimuli, although these patterns were not supported by statistical analysis. I suggest several possible reasons for this, which lead to the follow-up experiments discussed in Chapter 4.

(7) a. The stuntmen [ that no/the actress had worked with ] introduced herself to the rest of the cast.
    b. No/the understanding doctor would complain that expectant mothers distress himself during stressful medical examinations.

3.3 Experiment 2

The goal of my first experiment is to check whether the parser faithfully obeys structural relational constraints as specified in the grammar; in particular, whether it obeys the c-command/scope restrictions on QP antecedents. I contrast two hypotheses. The first one is that c-command/scope constraints are implemented only approximately (in particular, I adopt Kush et al.'s (2015) accessible feature hypothesis), but the parser strictly follows this approximation. If this hypothesis is true, I will not observe lure-match effects from non-c-commanding/non-scoping QP antecedents. The second hypothesis is that the parser does not follow the c-command/scope constraints, regardless of whether they are implemented approximately or in strict accordance with grammatical generalizations.[7] If this hypothesis is correct, I will observe lure-match effects from non-c-commanding/non-scoping QP lures.[8] Only the second scenario would constitute strong support for the claim by Parker and Phillips (2017) that structural features are of limited use to the parser when it has to select suitable antecedents.
[7] A third possible hypothesis, namely that the c-command/scope constraints are implemented in full accordance with the grammar and the parser DOES faithfully obey them, appears to be ruled out by Parker and Phillips (2017) Exp.2.

[8] Notice that we would have made the same prediction if we assumed that c-command/scope information is never used in reflexive resolution at all. However, a priori this possibility seems unlikely, and I do not pursue it further.

3.3.1 Participants

32 members of the University of Maryland community (13 M, 19 F; age range: 18-28; mean age: 20.2, SD: 2.15) participated in the experiment for class credit or a payment of $10. The experimental session took about one hour, including setup and calibration. Data from 7 participants were excluded: one because of a software failure; two because they did not manage to complete the experiment in an hour; two because preprocessing suggested that the data were extremely noisy; one because of an experimenter's error; one due to low question answering accuracy (64% correct).

3.3.2 Materials

Critical sentences were modeled after Parker and Phillips (2017, Exp.2) and Kush et al. (2015, Exp.2).[9] An example of a critical sentence is given in (8). An example of a full set of stimuli along with the accompanying contexts (see details below) is given in Table 3.1.

[9] I would like to thank Julia Buffinton for her great help with constructing the materials and providing native speaker judgments on them.

(8) The stuntmen that the actress had worked with introduced herself to the rest of the cast.

In all critical sentences the target was the subject of the matrix clause; the lure was the subject of a relative clause modifying the target; the reflexive was the object of the matrix verb. Thus, the lure is neither local relative to the reflexive nor c-commanding it. The study used a 2x2 factorial design with lure type (NP or QP) and lure match (the lure matching or mismatching the anaphor in gender) as factors.
The target always mismatched the anaphor in gender and number; thus, all the critical sentences in the experiment were ungrammatical. I used nouns with both definitional (“woman”) and stereotypical (“secretary”) gender. The gendered nouns came partly from previous studies (Parker and Phillips, 2017; Osterhout et al., 1997), and partly from native speaker judgments. Target NPs were always plural, lures were always singular. Half of the reflexives were masculine, and half were feminine.

QP lures were always created with the negative quantifier “No”. This quantifier was chosen to make sure that binding is indeed the only linking option (see Kush (2013); Kush et al. (2015) for a discussion of how other quantifiers, like “every”, sometimes allow for alternative readings which resemble binding, but do not behave exactly like it). Additionally, I made sure that the interpretation of the reflexive is not biased towards one of the potential antecedents: matrix verbs denote an action which can plausibly be performed by the target on the target itself (in case of grammatical binding of the reflexive; e.g. stuntmen can introduce themselves) and by the target on the lure (in case of ungrammatical binding of the reflexive; e.g. stuntmen can introduce the actor/actress to others).

The presence of a subject relative clause makes the sentences sound less natural outside of a context. It suggests that the matrix subject is selected from a larger class, in such a way that only the selected members have the property denoted by the relative clause. But this larger class is never mentioned explicitly, which may tax processing if people try to come up with it on the fly. This may distort the reaction times for the critical sentences and mask potential intrusion effects. To avoid this, I add two context sentences which explicitly introduce the larger class, to make the relative clause modification of the critical sentence subject more felicitous.
The contexts are structured in the following way. For the NP lures, we first indicate the existence of a group of people which the target NP will be related to (stuntmen in Table 3.1), and a salient individual (actor/actress). We then describe an action/attitude of that individual which subdivides the target group in two. In the example above, some stuntmen have worked with one of the actors/actresses, and some have not. This allows us to single out stuntmen based on the property of (not) having worked with the actor in question. For the QP lures, we indicate the existence of two groups of people (e.g., stuntmen and actors/actresses). Then, we describe a situation in which some (but not all!) members of the second group have a relation/attitude to some members of the first group. Importantly, there must be some members of the first group that no member of the second group has a relation to. E.g. in the QP lure conditions in Table 3.1 some stuntmen have worked with some of the actors, but there are stuntmen who haven't worked with any of the actors. The existence of this latter subgroup of stuntmen allows us to felicitously use a relative clause with a negative quantifier to single this subgroup out.

As I mentioned above, all of the critical sentences are ungrammatical. Thus, if I don't find any interference effect, I am not able to check whether the experiment worked as intended. In order to remedy this, I included a control set of 12 agreement attraction items, mostly adapted from Wagers et al. (2009) (see the description of Experiment 1 for more details). Thus, I had 3 items per condition.[10] Agreement attraction sentences were embedded into short contexts as well.

[10] This is fewer than usual, but the size of the experiment did not allow me to include more.

Table 3.1: Experiment 2 materials example. Targets are underlined, lures are bolded.

NP lures
  Context: The action movie had a very large cast, only some of whom knew each other from previous films, so they had a round of introductions. Most of the stuntmen had already worked with one of the actors/actresses, so they decided to talk first.
  Lure match: The stuntmen that the actress had worked with introduced herself to the rest of the cast.
  Lure mismatch: The stuntmen that the actor had worked with introduced herself to the rest of the cast.

QP lures
  Context: The action movie had a very large cast, only some of whom knew each other from previous films, so they had a round of introductions. Most of the stuntmen had already worked with some of the actors/actresses, but some of the stuntmen were novices, so they decided to talk first.
  Lure match: The stuntmen that no actress had worked with introduced herself to the rest of the cast.
  Lure mismatch: The stuntmen that no actor had worked with introduced herself to the rest of the cast.

Finally, I used a variety of filler types to make the manipulation less noticeable and to prevent people from adopting an unusual reading strategy, which is possible if they notice that all sentences with reflexives are ungrammatical. Similar to the other sentences in this experiment, filler sentences were embedded into short contexts. Some of the fillers contained grammatical violations which could occur in any of the three sentences of the text. Appendix E provides detailed information on the fillers used.

Overall, I had 24 experimental stimuli with reflexives, 12 control agreement stimuli and 66 fillers for a total of 102 items. Each item consisted of two context sentences and one critical sentence, for a total of 306 sentences. The grammatical-to-ungrammatical ratio was roughly 3.5:1 if counting individual sentences and roughly 1:2 if counting items (I count any item containing at least one ungrammatical sentence as ungrammatical). All of the items in the experiment were accompanied by a forced-choice Yes/No question to make sure that the participants were paying attention while reading.
3.3.3 Procedure

The procedure was mainly the same as in Exp.1, with the following minor differences. Since the stimuli consisted of three-sentence texts, they did not fit on a single line. The maximum length of a line fitting on the screen was 139 symbols, so all items were broken down into several lines of text (in most cases three, but sometimes more). I ensured that the critical and spillover regions did not occur immediately before or after a line break. Each experimental list started with 5 practice items.

3.3.4 Analysis

The analysis procedure was identical to that in Exp.1, with the following differences due to the design of the experiment. Separate models were fit to the data from reflexive and agreement conditions. Agreement data analysis was identical to that in Exp.3. The following fixed effects were specified for the model fit to the reflexive data (numerical values in parentheses indicate contrast coding coefficients; sum coding was used for all fixed effects): lure type (NP = -0.5 vs. QP = 0.5), lure match (match = -0.5 vs. mismatch = 0.5 in features with the reflexive) and their interaction. The random effects structure of the models was fully specified, including random intercepts and slopes for each fixed effect per subjects and items, following the recommendations of Barr et al. (2013). If the model with the maximal random effect structure did not converge, the random effects were dropped in the order specified in Table 2.2 (replacing target animacy with lure type to account for differences in the experimental design).

3.3.5 Results

Mean question answering accuracy was 92%. Figures 3.1 and 3.2 show the observed reading time means for agreement and reflexive stimuli respectively. Statistical analyses indicate that there is very little evidence for any effects: the only effect which reaches significance is the main effect of lure match in the agreement stimuli at the critical region in regression path.
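To make the sign convention of the ±0.5 contrast coding concrete, here is a minimal sketch. It assumes a balanced design and ignores the random effects of the actual mixed models; the log reading times are invented purely for illustration. Under these assumptions the coefficient for lure match equals the mean of the mismatch condition minus the mean of the match condition, so a positive coefficient corresponds to faster reading in lure match conditions.

```python
from statistics import mean

# Contrast coding as described in the text.
CODING = {"match": -0.5, "mismatch": 0.5}

# Hypothetical log reading times (invented for illustration only).
log_rt = {
    "match": [5.9, 6.0, 6.1, 6.0],
    "mismatch": [6.2, 6.3, 6.1, 6.2],
}

# Long-format predictor and response vectors.
x = [CODING[cond] for cond, values in log_rt.items() for _ in values]
y = [value for values in log_rt.values() for value in values]

# Ordinary least-squares slope and intercept for a single predictor.
mx, my = mean(x), mean(y)
slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
intercept = my - slope * mx

# With +/-0.5 coding in a balanced design, the slope is the condition difference
# and the intercept is the grand mean.
difference = mean(log_rt["mismatch"]) - mean(log_rt["match"])
```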
The positive coefficient means that, on average, logRTs are smaller (i.e. people are faster) in lure match conditions regardless of whether the target matches or mismatches the verb.

Figure 3.1: Experiment 2. RT means for agreement sentences. Columns correspond to eye-tracking measures: fp - first-pass, rp - regression path, rp - total times. Rows correspond to ROIs. Error bars represent standard error of the mean, adjusted for participant variability (Cousineau, 2005; Morey, 2008).

Figure 3.2: Experiment 2. RT means for reflexive sentences. Columns correspond to eye-tracking measures: fp - first-pass, rp - regression path, rp - total times. Rows correspond to ROIs. Error bars represent standard error of the mean, adjusted for participant variability (Cousineau, 2005; Morey, 2008).

Multiple comparison corrections. As in the previous experiments, I fit 12 models, thus the Bonferroni-corrected α-value remains the same: 0.004. The corresponding t-value for a two-sided test is ±2.87. If I apply this correction, no statistical comparison in the experiment reaches significance. Thus, after Bonferroni correction the intrusion effects I observed in agreement sentences do not receive statistical support.

3.3.6 Discussion

Agreement attraction stimuli. I used agreement attraction sentences to confirm that the experiment worked as intended. I did find evidence for the intrusion effect at the critical region: lure match led to faster RTs in regression path at the critical region, which would indicate that the manipulation was successful. As in Experiment 1, the evidence suggested that this facilitation happened in both grammatical and ungrammatical sentences. This further strengthens the conclusion of Hammerly et al. (draft.april.2018) that the absence of ungrammaticality illusions may be artifactual. That said, the grammatical-to-ungrammatical sentence ratio was higher than in Experiment 1 and much higher than in Hammerly et al.
(draft.april.2018) experiments, thus it is not clear whether their explanation in terms of grammaticality bias can be used to explain these data.

It may be worrying that statistical support for the lure-match effect in agreement attraction is not very strong: I only observe it in one region and one reading time measure, and even there it disappears if I apply the multiple comparisons correction. However, this is not inconsistent with previous eye-tracking studies. Both Dillon et al. (2013) and Parker and Phillips (2017) report facilitation in only one measure (total times and re-read times, respectively). It is also the case that I had fewer data points than the other studies: only 3 per condition per participant. This might have made it harder for the effects to reach statistical significance.

Overall, I conclude that the data from agreement sentences did provide evidence for an intrusion effect, which indicates that the experiment worked as expected and that it is possible to observe lure-match effects in my data.

Reflexive stimuli. Let us now turn to the main question of interest: intrusion effects in the reflexive sentences. I did not find any statistical evidence for an intrusion effect in reflexive resolution with QP lures. This fact alone could indicate that the parser faithfully follows constraints on binding by quantifiers, contra PP2017 and in line with Kush et al.'s (2015) accessible account. However, this interpretation is complicated by the absence of reliable interference effects in the sentences with NP lures. This may be worrying, since the stimuli were structurally identical to those used in Parker and Phillips (2017), and thus a priori I might expect to observe a similar effect. Thus, before drawing my final conclusions about the QP stimuli, I investigate the lure-match effects in the NP stimuli in more detail.

To better understand the alignment between my study and PP2017, I perform a set of exploratory analyses similar to those in Experiment 1.
Fig.3.3 shows the size of the interference effect in these exploratory analyses. Since there is not as much information in this figure, I change the plot layout: eye-tracking measures are plotted along the x-axis, rows correspond to ROIs, and columns to pre-processing variants. Averages for PP2017 data stay the same in both rows.

In the critical region my main pre-processing procedure yields interference effects which are consistently smaller than in the PP2017 data and which concentrate around 0. When I adopt their pre-processing procedure, the effect in my data shifts towards negative values (indicating facilitation) and the difference from PP2017 is reduced but does not disappear completely. The difference remains biggest in re-read times, the only measure in which PP2017 did find statistically significant facilitation. In the spillover region, my effects consistently concentrate at or above zero regardless of the analysis procedure. If anything, this would be indicative of inhibition, rather than the facilitation observed in the PP2017 data.

Overall, Fig.3.3 suggests that in my data the effect of lure match in reflexive sentences is consistently smaller than in PP2017. There are several possible reasons why this could be the case. First, it could be that my experiment did not work at some basic level. However, I do not think this is the case. Basic reading effects are visible in the data: people's reading times increase and the proportion of first-pass skips decreases as the length of the words increases. Question answering accuracy is reasonably high: except for one person whom I excluded from the analysis, everybody has at least 83% correct answers, and most people are above 90% accuracy. Finally, I did observe the expected patterns in the agreement attraction sentences, although statistical evidence for them was limited. Second, it could be that the presence of sentence contexts has affected people's reading strategies. E.g. it might have made people process the sentences more deeply and be more discriminating in the choice of antecedents. Or it might simply have led to higher fatigue: people had to read three times more sentences compared to the PP2017 experiment. Third, it might be that the intrusion effects from non-c-commanding lures are more fragile, e.g. because it is harder to ignore c-command information.

Figure 3.3: Experiment 2. Comparisons of the interference effect magnitudes in reflexive conditions with Parker and Phillips (2017). Interference effect is calculated as the difference between lure match and lure mismatch conditions. Error bars represent standard error of the difference of the means (calculated under the assumption that RTs in lure match and lure mismatch conditions are not correlated. This is likely false, since they come from the same subjects, thus, the SEs are overestimates).

However, most of these concerns are assuaged by the data from Experiment 1. It relied on the exact same stimuli as PP2017 did, with no additional contexts, and still observed very small numerical effects of interference. Fig.3.4 shows the magnitude of the interference effects in NP and QP lure conditions from the current experiment and from Experiment 1. As elsewhere in the thesis, I look at two pre-processing variants in order to check whether they have any noticeable effect on the conclusions I make.

Figure 3.4: Experiment 2. Comparisons of the interference effect magnitudes in reflexive conditions with Parker and Phillips (2017). Interference effect is calculated as the difference between lure match and lure mismatch conditions. Error bars represent standard error of the difference of the means (calculated under the assumption that RTs in lure match and lure mismatch conditions are not correlated. This is likely false, since they come from the same subjects, thus, the SEs are overestimates).
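The remark in the captions about overestimated standard errors can be illustrated with a short sketch. The by-participant means below are invented for illustration and are not data from any experiment reported here. The sketch computes the interference effect as lure match minus lure mismatch, then compares the standard error obtained under the independence assumption, $\sqrt{SE_{match}^2 + SE_{mismatch}^2}$, with the standard error of the within-participant differences.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical by-participant mean RTs in ms (invented for illustration).
match = [300, 350, 400, 450, 500]
mismatch = [320, 360, 430, 460, 530]
n = len(match)

# Interference effect: lure match minus lure mismatch (negative = facilitation).
effect = mean(match) - mean(mismatch)

# SE of the difference assuming the two conditions are uncorrelated.
se_match = stdev(match) / sqrt(n)
se_mismatch = stdev(mismatch) / sqrt(n)
se_unpaired = sqrt(se_match ** 2 + se_mismatch ** 2)

# SE of the difference computed from within-participant differences.
differences = [a - b for a, b in zip(match, mismatch)]
se_paired = stdev(differences) / sqrt(n)
```

Because participants who are slow in one condition tend to be slow in the other, the two sets of means are positively correlated, and the unpaired estimate is much larger than the paired one in this toy example.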
I will compare the stimuli most similar between the two experiments: those with animate matrix subjects. Unfortunately, the results do depend on the pre-processing procedure. If one chooses the procedure from S2017 (extended critical region, missing values removed), the interference effects from the current experiment are smaller in magnitude than in Experiment 1 by roughly 50 ms, which puts them in the vicinity of zero. If, on the other hand, one chooses the PP2017 procedure (critical region comprises only the reflexive, missing values are replaced with zeros), the effects from the two studies align rather closely. Perhaps the only conclusion I am comfortable making here is the following. To the degree that we believe that Experiment 1 provided evidence for interference in conditions with an animate matrix subject, we should also believe that the smaller interference effects in the current experiment do not stem simply from the fact that I used animate matrix subjects. I.e. it is unlikely that the sub-command hypothesis of S2017 could be used to explain the reduced magnitude of the effects.

Returning to the QP stimuli, their lure-match effects quite consistently go in the opposite direction compared to the NP stimuli in both the current experiment and Experiment 1. In most cases, lure-match effects from QP lures are around 0 or positive (indicating inhibitory interference), while lure-match effects from NP stimuli are generally negative.[11] I interpret this as suggesting that NP and QP lures differentially affect processing. E.g. it could be the case that people can access non-c-commanding NP lures, but not QP lures. If these conclusions are on the right track, it would mean that the PP2017 claim about structural cues failing to categorically restrict dependency resolution is overly general and that some structural information does play a gating role. It may be that a feature like Kush et al.'s (2015) accessible is used to approximate c-command relations.
In combination with my previous conclusions about the presence of lure-match effects in “animate” conditions, it may mean that some kinds of structural information, such as (approximated) c-command, can accurately guide retrieval, while other kinds (like locality) cannot.

[11] And in cases when they do float around zero, the QP lure-match effects are still different, going in the positive direction.

In order to validate these conclusions, in the next two experiments I turn to configurations with c-commanding NP and QP lures. If my conclusions about the representation of c-command are correct, I expect to observe lure-match effects from both NPs and QPs.

3.4 Experiment 3

3.4.1 Participants

38 members of the University of Maryland community participated in the experiment for class credit or a payment of $10 (13 M, 25 F; mean age: 20.2, SD: 1.2). The experimental session took around 45 minutes on average, including setup and calibration. Data from 3 participants were excluded due to poor quality (a high number of trials with trackloss / missing data points in the critical regions).

3.4.2 Materials

24 experimental sentences, modeled after Parker and Phillips (2017, Exp.3),[12] were included in the experiment. An example of a full set of stimuli is given in Table 3.2.

[12] Many thanks to Hanna Muller for helping with the stimuli construction and providing native speaker judgments.

Several modifications were made to the design in order to address the concerns from Experiment 2. The main idea was to create conditions which would maximally favor the resolution of the dependency to the QP lures. First of all, I put the lures in a position which makes them good binders from the grammatical point of view: the subject position of the matrix clause, with both the target and the reflexive being inside a complement clause.
This configuration ensures that the QP c-commands the reflexive, and the only thing stopping the QP from being linked to the reflexive is the locality restriction on the reflexive. Second, I removed the accompanying contexts, shortening the experiment and reducing the risk of participants being overstrained. As in the previous experiment, all experimental stimuli with reflexives are ungrammatical.

The properties of the sentences were similar to those in Experiment 2. All lures and reflexives were singular, and targets were always plural. Matrix verbs were verbs of speech or belief. In the QP conditions, matrix predicates had the form “would V” (“No X would V that. . . ”), while in the NP conditions, the predicates were of the form “did not V” or “would never V”. This was done to make the sentences in the two sets more parallel in meaning. There was an equal number of feminine and masculine reflexives. As in the previous experiment, I tried to make sure that the target and the lure are equally plausible antecedents for the reflexive, given the verb of the embedded clause. E.g. in Table 3.2 expectant mothers can plausibly distress themselves (target antecedent) or the doctor/nurse (lure antecedent). Spillover regions were aligned in structure. The first word after the reflexive was always a preposition, optionally followed by a determiner, then by an adjective or nominal modifier. I made sure that spillover regions did not contain null elements or gaps as in “. . . humiliated himself trying to get a spot”, since the presence of a gap could initiate additional memory access and retrieval operations.

I also tried to make the sentences as natural out of context as possible. In particular, I tried to construct situations in the following way: they describe something that a certain group would never do or think. The group has to be selected in such a way that this generalization naturally follows from the membership in the group alone. E.g.
we can expect that understanding doctors will be patient; this does not necessarily hold for all doctors. We found that such subgroups are often well defined by adjectives which describe a function (flower girls, cleaning ladies), social status / power (powerful, influential), common sense (reasonable, discrete). As in the previous experiment, I include an additional set of agreement at- traction sentences. The same conditions as in Experiment 2 were used, but the overall reduction of the experiment size allowed us to include 24 control items (6 per conditions). Additionally, I used 96 filler sentences of different types. They included sentences with grammatical reflexives bound by NP and QP antecedent and sentences with QP subjects. All of the fillers were grammatical. Overall, the experiment had 144 sentences and grammatical-to-ungrammatical ratio was 3:1. All sentences were accompanied by forced-choice Yes/No questions. 3.4.3 Procedure The procedure was identical to Experiment 1. 3.4.4 Analysis Same analysis procedure as in Experiment 2 was used. 126 Figure 3.5: Experiment 3. RT means for agreement sentences. Columns correspond to eye-tracking measures: fp - first-pass, rp - regression path, rp - total times. Rows correspond to ROIs. Errors bars represent standard error of the mean, adjusted for participant variability (Cousineau, 2005; Morey, 2008) Figure 3.6: Experiment 3. RT means for reflexive sentences. Columns correspond to eye-tracking measures: fp - first-pass, rp - regression path, rp - total times. Rows correspond to ROIs. Errors bars represent standard error of the mean, adjusted for participant variability (Cousineau, 2005; Morey, 2008) 127 NP lures Grammatical, Lure match The understanding doctor would not complain that expectant mothers distress himself during stressful medical examinations. Grammatical, Lure mismatch The understanding nurse would not complain that expectant mothers distress himself during stressful medical examinations. 
QP lures
Grammatical, Lure match: No understanding doctor would complain that expectant mothers distress himself during stressful medical examinations.
Grammatical, Lure mismatch: No understanding nurse would complain that expectant mothers distress himself during stressful medical examinations.

Table 3.2: Experiment 3 materials example. Targets are underlined, lures are bolded.

3.4.5 Results

Mean question answering accuracy was 94%. Figures 3.5 and 3.6 show the observed mean reading times for agreement and reflexive stimuli, respectively.

Starting with the control agreement sentences, I did find statistical support for multiple effects. In the critical region, the main effect of target match reached significance in all three eye-tracking measures; the positive coefficient indicates that on average target match conditions were read faster than target mismatch conditions. This is the grammaticality effect. This main effect was qualified by a significant target match × lure match interaction in regression path. Pairwise comparisons confirmed that this effect was driven by the classical agreement attraction pattern: lure match conditions were read faster only within target mismatch conditions. I also observed a trend for this interaction in total times, but it did not reach significance there. In the spillover region, I observed a significant main effect of target match and a significant target match × lure match interaction in regression path. These effects receive the same interpretation as in the critical region.

Turning to reflexive conditions, I found no reliable evidence for any effects, as in Experiment 2.

Multiple comparison corrections

I follow the same logic as in Experiment 2, correcting for 12 comparisons, which results in a critical t-value of ±2.87. All the main effects reported above survive this correction, and neither of the interactions does.
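The ±2.87 threshold quoted above is consistent with a two-sided Bonferroni cutoff at α = 0.05 over 12 comparisons, provided the t-statistic is treated as approximately standard normal. The sketch below illustrates that arithmetic only; the normal approximation, the α level of 0.05, and the function name are assumptions made here for illustration and are not taken from the analysis scripts used in the thesis.

```python
from statistics import NormalDist


def bonferroni_critical_value(n_comparisons: int, alpha: float = 0.05) -> float:
    """Two-sided critical value after Bonferroni correction.

    Assumes the test statistic is approximately standard normal
    (an illustrative assumption, not the thesis's stated procedure).
    """
    # Bonferroni: split the familywise alpha evenly across comparisons.
    per_test_alpha = alpha / n_comparisons
    # Two-sided test: half of the per-test alpha goes in each tail.
    return NormalDist().inv_cdf(1 - per_test_alpha / 2)


critical = bonferroni_critical_value(12)
print(f"Critical value for 12 comparisons: +/-{critical:.2f}")
```

Under these assumptions the cutoff for 12 comparisons comes out at approximately 2.87, against 1.96 for a single uncorrected comparison.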
3.4.6 Discussion

Agreement attraction stimuli

As in Experiment 2, I used agreement attraction sentences to confirm that the experiment worked as intended. This time I obtained stronger support for this being the case. First of all, I observed reliable grammaticality effects at the critical region in all eye-tracking measures. More importantly, I also observed the target match × lure match interaction in regression path at the critical and spillover regions, which is indicative of the presence of agreement attraction. Together with the trend in total times, I take this as evidence that people did behave as expected, at least with subject-verb agreement. I note that these interactions do not survive Bonferroni correction for multiple comparisons, but I return to further discussion of this issue after presenting the results for reflexive sentences.

Reflexive stimuli

The main question I addressed in this experiment was: do c-commanding QP lures give rise to lure-match effects? There is very little indication that they do. Even if we ignore the lack of statistically reliable effects and look at the numerical magnitudes of the interference effects, most of them are very close to zero, with the exception of total times, where I observe a lure-match effect of roughly 50 ms. The magnitudes of the interference effects are rather similar between the stimuli with NP and QP lures, suggesting that they were treated similarly by the parser.

The interpretation of these results is complicated by the fact that I observe almost no evidence for lure-match effects from NP lures, contrary to what PP2017 and S2017 report. In this respect, this experiment is similar to Experiments 1 and 2; but now this lack of effects may be more worrying: both PP2017 and S2017 consistently observed lure-match effects from c-commanding lures in multiple experiments.
This again raises the possibility that my current experiment had some confounds which affected the manifestation of the lure-match effects. If this is the case, the data may be uninterpretable.

As before, I conduct more detailed comparisons with the original study (Parker and Phillips (2017) Experiment 3). This exploratory analysis followed the outline and modifications to the pre-processing procedure described in Experiment 2. I also included data from S2017 Experiment 1 (conditions with speech verbs) in the comparison, since their stimuli are structurally identical to mine. The comparisons are presented visually in Fig. 3.7. We can observe that regardless of the pre-processing procedure, interference effects for NP lures at the critical region are consistently smaller than in the original studies by PP2017 and S2017.

Figure 3.7: Experiment 3. Comparisons of the interference effect magnitudes in reflexive conditions with Parker and Phillips (2017). Interference effect is calculated as the difference between lure match and lure mismatch conditions. Error bars represent standard error of the difference of the means (calculated under the assumption that RTs in lure match and lure mismatch conditions are not correlated. This is likely false, since they come from the same subjects; thus, the SEs are overestimates).

As in Experiment 2, I can readily rule out the possibility that the experiment did not work at a basic level: I did observe the expected effects of length on RTs and first-pass skipping probability. I also did find evidence for agreement attraction, suggesting that the experimental procedure was fine and allowed me to detect lure-match effects. However, my stimuli did include an important confound. I did not control the type of embedding verb, and only half of the verbs were speech verbs.
This may be critical: as S2017 demonstrates, in configurations similar to mine lure-match effects are only observed with embedding verbs of speech, and not with verbs of perception. If in my stimuli lure-match effects were effectively elicited by only half of the experimental sentences, this could have reduced the average magnitude of the effect. I rule this confound out in the next experiment and delay further discussion of patterns in QP sentences until I have done that.

3.5 Experiment 4

3.5.1 Participants

40 members of the University of Maryland participated in the experiment for class credit or a payment of $10 (19 M, 21 F; age range: 18-26; mean age: 20.3, SD: 1.44). The experimental session took around 45 minutes on average, including setup and calibration. I excluded data from 2 participants: one because of software issues, the other because of an extremely low proportion of correct responses (11%).

3.5.2 Materials

I had two sets of 24 experimental items, following the structure from Experiment 2. Examples of the items are given in Table 3.3. One of the sets had QP lures and was of main interest; the other had NP lures and was used as a control.

The QP set was based on the stimuli from Experiment 3, with two differences. First, all matrix verbs were verbs of communication. Most of the verbs (18 out of 24) were taken from S2017. The change of the verbs required changing some of the stimuli to maintain the naturalness of the scenarios. Second, in addition to ungrammatical sentences with the target mismatching the anaphor in two features, the experiment included grammatical sentences, with the target matching the anaphor. Thus, two factors were manipulated: target match (the anaphor either matches the target fully or mismatches it in gender and number) and lure match (the anaphor either matches the lure fully or mismatches it in gender). The NP set consisted of a subset of stimuli from S2017, adapted virtually verbatim[13]. Similarly to the first set, all verbs were speech verbs, and the same two factors were manipulated.

Overall, I had 24 sentences with QP lures (12 grammatical), 24 sentences with NP lures (12 grammatical) and 96 fillers (adapted from Experiment 2; all grammatical), for a total of 144 sentences. Thus the grammatical-to-ungrammatical ratio was 5:1. As in the previous experiments, all sentences were accompanied by a forced-choice Yes/No question.

[13] The only change was the following. In two sentences, I exchanged the genders of the targets in grammatical and ungrammatical conditions. E.g. if the original combinations of target-anaphor were “actor - himself” and “actresses - himself”, I replaced them with “actress - herself” and “actors - herself”. This was done to counterbalance the number of feminine and masculine reflexives in the experiment. I do not think that this change affects the results of the experiment, since presumably what matters is the gender match between the noun and the anaphor, not the lexical identity of the noun.

NP lures
Grammatical, Lure match: The worried doctor reported that the delirious hiker talked to himself in the operating room before the procedure.
Grammatical, Lure mismatch: The worried midwife reported that the delirious hiker talked to himself in the operating room before the procedure.
Ungrammatical, Lure match: The worried doctor reported that the delirious mothers talked to himself in the operating room before the procedure.
Ungrammatical, Lure mismatch: The worried midwife reported that the delirious mothers talked to himself in the operating room before the procedure.

QP lures
Grammatical, Lure match: No shy girl would mention that the prom queen embarrassed herself in the school cafeteria after class.
Grammatical, Lure mismatch: No shy boy would mention that the prom queen embarrassed herself in the school cafeteria after class.
Ungrammatical, Lure match: No shy girl would mention that the schoolyard bullies embarrassed herself in the school cafeteria after class.
Ungrammatical, Lure mismatch: No shy boy would mention that the schoolyard bullies embarrassed herself in the school cafeteria after class.

Table 3.3: Experiment 4 materials example. Targets are underlined, lures are bolded.

3.5.3 Procedure

The procedure was identical to Experiment 1.

3.5.4 Analysis

I used the same preprocessing procedure as in Experiment 1, and analyzed the same regions (critical, spillover) and measures (first pass, regression path, total time) of interest. The definitions of the regions were the same as in Experiment 1 for the QP stimuli; for the NP stimuli, I followed the region mark-up reported in Sloggett and Dillon (in prep). NP and QP stimuli were analyzed separately (since the materials were not lexically matched) with the same modeling procedure as in Experiment 3.

3.5.5 Results

Mean question answering accuracy was 90%. Fig. 3.8 shows the observed mean reading times.

Statistical analyses provide robust evidence for the grammaticality effect: the main effect of target match reached significance in regression path and total times across virtually all regions of interest for both NP and QP conditions (except regression path in the critical region of NP sentences). The positive sign of the coefficient indicates that on average, 2-feature target mismatch conditions were read slower than target match conditions. However, virtually no evidence for the interference effect was found. Two interactions reached significance: NP stimuli, first-pass times at the critical region, and QP stimuli, total times at the spillover region. Pairwise comparisons indicate that the first difference is likely driven by slower reading times in lure match conditions within target match conditions only.
In the second case neither of the pairwise comparisons was significant, and Figure 3.8 suggests that the interaction effect is driven by the cross-over pattern of the means: lure match conditions are slower in target match sentences, but faster in target mismatch ones.

Multiple comparison corrections

The number of models I fit in this experiment is the same as in the previous one; thus, the critical t-value for Bonferroni correction remains the same: ±2.87. Grammaticality effects mainly survive the correction (only three do not: NP stimuli, total time at the critical region; QP stimuli, first pass and regression path at the critical region). On the other hand, neither of the observed interactions (NP stimuli, first pass at the critical region, and QP stimuli, total time at the spillover region) survives the correction.

3.5.6 Discussion

The primary question that this experiment addressed was: is the type of the embedding verb responsible for the lack of reliable lure-match effects from c-commanding lures that I observed in Experiment 3? The answer appears to be negative. Despite ensuring that all of the embedding verbs were verbs of communication, and despite using a subset of stimuli from S2017, which have already been shown to provoke lure-match effects, I failed to find statistical support for interference effects.

The only two statistically significant effects I observed are not in line with PP2017 and S2017. The interaction effect in first-pass times at the critical region may serve as weak evidence for intrusion effects in NP sentences. However, the intrusion pattern observed there does not match that reported by PP2017 and S2017. If anything, it is rather more similar to the effects reported by Badecker and Straub (2002, Exp. 3) and Patil et al. (2016): lure-match conditions are read slower than lure-mismatch conditions in grammatical sentences. Such effects are predicted by some memory models of sentence processing (see, e.g.
Jäger et al., 2017), but it is debated in the literature whether they should be treated as evidence for memory retrieval problems (see section 1.2.2 for more details). For now, I remain uncertain whether these effects are meaningful and, if they are, why the same set of stimuli would give rise to facilitation in some studies and inhibition in others.

Similarly, the interaction I observed in total times at the spillover region for the QP sentences is likely driven by the crossover pattern of the means alone. Neither of the pairwise comparisons reaches significance, and the t-value for the interaction is right at the critical threshold. Incidentally, the numerical pattern of reaction times was similar to the one I observed in NP stimuli in first-pass times at the critical region: lure match conditions are slower in target match sentences but faster in target mismatch sentences. But since the pairwise comparisons did not reach significance, for now I prefer not to ascribe meaning to these patterns.

As in the previous experiments, I perform more detailed comparisons of my results with PP2017 and S2017. Fig. 3.9 shows the magnitude of the interference effect in 2-feature target mismatch conditions with NP lures. The picture is virtually identical to that in my previous experiments: at the critical region, the interference effect is much smaller in my case than in either PP2017 or S2017, and the choice of the pre-processing procedure has practically no effect. We can also notice that lure-match effects are somewhat more robust in comparison with Experiment 3: they more consistently reach a magnitude of at least 50 ms, and sometimes they get even bigger. This suggests that while the type of the embedding verb is not responsible for the lack of statistically significant results, it might still have played a role in reducing the average size of the interference effects.

Controlling for the embedding verb type allows for a cleaner comparison of the NP and QP stimuli.
As in Experiment 3, they align rather closely: interference effects go in the same direction and have comparable magnitudes, regardless of the choice of the pre-processing procedure. I take this as an indication that both lure types were treated similarly by the parser. That said, the similarity is not absolute: in regression path (both pre-processing procedures) and re-read times (S2017 pre-processing only), the interference effect for NP stimuli is twice as big as in QP sentences. This may indicate that for some reason QP lures are less accessible to the parser. However, given that this difference at least partially depends on the analysis procedure, and given that I do not have reliable evidence for interference within either pre-processing procedure, I prefer to remain conservative and conclude that the current dataset does not provide strong evidence for differential interference effects with NP and QP lures.

Overall, based on the numerical patterns I tentatively conclude that a) lure-match effects are provoked by NP and QP lures to roughly the same degree; b) this fact, together with the finding of no lure-match effect from non-c-commanding QP lures in Experiment 2, suggests that a feature similar to Kush et al.'s (2015) accessible categorically controls access to reflexive antecedents. These conclusions are somewhat dependent on my interpretation of the reduced magnitude of the interference effect in the stimuli with NP lures. If it turns out that I have missed some important confound, the conclusions above may not hold. I discuss these issues in more detail in the general discussion.

3.6 General discussion

In this chapter I have presented three experiments attempting to clarify Parker and Phillips's (2017) conclusions about the nature of structural constraints, focusing on c-command. I contrasted two possibilities: c-command is encoded in a process-agnostic fashion, or c-command is encoded as a process-specific feature, like Kush et al.'s (2015) accessible.
To differentiate between these possibilities, I investigated QP lures in both c-commanding and non-c-commanding positions.

The evidence for lure-match effects in sentences with QP lures was extremely weak across all three experiments. The only statistically significant effect which could be interpreted as evidence for interference was observed in Exp. 4 in total times at the spillover region. Even there the pairwise comparisons were not significant, so one cannot conclude that the overall interaction indeed reflected lure-match effects. Additionally, this interaction effect stopped being significant after Bonferroni correction for multiple comparisons. Thus, the best evidence for intrusion I have comes from the numerical pattern of the means.

In Exp. 2 I observed that the numerical patterns of the RTs were different for QP and NP lures. While NP lures mostly provoked facilitation, QP lures provoked inhibition (i.e. sentences with matching lures were read slower). I interpreted these patterns as indicating that the parser treats non-c-commanding NP and QP lures differently, and this may be evidence for the use of accessible. If this conclusion were correct, we should be able to observe interference from both NP and QP lures when they c-command the reflexive. The numerical patterns observed in Experiments 3 and 4 were consistent with this prediction: both NP and QP c-commanding lures appeared to elicit interference effects. The evidence was clearest in Experiment 4: lure-match effects in NP and QP stimuli had similar direction and magnitude, with the effects reaching 50 ms in most eye-tracking measures regardless of the pre-processing procedure I used. These patterns were somewhat weaker in Experiment 3: effect sizes were smaller and some of the effects were grouped around zero. However, the alignment between NP and QP stimuli was largely the same. I took this to mean that the parser treated the two types of lures similarly.
The weaker evidence in Experiment 3 could be explained by the properties of the stimuli I used: in Experiment 3 only half of the predicates I used were verbs of communication, while in Experiment 4 all of them were. Since S2017 showed that lure-match effects most consistently appear when the lures are subjects of communication verbs, the reduced proportion of such verbs in Experiment 3 could have reduced the average lure-match effect size. If these conclusions are correct, these data would support the hypothesis that accessible is used as a proxy for c-command in reflexive resolution.

A second possible interpretation of my QP data is that the effects are not real, and the numerical patterns I observe are due to random noise. This interpretation would basically mean that for some reason QPs are not efficient lures. If this is the case, my data cannot be used to draw any conclusions about how c-command information is encoded in the system. Exactly why QPs would be bad lures is not clear, but I discuss several speculative possibilities. The first has already been mentioned: QPs may be bad lures not inherently, but because they lead people to weigh structural information more highly during reflexive resolution. This variant assumes that the PP2017 account is correct, since on the S2017 account structural features already perfectly constrain retrieval to select the appropriate antecedents. It would be quite easy to test: one could look at sentences with quantificational targets, e.g. “The actress said that the/most directors like herself”. If the presence of a QP does indeed change processing strategies, we would only observe interference from “the actress” with referential targets (“the directors”).

A second possibility for why QPs might be bad lures assumes that the S2017 account is correct. In this case it may be that QPs make bad logophoric antecedents. E.g.
OPlog might not be able to efficiently track the kind of entities that QPs introduce in the discourse. If this is the case, QPs would not lead to interference in any of the configurations considered by S2017. This conclusion would be compatible with the data from Postal (2006) I mentioned in the introduction to this chapter, although it would go against cross-linguistic evidence.

Deciding whether the QP effects are real (and thus whether they should be taken into account at all) depends on the interpretation I assign to the evidence coming from the sentences with NP lures. The evidence for intrusion in the sentences with NP lures was also very weak, unlike in previous studies by PP2017 and S2017, but similarly to my Experiment 1. What do we make of this fact? There are several questions we need to address. First, do we have any reason to believe that the previous findings are not reliable? I do not think so. The only way I can think of in which a nonexistent effect could receive statistical support is a Type I error. While it is not inconceivable that all of the previous findings are due to Type I errors, it is rather implausible, since the intrusion effects look very similar in the previous experiments, at least in terms of direction and magnitude (the timing of the effects appears somewhat more subject to variability, for unclear reasons).

If the reliability of the previous findings is not a concern, the next question is: why did I not observe statistically supported evidence for intrusion effects in my study? To begin with, it could be that I have some basic problems with the experimental procedure and my data cannot be trusted at all. I do not think this is the case, for several reasons. On a basic level, people did exhibit basic reading effects (longer words provoking an increase in reading times and a decrease in first-pass skips). Question answering accuracy was generally high[14]. Next, I did find support for interference effects in subject-verb agreement sentences.
Finally, it is not the case that no statistically significant effects were observed at all. Reliable grammaticality effects were observed in Exp. 3 and 2, and this suggests that people were sensitive to the grammatical properties of the input.

[14] Mean response accuracies: Exp. 1: 92%, Exp. 2: 94%, Exp. 3: 90%.

If the general experimental procedure is fine, the next question is: are there any systematic confounds which could have weakened intrusion effects in reflexive stimuli, or made them disappear altogether? I have already ruled out two such confounds. First, numerically small effects in Exp. 2 from the lures embedded inside a relative clause could be due to the fact that the embedding nouns were animate. This would be consistent with the hypothesis advanced by Sloggett (2017): lure-match effects from lures within a subject relative clause represent a case of sub-command binding, similarly to ziji in Chinese (Huang and Liu, 2001). However, my results from Chapter 2 do not support this possibility: there, I observed lure-match effects from lures inside a relative clause regardless of the animacy of the embedding NP. Also notice that the sub-command explanation would not account for reduced lure-match effects in Exp. 3 and 4: in these experiments the lure in fact c-commanded the reflexive, being the subject of the matrix clause which embedded the complement clause containing the reflexive.

Second, I have considered the effect of the embedding verb type (speech vs. perception) on the lure-match effect from c-commanding lures. This factor appeared to be of some importance: Experiment 4, which controlled for it, showed more consistent and numerically bigger lure-match effects. However, even there the effects were smaller than in the previous studies. This is even more telling since I used a subset of the stimuli from S2017, which have been shown to produce lure-match effects.
Thus, the verb type alone cannot account for the reduction of the magnitude of lure-match effects in comparison with the previous results.

I identify two further possibilities, which I broadly classify as extra-linguistic context effects. The first possibility is that the mere presence of QP stimuli in the experimental materials might have affected lure-match effects in the whole experiment. E.g. the presence of QPs might have forced the parser to weigh structural features more highly, since they may be more important for dependencies involving QPs. The second possibility has to do with the properties of the experimental procedure, namely, the language status of the experimenter. In my case, the participants were instructed in non-native English, and it is possible that they adjusted their processing strategies. There exists evidence that such adjustments can in fact happen. E.g. in an EEG study on Dutch (Hanulíková et al., 2012), gender violations elicited a P600[15] if the stimuli were read by a native Dutch speaker, but not when they were read by a Turkish L2 speaker of Dutch. It is conceivable that something similar happened in my experiments, although additional qualifications are necessary to explain why grammaticality effects were observed, if we think that people corrected for the non-nativeness of the experimenter. One could say that the adjustments were selective, such that violations were ignored in only one of the two morphological features I manipulated (number and gender). In this case, the sentences would essentially behave as one-mismatch sentences, for which Parker and Phillips (2017) only found grammaticality effects. Another possibility compatible with Parker and Phillips's feature-weighting account would be that the adjustment effectively down-weighted morphological features; in this case, the speakers would have to rely solely on syntactic features, which would predict only grammaticality effects without interactions.
Yet another possibility is that repair processes are adjusted. It has been argued that agreement attraction arises as a result of repair and not initial misretrieval (e.g. Lago et al., 2015). The evidence is based on the fact that attraction seems to arise only in a small proportion of trials with the longest reaction times, while grammaticality effects are seen in a larger proportion of reaction times, including shorter ones. If people are able to selectively suppress repair (e.g. if they realize that with non-native speech repair will have to happen too often), we could expect to only see grammaticality effects. It is unclear whether this last explanation would work for reflexives.

[15] ERP component, often associated with grammatical violations.

Notice that the confounds discussed above only suggest that something might have gone wrong. They do not tell us whether the effect is real but weakened, or whether the confounds affected the experiment to such a degree that the effects were gone altogether and the numerical patterns represent just noise. To figure this out, I could compare the patterns of my results to the previous findings. Presumably, if the effects in my study are real, they would be similar to the effects observed in the previous studies. If my effects are not real and the patterns of the means just reflect random noise, we might expect the patterns to be incongruent, either between themselves, or in comparison with the previous literature, or both. In the latter case the random nature of noise does in principle allow for a situation where the effects are not real, but by chance alone the means pattern consistently with the previous studies. I did perform such comparisons, but will defer the discussion until the end of the next chapter, when more relevant data, including those from direct replications of previous studies, will have been discussed.
To summarize, I tentatively concluded that the data from QP lures indicate that c-command information is represented in an approximate way. However, these conclusions are uncertain due to the lack of statistical evidence for interference effects from NP lures, which may indicate that some unknown factor has confounded my results. If this is the case, it may not be safe to draw any conclusions about QP stimuli at all. The lack of lure-match effects in NP stimuli is also worrying, since it may indicate that the interference effects in reflexive resolution are spurious, or at least subject to a higher degree of variability than the PP2017 and S2017 results might lead one to believe. In order to better understand the reason for my non-replication, I attempt two direct replications of the PP2017 findings, which I discuss in the following chapter.

Figure 3.8: Experiment 4. RT means for agreement sentences. Columns correspond to eye-tracking measures: fp - first-pass, rp - regression path, tt - total times. Rows correspond to ROIs. Error bars represent standard error of the mean, adjusted for participant variability (Cousineau, 2005; Morey, 2008).

Figure 3.9: Experiment 4. Comparisons of the interference effect magnitudes in 2-feature target mismatch conditions with PP2017 and S2017. Interference effect is calculated as the difference between lure match and lure mismatch conditions. Error bars represent standard error of the difference of the means (calculated under the assumption that RTs in lure match and lure mismatch conditions are not correlated. This is likely false, since they come from the same subjects; thus, the SEs are overestimates).

Chapter 4: Replicability of lure-match effects in reflexive resolution

In the previous chapter I reported experiments aimed at understanding the role of c-command information in reflexive resolution. The focus was on the behavior of quantificational lures: I predicted that we would see lure-match effects only if PP2017 is correct.
I did not find any reliable evidence for such effects, while the numerical patterns indicated that the PP2017 approach may not be on the right track. However, these conclusions were weakened by the fact that I found no detectable interference from referential lures, using stimuli similar or identical to those used in the previous studies. It is not clear whether lure-match effects failed to receive statistical support because of some confounds which could have affected the effect sizes. This means that the QP results from the previous chapter cannot be used to draw strong conclusions about my main question until I have a better understanding of what could have caused the smaller lure-match effect magnitudes.

As I have discussed in the previous chapter, it is possible that the lack of lure-match effects is due to people adjusting their processing strategies in response to some property of the experiment, e.g. the presence of QPs in the materials or the non-nativeness of the experimenter. Ruling out these confounds is the primary goal of this chapter: Experiment 5 addresses the first concern, and Experiment 6 the second one.

In addition to helping with the interpretation of the earlier findings, these experiments may also be used to distinguish between the PP2017 and S2017 accounts. Adjustment of processing strategies in response to experimental context factors is relatively easy to explain if we adopt the PP2017 account: one could say that people re-weigh their retrieval cues, giving more priority to structural information, and with this new set of weights even a mismatch in two morphological features is not enough to outweigh structural features. On the other hand, the S2017 account would have a hard time accommodating this finding. Recall that the account postulates that lure-match effects arise because people retrieve a structurally accessible null operator in a proportion of cases.
In order to account for (hypothesized) effects of the experimental context, we would have to come up with a mechanism which would reduce the preference for choosing a logophoric interpretation depending on properties of the broader (extra-linguistic) context. The only straightforward way of doing this would be to somehow ensure that OPlog is less accessible for retrieval.

One way of doing it would be to lower the degree of match of OPlog with the retrieval cues. There are two ways of doing this: modifying the featural representation of OPlog and modifying the set of retrieval cues. The first way is rather implausible. OPlog is hypothesized to be φ-deficient and to only carry a structural feature, encoding both c-command and locality information. It is hard to imagine that OPlog will or will not include information about its clausehood depending on the context. The second way could in principle work; however, in this case, it would imply not using structural cues to guide retrieval. If this were to happen, lures could be accessed directly, and we would still expect to see lure-match effects even in contexts hypothesized to produce processing adjustments.

Alternatively, one could try to raise the activation of the overt target to the degree where it would always outcompete OPlog. Again, I do not think it is plausible that the featural composition of the target will be affected by the context, and it is not clear what additional cues specific only to the overt target could be added to the retrieval cue set. Thus, this possibility seems to fail as well.

Finally, one could hypothesize that the activation of OPlog and/or the overt target is changed because the change in experimental context affects their linguistic prominence (see the discussion in section 1.5.1.1).
I can see how the presence of QP sentences could do this: perhaps, in a sentence with a QP lure and an NP target, the latter becomes more prominent because it refers to a specific entity; then, the comprehenders would have to make this shift even for the sentences with NP lures. It is not clear to me how the (non-)nativeness of the experimenter would change the linguistic prominence of OPlog or overt targets.

4.1 Experiment 5

In this experiment I attempt a direct replication[1] of PP2017 Exp.3 in order to assess whether the presence of QP sentences in the stimuli set of my Exp.2-4 could have affected the results. I use their exact materials with minor corrections (typos etc., as specified below). I also collect a sample which is twice as big as theirs (48 vs. 24 people), which gives better statistical power and reduces the likelihood of over-estimating the magnitude of the effects.

[1] I would like to thank Dan Parker for generously sharing the eye-tracking scripts, materials and data from his studies.

4.1.1 Participants

52 members of the University of Maryland community (38F, 14M; age range: 18-25, average age: 20.1) participated in the experiment for class credit or a payment of $10. The experimental session took about one hour, including setup and calibration. Data from four participants were removed: two participants did not manage to complete the experiment in an hour, and two participants gave incorrect answers to more than 30% of comprehension questions.

4.1.2 Materials

Stimuli and fillers were adopted practically verbatim from PP2017 Exp.3 (I used the same eye-tracking script). Examples of the stimuli are given in Table 4.1. The only modifications I made consisted in correcting typos and inconsistencies in the stimuli sets (the full list of modifications can be found in Appendix D). Overall, the stimuli set had 36 experimental sentences and 72 fillers.
24 out of 36 experimental sentences were ungrammatical, and all fillers were grammatical, giving an ungrammatical-to-grammatical ratio of roughly 1:4.

Table 4.1: Experiment 5 materials example. Targets are italicized, lures are bolded.

Target match, Lure match: The talented actor mentioned that the attractive spokesman praised himself for a great job.
Target match, Lure mismatch: The talented actress mentioned that the attractive spokesman praised himself for a great job.
Target mismatch (1 feature), Lure match: The talented actor mentioned that the attractive spokeswoman praised himself for a great job.
Target mismatch (1 feature), Lure mismatch: The talented actress mentioned that the attractive spokeswoman praised himself for a great job.
Target mismatch (2 features), Lure match: The talented actor mentioned that the attractive spokeswomen praised himself for a great job.
Target mismatch (2 features), Lure mismatch: The talented actress mentioned that the attractive spokeswomen praised himself for a great job.

4.1.3 Procedure

The procedure was identical to Experiment 1.

4.1.4 Analysis

In my initial analyses, I followed PP2017 in the choice of regions and measures of interest and in the statistical comparisons. Although I do have concerns about their procedure, which I will discuss later, I decided to first follow it as closely as possible to create a baseline comparison between the results reported in PP2017 and my replication attempt. Three regions of interest (ROIs) were defined: precritical, which included the embedded subject and predicate (i.e. four words before the reflexive pronoun); critical, including only the reflexive pronoun itself; and spillover, including two words after the pronoun. Four eye-tracking measures of interest (MOIs) were analyzed: first-pass, right-bound, regression path and total times. Definitions were given earlier.
Prior to statistical analysis, the data were manually pre-processed using EyeDoctor[2] to remove blinks and ensure that fixations align with the text. Fixations shorter than 80 ms or longer than 1000 ms were automatically rejected before calculating eye-tracking measures using custom scripts[3]. Then, reading times were log-transformed and missing observations were replaced with zeros. A linear mixed effect model with target.match, lure.match and their interaction was then fit to the data for every ROI and MOI. Treatment coding was used, with the baseline level being target: match, lure: mismatch. I will refer to this model as the “full model”.

Additionally, a series of smaller models was fit to subsets of conditions. One model, including the same predictors, was fit to ungrammatical conditions only (the baseline level was recoded to target: one mismatch, lure: mismatch). The interaction term from this model was used to assess the difference in interference effect between one-mismatch and two-mismatch sentences. I will refer to this model as the “ungrammatical-only” model. Two more models corresponding to pairwise comparisons were fit separately to the ungrammatical target: one mismatch and target: two mismatch conditions, to assess simple main effects of lure match. No corrections for multiple comparisons were performed.

[2] https://blogs.umass.edu/eyelab/software/

[3] Available at: https://github.com/UMDLinguistics/EyePy. Notice that PP2017 relied on the previous generation of these scripts, so this pre-processing step was not exactly identical across the two studies.

Table 4.2: Order of random effect structure simplification, Exp.5. Effects were removed from the model, starting from the top.

target.match : lure.match | item
target.match : lure.match | participant
target.match | item
lure.match | item
target.match | participant
lure.match | participant

If the model with the maximal random effect structure did not converge[4], I simplified it by removing random slopes.
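As an illustration only, the simplification order in Table 4.2 can be sketched as a small loop. This is a Python sketch rather than the R code used for the analyses; the function name simplify_until_converged and the converges callback (a stand-in for refitting the model and checking its convergence diagnostics) are hypothetical.

```python
from typing import Callable, List, Tuple

# Random slopes in the order in which they are dropped (top of Table 4.2 first).
DROP_ORDER: List[str] = [
    "target.match:lure.match | item",
    "target.match:lure.match | participant",
    "target.match | item",
    "lure.match | item",
    "target.match | participant",
    "lure.match | participant",
]


def simplify_until_converged(
    converges: Callable[[List[str]], bool],
) -> Tuple[List[str], bool]:
    """Drop random slopes from the top of DROP_ORDER until the model converges.

    `converges` is a hypothetical callback: it receives the random slopes
    still in the model and reports whether a model with that structure
    converged. Returns the retained slopes and a convergence flag.
    """
    slopes = list(DROP_ORDER)
    while True:
        if converges(slopes):
            return slopes, True
        if not slopes:
            # Even the intercepts-only structure failed to converge.
            return slopes, False
        slopes.pop(0)  # remove the next effect, starting from the top


# Toy stand-in: pretend that models with at most three random slopes converge.
kept, ok = simplify_until_converged(lambda s: len(s) <= 3)
```

In this toy run the three interaction and item slopes at the top of the table are removed first, leaving lure.match | item and the two by-participant slopes.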
PP2017 did not specify their procedure in much detail, stating only: “random slopes for items or participants were removed”. To be internally consistent, I always dropped the random effects in the order shown in Table 4.2. As a final step, all models were automatically assessed to check whether any random effects had correlations of 1 or -1. The procedure was described in more detail in section 2.1.4.

[4] As determined by diagnostics reported by lmer. One has to access the internal structure of the model fit and look at the information contained in m@optinfo$conv$lme4, where m is the model object.

4.1.5 Results

Mean question accuracy was 91%. Observed patterns of mean reaction times are shown in Fig.4.1. Numerical values for the means are given in Appendix A.

Figure 4.1: Mean RTs in PP2017 (red) and my replication (blue). Error bars represent standard error of the mean. Conditions: tm/tomm/ttmm - target match/one-mismatch/two-mismatch; lm/lmm - lure match/mismatch. Columns correspond to eye-tracking measures: fp - first-pass, rb - right-bound times, rp - regression path, rr - re-read times. Rows correspond to ROIs: precritical, critical, spillover.

Fig.4.1 is laid out as follows. Columns correspond to reading time measures (see figure description for details). They are arranged in two separate sub-panels because the time scales are quite different between early and late measures (especially in the critical and spillover regions), and plotting them all in the same plot would make it harder to see the patterns; the sub-panelling does not have any other purpose except making the exposition clearer. The rows correspond to ROIs; we will be mostly interested in the central row, corresponding to the critical region (i.e., the reflexive). The X axis corresponds to different conditions (see figure description); each condition is associated with two RTs: red dots represent means from PP2017, blue dots - from the current replication.
We will be mostly interested in comparing the two rightmost conditions in each facet - they correspond to the 2-feature target mismatch conditions, where we expect to see interference effects. If PP2017 findings are replicated, we expect to see that the RT estimate is much higher in the rightmost condition, indicating that lure mismatch sentences are read slower than lure match sentences within the 2-feature target mismatch conditions.

I draw attention to two things in these plots. One is that there is less uncertainty in the replication, as suggested by the smaller error bars - this is to be expected, since I had twice as many participants as PP2017 did. Second, in both studies the estimates in the critical region follow similar patterns, but the replication estimates seem to be less extreme (i.e. they appear to be bigger than small PP2017 estimates and smaller than large PP2017 estimates). As a result, the numerical magnitude of the interference effect goes down in the replication, although it still appears to be present in the pattern of the means.

Figure 4.2: Interference effect in target: two mismatch conditions (lure match - lure mismatch) in PP2017 (red) and my replication (blue). Error bars represent standard error of the difference of the means. Columns correspond to ROIs. Rows correspond to eye-tracking measures: fp - first-pass, rb - right-bound times, rp - regression path, rr - re-read times.

Fig.4.2 shows the size of the interference effect in target: two mismatch conditions, calculated as the difference between lure: match and lure: mismatch conditions. Negative values indicate that lure: match conditions are read faster, an indication of facilitatory interference. Since there is not as much information in this figure, I change the plot layout: eye-tracking measures are plotted along the x-axis, and plot facets correspond to ROIs. The critical region - the central facet - is of most interest, since this is where PP2017 observed the interference effect most consistently.
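The quantity plotted in Fig.4.2 can be illustrated with a short Python sketch. The standard error is computed under the independence assumption stated for Fig.3.9; applying the same formula here is an assumption of this sketch, and the function name and the numbers in the example call are made up for illustration rather than taken from either dataset.

```python
import math
from typing import Tuple


def interference_effect(
    mean_match: float, se_match: float,
    mean_mismatch: float, se_mismatch: float,
) -> Tuple[float, float]:
    """Return (effect, standard error of the difference) in ms.

    effect = lure match mean - lure mismatch mean, so negative values
    indicate faster reading with a matching lure (facilitatory interference).
    The SE treats the two condition means as uncorrelated, which
    overestimates it for within-participant designs.
    """
    effect = mean_match - mean_mismatch
    se_diff = math.sqrt(se_match ** 2 + se_mismatch ** 2)
    return effect, se_diff


# Hypothetical condition means and SEs (ms), for illustration only.
effect, se = interference_effect(400.0, 30.0, 450.0, 40.0)
```

With these illustrative inputs the effect is negative, i.e. the lure match condition is the faster one.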
One can see that interference effects in the replication are much smaller than in the original PP2017 study. Even the biggest one, observed in re-read times, is roughly half the size of the corresponding estimate from PP2017.

Let us now turn to the results of the statistical analysis. For ease of comparison, I present model estimates from the two experiments side by side in Fig.4.3. The dots correspond to β̂ values (“Estimate” values from lmer output). In this figure I am mostly interested in showing which effects reach statistical significance (coefficients for which |t| > 2 are represented with square markers) and whether these effects survive multiple comparisons correction (triangles indicate coefficients which stop being significant after applying the correction; I discuss this issue further below). Intercept estimates are not represented, because they are numerically much bigger, and including them in the overall plot forces the x-axis to stretch, so that minor differences between other estimates become less noticeable.

As in Fig.4.1, columns correspond to eye-tracking measures, and rows to ROIs. The size of the estimate is displayed on the X axis (since the models were fit to log-transformed RTs, the estimates are numerically small). Different effects from the model are plotted along the Y axis (see figure description for more details).

I will focus my discussion on the critical region (central row), since this is where the lure-match effects were the clearest in PP2017, and this is what PP2017 base their conclusions on. For the full models, I can expect three estimates to reflect the interference effect:

• Main effect of lure match, which would indicate that reading times on the reflexive differ for matching and mismatching lures, if I compare target: match and target: two mismatch conditions;
• target match x lure match interaction for target: one mismatch conditions.
It would indicate that reading times are differentially affected by lure (mis)match in target: match and target: one mismatch conditions.
• Similarly for the target match x lure match interaction for target: two mismatch conditions.

Figure 4.3: Model estimates in PP2017 (red) and my replication (blue). First five effects come from the full model (in order: main effect of target match in 1-mismatch conditions; in 2-mismatch conditions; main effect of lure match; two interactions of lure match and target match for 1-mismatch and 2-mismatch conditions correspondingly); next two effects are the simple main effects of lure match; finally, the last effect is the interaction from the ungrammatical-only model. The order of the coefficients is the same as in PP2017, to make comparisons across papers easier. Error bars represent standard error of the estimate. Coefficients for which |t| ≥ 2 are represented with squares if they remain significant after a Bonferroni correction (with the corrected |t| = 3.34), and with triangles if they don’t.

In the replication data, only two effects of interest reach significance. The first one is the pairwise comparison within target: one mismatch conditions in regression path, indicating that the pronoun is on average read faster in lure: match conditions. The second effect is the pairwise comparison within target: two mismatch conditions in total times, which receives a similar interpretation. This is suggestive of lure-match effects. However, in neither of those cases does the interaction term for the full model or the interaction term for the ungrammatical-only model reach significance. Together with the absence of main effects of lure match, this suggests that the evidence for lure-match effects is at best weak.

Compare these results with PP2017 data. They do find significant target match x lure match interactions for target: two mismatch conditions in regression path and re-read times.
It indicates that in target: match and target: two mismatch conditions reading times are differentially affected by lure match. Pairwise comparisons within target: two mismatch conditions reach significance across all ROIs. This indicates that within target: two mismatch conditions the pronoun is read faster if it matches the features of the lure. Finally, the interaction term from the ungrammatical-only model also reaches significance across all ROIs. This indicates that the speed-up associated with matching lures has a different magnitude in target: one mismatch and target: two mismatch conditions. PP2017 interpret these statistics as providing evidence for two claims: first, that interference effects do exist, and second, that they only appear in ungrammatical sentences with the target noun mismatching the reflexive in two features. Generally, I agree with this interpretation. But before I proceed, I address two issues with the PP2017 analysis which might affect the conclusions I reach.

Lack of multiple comparisons corrections

Despite carrying out quite a lot of comparisons, PP2017 do not report any multiple comparisons correction. This is somewhat traditional in eye-tracking studies, but, as von der Malsburg and Angele (2017) argue, this is not optimal, and if one does run analyses for multiple ROIs and MOIs, these multiple analyses should be corrected for. The simulations in von der Malsburg and Angele (2017) indicate that the Bonferroni correction is not overly conservative and can be applied in such situations. I use it and assess whether PP2017's and my conclusions hold after a correction for multiple comparisons. For a given combination of ROI and MOI, PP2017 fit 4 models.
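The arithmetic behind a Bonferroni correction of this kind can be sketched in a few lines of Python. This is an illustration, not the code used for the analyses: the function name is hypothetical, NormalDist().inv_cdf from the standard library plays the role of a normal quantile function, and the example call uses α = 0.05 with 60 comparisons, the count that this section arrives at (5 inferences x 3 ROIs x 4 MOIs).

```python
from statistics import NormalDist
from typing import Tuple


def bonferroni_threshold(alpha: float, n_comparisons: int) -> Tuple[float, float]:
    """Return (corrected alpha, two-sided critical value).

    The critical value is approximated from the standard normal
    distribution, so it is applied to |t| as if t were a z-score.
    """
    corrected_alpha = alpha / n_comparisons
    # Two-sided test: half of the corrected alpha goes into each tail.
    critical = NormalDist().inv_cdf(1 - corrected_alpha / 2)
    return corrected_alpha, critical


corrected_alpha, critical = bonferroni_threshold(0.05, 5 * 3 * 4)
print(f"corrected alpha = {corrected_alpha:.5f}, critical |t| = {critical:.2f}")
```

The normal approximation is a design choice carried over from the text; with the large samples typical of mixed-model analyses it differs little from a t-based threshold.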
One could argue that the full model should be counted as two comparisons for the purposes of multiple comparison corrections: we are interested in three effects coming from this model, but we will be making only two inferences based on them: whether we have evidence for interference (main effect of lure match), and whether we have evidence for the interference differing between grammatical and ungrammatical conditions (two target match x lure match interactions). The ungrammatical-only model and the simple main effects models correspond to three more comparisons. Thus, we are making at least five inferences per combination of ROI and MOI. Overall, PP2017 analyze 3 ROIs and 4 MOIs, which leaves us with 5x3x4 = 60 comparisons. Given the original α = 0.05, the Bonferroni corrected α = 0.05/60 = 0.00083. The corresponding t-value[5] for a two-sided test is ±3.34.

For my data, neither of the significant effects remains significant after applying the correction. In contrast, for PP2017 data many of the significant comparisons reported for the critical region remain significant after this correction, except the following four (indicated with triangles in Fig. 4.3): two interaction effects for the full model in regression path and re-read times, and two interaction effects for the ungrammatical-only model in first-pass and re-read times[6].

[5] Approximated from the normal distribution following Jäger et al. (2015). The corresponding R command is qnorm(0.00083/2).

Choice of preprocessing procedures

The most notable pre-processing choice by PP2017 that I do not agree with is the decision to replace missing values with zeros. Although PP2017 are certainly not the first to make this decision (see, e.g., Sturt, 2003; Cunnings and Felser, 2013; Cunnings et al., 2014; Cunnings and Sturt, 2014), there are reasons for which it may be problematic, and I discuss some of them below. First of all, at the risk of stating the obvious, missing values are not the same as zero reading times.
In eye-tracking experiments, there are at least two ways in which missing values can arise: people did not fixate on a given region, or the fixation was not recorded due to technical difficulties. I may be willing to ignore the latter source of noise and assume that all missing values represent the lack of fixation. But even if we make this assumption, literally assuming that the dataset includes RTs of 0 ms does not really make sense. Fixations below 50 ms are likely non-existent in actual reading (Rayner, 1998, plot at p.376), and even if they did exist, the preprocessing procedure PP2017 used[7] removed any fixations shorter than 80 ms, so we do not expect to find any smaller values in the data.

[6] One of the main effects of target match in re-read times also stops being significant. However, we are not interested in this effect, so I do not discuss it further.

[7] PP2017 do not directly report doing that, but Parker (2014), which is the original source of these experiments, does. Given that the average RTs reported in both sources are the same, it is reasonable to assume that the pre-processing procedures were the same, or very similar, as well.

Now, one may argue that “the region was not read” is essentially the same as “the region was read for 0 ms”, and that the difference amounts to wording, but it does not. If we consider the way reading times may be generated in an eye-tracking experiment, this difference will be clearer. For a given word, people either make a fixation on it or not, with a certain probability. If people did fixate the word, they look at the word for a certain time. Of course, how these specific decisions are made is probably a very complicated process depending on multiple factors. But we believe that this general two-step description of the generative process is not unreasonable.
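The two-step description can be made concrete with a small simulation, given here as a Python sketch under stated assumptions: the probability of a re-read is a single fixed number, and re-read durations are drawn from a log-normal distribution. Both choices, along with all function names and parameter values, are illustrative assumptions and are not estimated from the data.

```python
import random
from typing import List, Optional


def simulate_rereads(
    n: int, p_reread: float, mu_log: float, sigma_log: float, seed: int = 0
) -> List[Optional[float]]:
    """Simulate n trials of the two-step process.

    Step 1: decide whether the region is re-read (probability p_reread).
    Step 2: if it is, draw a strictly positive duration in ms
            (log-normal, an assumption of this sketch).
    Trials without a re-read are recorded as None, i.e. missing.
    """
    rng = random.Random(seed)
    trials: List[Optional[float]] = []
    for _ in range(n):
        if rng.random() < p_reread:
            trials.append(rng.lognormvariate(mu_log, sigma_log))
        else:
            trials.append(None)
    return trials


def mean_with_zeros(trials: List[Optional[float]]) -> float:
    """Mean RT when missing observations are replaced with 0 ms."""
    return sum(0.0 if rt is None else rt for rt in trials) / len(trials)


def mean_observed_only(trials: List[Optional[float]]) -> float:
    """Mean RT when missing observations are rejected."""
    observed = [rt for rt in trials if rt is not None]
    return sum(observed) / len(observed)


trials = simulate_rereads(n=1000, p_reread=0.4, mu_log=6.0, sigma_log=0.4)
print(round(mean_with_zeros(trials)), round(mean_observed_only(trials)))
```

The zero-replaced mean equals the observed-only mean multiplied by the proportion of trials with a re-read, so the two summaries answer different questions whenever some trials are missing.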
So we are dealing with at least two distributions: one determining the probability of fixation, and another one determining the duration of a fixation, given that it has occurred[8]. So the statement “the region was not read” corresponds to a situation where one drew 0 from the distribution specifying the probability that the region is re-read[9], and the statement “the region was read for 0 ms” corresponds to a situation where one drew 1 from the probability of re-read distribution, and then drew 0 from the distribution of re-read times. Notice that this assumes that the distribution of re-read times includes 0, which it may not, given the data on fixation durations I mentioned above.

[8] It is also the case that not all missing values result from people skipping a region. E.g. sometimes they may result from a failure of the equipment to record a fixation. So one could also assume distributions for the probability of such events, e.g., specifying the probability of the eye-tracker losing track for a particular fixation.

[9] Assuming 0 corresponds to “no re-read”.

A better model for data of such structure would reflect the structure of the generative process and could be constructed in two steps: first, use logistic regression to predict whether a certain word will or will not be skipped; then, for the words which were not skipped, use linear regression or any other appropriate model to predict reading times for these words (see e.g. Gelman and Hill, 2007, p.126, for a similar approach).

It may be argued in response to the above that the goal of statistical analyses (in these particular experiments) was not to come up with a good model of the process that generated the data, but rather to capture some generalization about them, regardless of whether it’s ecologically valid. E.g. by collapsing missing and non-missing values in the same analysis, one might be asking the question: on average, how much time do people spend re-reading a given region?
Which is different from the question: given that people re-read the word, how much time on average did they spend doing so?

However, replacing missing values with zeros may have consequences for the inferences as well. Replacing missing observations with zeros will likely drag the estimates down and increase their uncertainty. Now, we do not know whether the values are missing at random. It may be, for example, that people are less likely to re-read the reflexive region in grammatical conditions. If this is the case, more zeros will be introduced to the RTs coming from grammatical conditions; and similarly for other possible differences in re-read rates between conditions. Ultimately, the value and the certainty of the model estimates may be differentially affected in different conditions, and it is not immediately clear whether and in what way this may influence the inferences. (To be fair, the same logic applies to rejecting the missing values: since we do not know whether they are missing at random, rejecting them can bias the conclusions we draw. But since replacing missing values with zeros introduces other potentially undesirable effects, I advocate rejecting them. Of course, this is not the only method: e.g. one could try to use imputation procedures of various degrees of sophistication. I do not know whether and in what way that would affect the conclusions I would make.)

An additional complication with 0 values is due to the fact that PP2017 conduct their analysis on log-transformed RTs. Due to the non-linearity of the log function, small changes on the log scale can translate to quite big changes on the linear scale. Fig.4.4 demonstrates this: it shows the difference between the observed RTs (red markers) and the RTs predicted by the model fit to the data with missing values replaced with zeros (blue markers)[10]. The figure layout is mostly analogous to Fig.4.1: experimental conditions are plotted along the X axis (see figure description for details), RT estimates are plotted along the Y axis. Columns correspond to eye-tracking measures. The only major difference from Fig.4.1 is in the rows: all the data are coming from the critical region, so now the rows represent datasets (the upper row corresponds to the data from the current replication; the lower one - to PP2017 data).

[10] In order to generate model predictions, I do the following. First, for each fixed effect defined in the model I obtain a sample from Normal(µ̂, σ̂), where µ̂ and σ̂ are the estimate and standard error of the estimate, provided by the model. In this case, this will give us 7 samples: intercept and six other effects. I combine these values according to the contrast coding schema, to get the predicted reading time (on log-scale) for each condition. I.e. the estimate for the intercept would correspond to the reading time in the baseline, target: full match, lure: mismatch condition. If we want to get an estimate for target: two-mismatch, lure: match, we have to sum the four corresponding effects (intercept, two main effects and an interaction), etc. In the end, we obtain six predicted reading times on log-scale, one for each condition, which we exponentiate to convert them back to the linear scale. I repeat the above procedure 3000 times, obtaining 3000 predicted RTs for each condition. I average these predictions and compute their standard deviation. These averages are my estimates of the mean reaction times on the linear scale as predicted by the model. Notice that I do not incorporate uncertainty associated with participants and items into these predictions; essentially, I am predicting reading times for an average participant reading an average item. I am also ignoring variability in individual RTs: I am not trying to see what range of RTs my model predicts, rather what range of mean RTs it predicts.

The main observation I make about this figure is that the RTs predicted by the model are clearly much smaller than the observed RTs. In some cases they are nonsensically small. E.g. if we look at the original study, Target Match, Lure Mismatch condition in the re-read times (“tm-lmm” condition in the figure), we will see that the model predicts average reading times of 20 ms.

4.1.6 Sensitivity analysis

In order to assess the influence of analysis decisions on the inferences I can derive from PP2017 data, I conduct a small scale sensitivity analysis. In addition to the two factors discussed above - corrections for multiple comparisons and treatment of missing observations - I consider two more: treatment of extreme values and the definition of the critical region (PP2017 and S2017 differ in these last two things). Here is the summary of the analysis decisions I manipulate:

1. Definition of critical region. I contrast two possible definitions: the critical region includes only the pronoun, or the pronoun plus three additional characters to the left[11]. PP2017 use the first variant, S2017 the second.
2. Dealing with missing values. I contrast replacing missing values with zeros and rejecting them. PP2017 adopted the first approach; S2017 do not report how they treated the missing values.
3. Dealing with extreme values. I contrast no trimming with fixed threshold trimming, following PP2017 and S2017 correspondingly. For the fixed threshold, any values above 2000ms for first-pass and above 4000ms for total times are rejected.
4. Multiple comparisons corrections. I contrast analyses with no correction at all and analyses with correction for 60 comparisons, as discussed earlier.

[11] Since reflexive pronouns are closed-class words, they may be skipped relatively frequently during reading. Extending the critical region to the left may reduce the amount of missing data resulting from skips.

4.1.6.1 Qualitative analysis

I start with discussing numerical patterns in the data. For the purposes of the discussion I focus on the change in the size of interference effects in target: two mismatch conditions, shown in Fig. 4.5. To recap, the interference effect is calculated as the RT in the target mismatch, lure match condition minus the RT in the target mismatch, lure mismatch condition. Thus, negative values are indicative of facilitatory interference. The columns correspond to variation in extreme data trimming decisions[12]; the rows - to missing data removal decisions; the colors - to the combination of studies and critical region mark-up decisions (see figure description for details). Different eye-tracking measures are represented along the x axis.

[12] Notice that cut-offs for extreme data points were only established for first-pass and total times data; thus, for other eye-tracking measures there is no change across the columns.

When looking at this figure, we are mostly interested in whether the pre-processing decisions affect the size of the interference effects. The answer is: pre-processing choices naturally do have an influence on the estimates, but it is not huge, and the patterns of the results remain relatively stable across different pre-processing variants. Choosing a shorter region naturally leads to shorter reaction times, but the intrusion effect remains almost unaffected. Choosing to reject missing values leads to a reduction of the effect size in some cases, not exceeding roughly 30ms. Finally, the trimming of extreme values does not affect first pass, and leads to a reduction in total times (roughly 40 ms for PP2017 data and roughly 20ms for the replication). As we will see later, we have reasons to believe that the size of the lure-match effect in reflexives lies in the range of roughly 50-100ms, so a reduction of 30 or 40 ms is quite sizeable. Finally, I would like to point out that for all preprocessing variants I observe numerically big intrusion effects in PP2017 and much smaller effects in my replication.
This suggests that the discrepancies we observe cannot be explained by di

4.1.6.2 Quantitative analysis

Now I turn to the discussion of statistical estimates obtained in different analysis procedures. The primary question I am interested in in this section is: do analysis choices influence the inferences we make from the models? I use the same analysis procedure as described earlier, although I automate the simplification of the random effects structure for the non-converging models (it was done manually in the main analysis).

Fig. 4.6 displays the estimated coefficients for the models. Each dot in the plot corresponds to a β̂ value from some model, in a way similar to Fig.4.3. We consider models fit to 12 different datasets, resulting from the variation of pre-processing choices and the study the data was coming from. These 12 datasets are plotted along the Y axis of each plot. The names of analysis variants are compositional: each sub-component corresponds to one dimension we manipulated. The first component corresponds to the study (“repl” - current replication, “pporig” - original study by PP2017); the second component corresponds to the critical region markup (“nonext” - critical region includes only the reflexive itself; “ext” - critical region includes the reflexive and three characters to the left. Notice that only the “nonext” variant is available for PP2017); the third component corresponds to the treatment of extreme values (“notrim” - include all values in the analysis, “trim” - trim values exceeding 2000ms in first-pass and 4000ms in total times); the last component corresponds to missing values treatment (“narm” - remove, “nazero” - replace with zero). Sub-panels of the plot display estimates from models fit to different eye-tracking measures.
Each column displays model estimates for a given fixed effect: TM1/2 - target match, one/two feature mismatch; LM - lure match; ung int - interaction term from the ungrammatical-only model; prws1/2 - pairwise comparisons, corresponding to simple main effects of lure match within the corresponding target match conditions.

Notice that the coefficients are not easily interpretable on their own, since they are on the log scale. Thus, I will only discuss what inferences one would make if one were simply looking at the coefficients and checking whether they are significantly different from zero. I take coefficients with |t| > 2 to be significantly different from zero, and mark them in red in Fig. 4.6. In addition, I consider how the use of a Bonferroni correction affects the conclusions. Following the logic I described earlier for the main analysis, I correct for 60 comparisons, resulting in a Bonferroni-corrected α = 0.05/60 = 0.00083. The estimates which stop being significant are highlighted in yellow in Fig. 4.6.

Let me specify the decision algorithm I will use to draw conclusions from the models. As far as I can tell, this algorithm does not correspond exactly to how decisions were made in PP2017, and may lead to somewhat more conservative conclusions than those reached in the original paper. I will discuss more liberal variants of it as well. I will keep using the model specification from PP2017, although some of the questions we are going to ask might be better addressed with slightly different models. I will use the following decision procedure:

1. First, I will look at three effects in the full models, to answer the question: Do lures affect the processing of reflexive pronouns? A main effect of lure match would suggest that lure match and lure mismatch conditions differ within target match sentences.
Two target match x lure match interactions, one for each non-baseline level of target match, would indicate that the effect of lure match differs between grammatical sentences and - depending on the interaction - one-mismatch or two-mismatch ungrammatical sentences. If I do find at least one significant interaction, I will declare that I have evidence that interference effects exist, and that they behave differently in grammatical and ungrammatical sentences[13]; in this case, I will proceed to the following step of the algorithm. If I do not find any interactions, I will only conclude that lure-match effects exist and stop.

2. If I find at least one interaction in the previous step, I will ask the next question: Does interference behave differently depending on the degree of ungrammaticality? To answer this, I will look at the interaction coefficient from the ungrammatical-only model. If it is significant, I will claim that interference not only differs between grammatical and ungrammatical sentences, but also behaves differently in different types of ungrammatical sentences. If the interaction term is not significant, I will stop and declare that I don't have any evidence that the degree of ungrammaticality affects interference effects.

3. If I do find the interaction in the previous step, I will want to resolve it. To do this, I will look at the simple main effect of lure match within target: one mismatch and target: two mismatch sentences.

[13] Notice that in this model specification, we reach this conclusion somewhat indirectly. We look at differences between grammatical sentences and one-mismatch ungrammatical sentences; similarly for two-mismatch ungrammatical sentences; then, we conclude that grammatical and both types of ungrammatical sentences do or do not differ.
A better way to address this question might be to use Helmert contrasts to first compare grammatical conditions to the average of the two ungrammatical conditions, and then compare the ungrammatical conditions between themselves.

A more liberal version of the algorithm would allow one to proceed to the ungrammatical-only model even if no interactions in the full model are significant. This would roughly correspond to assuming that we are not interested in whether lure match differentially affects grammatical and ungrammatical sentences, only in whether lure match behaves differentially in ungrammatical sentences depending on the degree of ungrammaticality. Another way to make a more liberal decision is not to take the Bonferroni correction into account. I will apply the decision algorithm to all eye-tracking measures, and draw the corresponding conclusions if for any single measure my algorithm allows us to do so[14].

[14] This is yet another degree of freedom in the analysis. As von der Malsburg and Angele (2017) notice, it is somewhat common to make claims about some effect if it is present in any given measure and/or region. In principle, it would be better if we could make predictions about the exact measure in which the effect of interest will appear.

I start with analyzing the PP2017 data. The main effect of lure match never reaches significance regardless of the pre-processing procedure. target match x lure match interactions do reach significance for target: two mismatch conditions in all regions and measures, but only if we choose to remove missing values from the data[15]. These effects survive Bonferroni correction in right-bound and re-read times, but not in first-pass and regression path.

[15] Notice that in the analyses reported in PP2017, this interaction reached significance in regression path and re-read times, even though PP2017 chose to replace missing values with zeros. This discrepancy potentially results from the differences in random effect structures. I discuss it using the regression path model as an example. As far as I could tell, PP2017 only specified random intercepts for both subjects and items, while in my case, I additionally had random slopes, 1 + target.match + lure.match | subject. This was the most complex model that converged during the simplification process described earlier. This difference in random effects specification was apparently enough to push the t-value for the interaction estimate from -2.04 (which would correspond to a significant effect) to -1.58 (non-significant). This suggests that random effects specification has to be carefully considered and preferably reported for all models in a paper.

For the models based on data with rejected missing values (and, if I am being liberal, for the models with missing values replaced with zeros), I can continue to the next step of the decision algorithm and look at the interaction term for the ungrammatical-only model. It turns out to be significant in almost all measures, except for re-read time. However, in none of the reading measures does it survive Bonferroni correction. Thus, if I am being conservative, I would stop here and only be able to claim that intrusion effects do differ between grammatical and ungrammatical sentences, but that I do not have enough evidence for differential effects of lure match depending on the degree of ungrammaticality. If I am being more liberal and ignore the Bonferroni correction, I can look at the simple main effects of lure match within ungrammatical conditions, which always turn out to be significant. Thus, in a liberal version of the decision algorithm, I would be able to claim that interference effects are only observed in two-mismatch conditions.

Summing up, if I ignore the multiple comparisons correction, I would be able to reach the same conclusions as PP2017 do; if I am being conservative, I only have strong enough evidence for the difference in the effect of lure match between 2-feature target mismatch and target match conditions, but not for the difference between the two ungrammatical conditions. A possible interpretation of this weaker result would be: the experiment provides evidence for interference in reflexive resolution contra previous findings (while the original PP2017 conclusions are able to accommodate previous findings due to the claim of a differential interference effect). Overall, it looks like I can reach the original conclusions of PP2017 regardless of the pre-processing, although one needs to be more liberal in the decision-making when using the models with missing observations replaced with zeros.

Let us now turn to the replication data. Similarly to PP2017, the main effect of lure match in the full model never reaches significance. target match x lure match interactions do not reach significance in any analysis variant or eye-tracking measure. Thus, if I am being conservative, I would have to stop here and claim that I do not have strong enough evidence for interference in reflexive resolution. Being somewhat more liberal does not help: even if I look directly at the ungrammatical-only model, I would not be able to claim that I have found evidence for the interference effect, since under no pre-processing scenario does the interaction from the reduced model reach significance. Only if I relax the standards even further and decide that the simple main effects of lure match are good enough evidence for interference would I be able to claim that I found the effect - but only without the correction for multiple comparisons. Summing up, for my data I would have to be very liberal to claim that I have enough evidence for interference.
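The three-step decision procedure applied above can be sketched as a small function. This is a minimal illustration, not the code used for the analyses: the dictionary keys are hypothetical labels modelled on the figure's column names, the t-values in the usage example are invented, and the handling of the "no interaction" branch is one possible reading of step 1.

```python
def decide(full_t, ungram_t, simple_t, crit=2.0, liberal=False):
    """Return the conclusions licensed by a set of t-values.

    full_t   -- t-values from the full model: 'lm', 'tm1:lm', 'tm2:lm'
    ungram_t -- t-value of the interaction in the ungrammatical-only model
    simple_t -- t-values of the simple effects: 'prws1', 'prws2'
    crit     -- |t| threshold (2.0 uncorrected; 3.34 Bonferroni-corrected)
    liberal  -- if True, go on to step 2 even without a full-model interaction
    """
    def significant(t):
        return abs(t) > crit

    conclusions = []

    # Step 1: do lures affect the processing of the reflexive?
    if any(significant(full_t[k]) for k in ("tm1:lm", "tm2:lm")):
        conclusions.append("interference differs by grammaticality")
    else:
        if significant(full_t["lm"]):
            conclusions.append("lure-match effect exists")
        else:
            conclusions.append("no evidence for lure-match effects")
        if not liberal:
            return conclusions

    # Step 2: does interference depend on the degree of ungrammaticality?
    if not significant(ungram_t):
        conclusions.append("no evidence for an effect of degree of ungrammaticality")
        return conclusions
    conclusions.append("interference depends on degree of ungrammaticality")

    # Step 3: resolve the interaction with the simple effects of lure match.
    for key in ("prws1", "prws2"):
        if significant(simple_t[key]):
            conclusions.append("lure-match effect in " + key)
    return conclusions


# Invented t-values, for illustration only.
full = {"lm": -0.5, "tm1:lm": -1.0, "tm2:lm": -3.5}
simple = {"prws1": -0.8, "prws2": -3.0}
print(decide(full, -2.5, simple, crit=2.0))   # uncorrected threshold
print(decide(full, -2.5, simple, crit=3.34))  # Bonferroni-corrected threshold
```

With these invented values the uncorrected threshold carries the procedure through all three steps, while the corrected threshold stops it at step 2, which mirrors the contrast between the liberal and conservative readings discussed above.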
Overall, while the sensitivity analysis suggests that the evidence presented by PP2017 may be somewhat weaker than claimed in the paper, the replication study provides even weaker evidence for the interference effect (in the conservative case, it provides no reliable evidence at all).

4.1.7 Discussion

The main goal of this experiment was to check whether it was the mere presence of QPs in the set of experimental materials which prevented us from observing reliable interference effects in the previous chapter. The answer is negative. While the overall patterns of (average) reading times appeared to be similar, the magnitudes of the reading times were less extreme in the replication. The magnitude of the interference effect was at least two times smaller than in the original study, and was comparable to that observed in my Exp.3-4. The statistical analyses also did not provide strong support for the existence of interference effects. While the choice of analysis procedure did have some influence on the conclusions (e.g. only allowing for weaker claims in the case with missing values replaced with zeros), the overall impression still held: the data from the replication provide less statistical support for the existence of the interference effect than the data from the original study.

I also intended to use this experiment to tease apart the PP2017 and S2017 accounts. Had I observed reliable interference effects here, it could have been taken to support PP2017: their account can accommodate the influence of context on the resolution process much more readily than S2017. However, since it does not appear that the presence of QPs affected the size of the lure-match effects, I cannot use these data to help distinguish between the two accounts. In the next experiment I address a different possible context effect: the influence of the experimenter's language characteristics.
I will use the same exact materials as in the current experiment, only this time all the data will be collected by a native experimenter.

4.2 Experiment 6

In the previous experiment I failed to find strong support for interference effects from NP lures, despite using the same exact materials as Parker and Phillips (2017). I did find numerical trends in the predicted direction, but they were not supported by statistics. The only systematic explanation for the lack of replication I can think of is the influence of the experimenter: perhaps, instructions in non-native English affected participants' processing of ungrammatical sentences. The current experiment aims at testing this possibility. It exactly replicates my previous experiment, with a single difference: all the data were collected by native English speakers[16].

[16] I would like to thank Lalitha Balachandran and Cassidy Wyatt for their help with data collection.

4.2.1 Participants

35 members of the University of Maryland community (23F, 12M; age range: 18-34, average age: 21.4) participated in the experiment for class credit or a payment of $12. The experimental session took about one hour, including setup and calibration. Data from two participants were removed: one participant did not pay attention during calibration and was distracted during the experiment; for the other, the experimental software could not adequately process the data. Thus the analysis was based on the data from 33 participants[17].

4.2.2 Materials

The materials were identical to Experiment 5.

4.2.3 Procedure

The procedure was identical to Experiment 5.

4.2.4 Analysis

The analysis procedure was identical to Experiment 5. Again, I first follow the analysis procedure as outlined in Parker and Phillips (2017), to be as close as possible to the original study.
Then I perform an additional set of analyses with a different pre-processing procedure, which I consider better for the reasons discussed earlier in this chapter (its two main differences are a wider critical region and removing missing values instead of replacing them with zeros).

[17] I aimed at 48 participants in order to have the same power as in the previous replication, but could not reach the goal due to time limitations.

4.2.5 Results

I start with the observed patterns of mean reaction times, which are shown in Fig. 4.7. Numerical values for the means are given in Appendix A. I compare the results from the current experiment (green dots), my previous replication in Exp.5 (blue dots) and the means from the original study by PP2017 (red dots). The layout of the figures is identical to Figs. 4.1 and 4.2 from the previous chapter and was described in detail there. Here, I will focus the discussion on the critical region, since this is where PP2017 observed the interference effect most consistently.

The overall patterns of means seem to be roughly similar in all three studies. The first thing I note about the numerical values is that for some reason the average reading times are shorter than in both PP2017 and Exp.5. All three experiments look most similar within 2-feature target mismatch conditions in terms of the pattern of the RTs: lure match conditions are read faster than lure mismatch conditions. However, in terms of the magnitude of the interference effect[18] in 2-feature target mismatch conditions, the current experiment is closer to my first replication than to the original study (Fig. 4.8[19]), and again the effect is smaller in magnitude than the interference effects from Parker and Phillips (2017).

Let us now turn to the statistical analyses. As in Exp.5, I present model estimates from the current experiment and from the original PP2017 experiment side-by-side on the upper panel of Fig. 4.9.
The layout is essentially the same as in Fig. 4.3 from the previous experiment, except that we are presenting two sets of comparisons in the upper and the lower panels, and the estimates only come from the models fit to the data from the critical region. Coefficients for which |t| ≥ 2 are represented with square markers; triangles indicate coefficients which stop being significant after a multiple comparisons correction. Intercept estimates are not included in the plot.

[18] As a reminder, calculated as the difference between lure: match and lure: mismatch conditions. Negative values indicate that lure: match conditions are read faster.

[19] Notice the change in the plot layout: eye-tracking measures are plotted along the x-axis, and columns correspond to ROIs.

I will focus my discussion on the critical region. To remind, for the full models, I can expect three estimates to reflect the interference effect:

• Main effect of lure match, which would indicate that reading times on the reflexive differ for matching and mismatching lures, if we compare target: match and target: two mismatch conditions;

• target match x lure match interaction for target: one mismatch conditions. It would indicate that reading times are differentially affected by lure (mis)match in target: match and target: one mismatch conditions.

• Similarly for the target match x lure match interaction for target: two mismatch conditions.

In the current experiment, no effects of interest reach significance. Two other effects, unrelated to lure match, were significant: the two main effects of target match, indicating that ungrammatical sentences with mismatching lures are read slower than grammatical sentences with mismatching lures. Both these effects survive Bonferroni correction based on the same calculations as in the previous experiment[20].

Compare these results to the original study and Exp.5. The pattern of results in re-read times is perhaps the most similar between the three.
Two main effects of target match are most consistent. For one-mismatch conditions, they reach significance in two studies, and remain significant after Bonferroni correction in one of them; for two-mismatch conditions, they reach significance in all three studies and remain significant in two of them after the multiple comparisons correction. The magnitude of the estimates is also very similar. These results suggest that the ungrammaticality of the sentences strongly affects participants. In other eye-tracking measures, it appears that the two replications are closer to each other than to the original study: the magnitude of the estimates in the replications is smaller; very few effects reach significance and none survive the multiple comparisons correction.

[20] To remind, I am correcting for 3 ROIs x 4 MOIs x 5 comparisons in each = 60 comparisons, which results in a critical α of 0.00083 and the corresponding t-value of ±3.34.

Exploratory analysis. The sensitivity analysis I performed for the previous experiment suggested that out of the three pre-processing parameters I manipulated - definition of the critical region, trimming of the extreme values and treatment of missing values - the last one had the biggest impact on statistical estimates. Thus, I briefly discuss what would happen had I chosen to remove the missing values instead of replacing them with zeros in the main analysis. Fig. 4.10 shows the mean RTs, Fig. 4.11 shows the size of the interference effect and Fig. 4.12 shows the coefficients from statistical models.

As we can see, the patterns of the RTs and interference effects are not considerably affected by the pre-processing routine, except that the average RTs get somewhat more similar between the three studies in regression path and re-read times. Statistical patterns also remain roughly the same: the only effects that reach significance are the main effects of target match (i.e. grammaticality effects).
They are now significant in all eye-tracking measures, but most of them do not survive the Bonferroni correction. Still, this indicates that there is evidence for grammaticality effects in the data. It is interesting to note that the estimates for the effects which did not reach significance are rather similar across all three studies, even more so than in the main analysis. On the other hand, the estimates for the effects indicative of interference are bigger in the original study than in the two replications. This again suggests that the effects reported by PP2017 may be overestimated.

4.2.6 Discussion

The results of my second replication appear to pattern with my first replication and not with the original study. The lure-match effect in 2-feature target mismatch conditions was numerically going in the same direction as in the original PP2017 study, but the magnitude of the effect was at least two times smaller. Statistical analyses do not provide support for lure-match effects. This is like my first replication, and unlike the original, where statistics supported the presence of the effects in all analyzed eye-tracking measures. Overall, the study does not indicate that the language of the experimenter was what caused the absence of lure-match effects in my Exp.2-4. These conclusions are supported by the results of the exploratory data analysis: regardless of the pre-processing procedure I choose, my conclusions hold.

This experiment, like the previous one, fails to provide evidence for the influence of experimental context on the processing strategies. Thus, these results do not provide support for the PP2017 account, as I hypothesized they might. However, they provide valuable evidence on the size of lure-match effects we may expect in reflexive resolution, and this information may be used to guide power analyses for future studies. The next section discusses these issues in more detail.
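To illustrate how an effect-size estimate can feed into a power analysis, and why significant estimates from low-powered designs tend to be too large, here is a minimal closed-form sketch. It assumes that the estimate is normally distributed around the true effect with a known standard error; the 80 ms effect and the two standard errors are hypothetical values chosen for illustration, not quantities estimated from any of the experiments.

```python
from statistics import NormalDist


def power_and_exaggeration(true_effect, se, alpha=0.05):
    """Power of a two-sided z-test, and the expected ratio of a
    significant estimate's absolute size to the true effect.

    Assumes estimate ~ Normal(true_effect, se); true_effect must be non-zero.
    """
    std = NormalDist()
    cutoff = std.inv_cdf(1 - alpha / 2) * se   # |estimate| needed for significance
    upper = (cutoff - true_effect) / se        # standardized upper cut-off
    lower = (-cutoff - true_effect) / se       # standardized lower cut-off
    p_upper = 1 - std.cdf(upper)               # P(estimate > cutoff)
    p_lower = std.cdf(lower)                   # P(estimate < -cutoff)
    power = p_upper + p_lower
    # Partial expectations of |estimate| over the two significant tails
    # (standard truncated-normal results).
    tail_upper = true_effect * p_upper + se * std.pdf(upper)
    tail_lower = -true_effect * p_lower + se * std.pdf(lower)
    exaggeration = (tail_upper + tail_lower) / power / abs(true_effect)
    return power, exaggeration


# Hypothetical design values: the same true effect measured with a
# smaller and a larger standard error (e.g. a larger vs. smaller sample).
for se in (30.0, 60.0):
    power, ratio = power_and_exaggeration(80.0, se)
    print(f"se={se:.0f} ms: power={power:.2f}, exaggeration ratio={ratio:.2f}")
```

Under these assumptions, the larger standard error yields both lower power and a larger expected exaggeration of significant estimates, which is the qualitative pattern at issue when comparing a small original study to larger replications.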
4.3 Interference magnitude across studies

In this section I compare the magnitude of the intrusion effect in the few studies which have looked at reflexive resolution in the 2-feature target mismatch configuration, to see whether the results I have obtained in the studies I have run are unexpectedly small. I start by looking at five experiments which were maximally close to each other in terms of materials and manipulation: PP2017 Exp.3, its two replications (my Exp.5 and 6), S2017 Exp.1c (speech verbs), and Exp.4 from this thesis. PP2017 Exp.3 and its two replications used the same materials (experimental sentences and fillers). My Exp.4 overlapped with S2017 Exp.1c in a subset of experimental sentences (both experiments included additional conditions which were not similar; the fillers were also different). In all studies, the stimuli shared the following characteristics:

• Two-clause sentences with the reflexive inside a complement clause;
• The matrix predicate is a predicate of report (e.g. “say”, “mention”, etc.)[21];
• The reflexive and lure are always singular;
• The lure is the subject of the matrix clause;
• The target is the subject of the embedded clause;
• The lure can match or mismatch the reflexive in gender;
• The target mismatches the reflexive in gender and number[22];
• The target and the lure are referential NPs.

[21] This parameter was controlled in S2017; it was not in PP2017, but in their materials about 70% of the verbs were report verbs, so I consider the materials to be comparable along this dimension.

[22] PP2017 and its replication had an additional condition with the target mismatching the reflexive only in number, but S2017 and Exp.4 from this thesis did not have it. Since the intrusion effect appears only in the two-mismatch condition, I will not focus on the additional one-mismatch condition in PP2017 and its replication.

Fig. 4.13 shows the interference effect in 2-feature target mismatch conditions, calculated as the difference between lure match and lure mismatch conditions. Different studies report different reading time measures, which is why there are gaps in the plots. Two observations are of interest. First, the intrusion effect is generally much bigger in PP2017 than in the other studies (except in regression path, where the magnitude of the intrusion is similar in PP2017 and S2017). Second, the magnitude of the intrusion appears to be reduced in my studies as compared to the corresponding original studies. Importantly, the magnitudes of facilitation from the two replications are very close to each other. This suggests that the PP2017 estimates might be inflated, and that my estimates might be closer to the underlying effect.

Let us now consider interference effects in all studies investigating reflexive resolution in 2-feature target mismatch conditions with c-commanding lures. In addition to the experiments above, this list includes Exp.3 from this thesis, two more experiments by PP2017 and four more experiments by S2017. The magnitude of the interference effect in these studies is displayed in Fig. 4.14. The picture is perhaps most consistent in total times: the effect reported by PP2017 completely overshadows the other effects, but otherwise the size of the effect is surprisingly similar in the experiments by S2017, and the magnitude of the effect from the two replications I report is in line with them as well. The effects from Exp.3 and 4 are somewhat smaller, especially the first one. Finally, the lure-match effect in S2017 Exp.1 for sentences with perception embedding verbs is almost non-existent, but this difference is accounted for by the S2017 account (and, as I have argued in Chapter 1, by PP2017 as well).
Turning to other reading time measures, the picture is very similar in first-pass, right-bound and re-read times: the magnitude of the interference effect in PP2017 Exp.3 is bigger than in all other studies, which are relatively similar (with the exception of Exp.3 from this thesis). Regression path shows the most heterogeneous picture, although even here the PP2017 effect in Exp.3 is numerically the biggest, contested only by the effect from S2017 Exp.1 for sentences with speech verbs.

The comparisons discussed above do suggest that the intrusion effect in PP2017 Exp.3 is inordinately big. I do not think it is due to the particular manipulation used: S2017 Exp.1b and 2b rely on the same manipulation, and yet generally yield numerically smaller effects. The native language of the experimenter, at least in this set of studies, does not uniquely determine the size of the effect either: in both my Exp.6 and S2017 the data were collected by native English speakers, and the intrusion effects are smaller than in PP2017. It thus seems more likely that the difference in the effect size is due to some unmeasured factors; the fact that PP2017 only collected data from 24 participants might have contributed to the surprising size of the effect. As Gelman and Carlin (2014, a.o.) discuss, when the power is low, significant effects tend to be overestimates; the magnitude of the overestimation tends to infinity as power tends to zero. While we would need to conduct a formal power analysis to make specific claims, it is true that PP2017 Exp.3 relies on the smallest sample size among all the studies discussed in this section, so it presumably has the lowest power and would have the biggest overestimation ratio among them all.

Another suggestive piece of information comes from a simulation study by Engelmann et al. (draft4). They investigate the influence of the LV05 model parameters on the size of the lure-match effect predicted in the model (p.13).
The magnitude of the majority of the predicted facilitation effects (i.e. the type of effect I am interested in here) lies below 100 ms, and effects of the size reported by PP2017 are rather rare.

4.4 Pooled data analysis

The previous exploratory analyses indicated that the numerical lure-match effects I observed might be real but fail to receive statistical support due to the lack of power. In this section I report an exploratory analysis on the pooled data from both replications - this will more than triple the original sample size used by Parker and Phillips (2017).

First, I wanted to make sure that the data from the two replications are homogeneous enough to be pooled. To this end, I run a bootstrapping simulation, sampling 48 subjects with replacement from each of the three experiments - the original one by Parker and Phillips (2017) and the two replications I report here. I draw a thousand samples from each dataset, and calculate the magnitude of the lure-match effect in 2-feature target mismatch conditions for different eye-tracking measures in the critical region. Fig. 4.15 represents the distributions of the bootstrapped lure-match effect magnitudes. As in most plots in this chapter, columns correspond to different eye-tracking measures. The magnitudes of the interference effects observed in the simulation are plotted along the Y axis; the height of the distribution at any given point roughly[23] corresponds to the number of observed effects of a given magnitude. Colors of the distributions correspond to the dataset which the simulations were based on (see the figure's legend).

We can see that the distributions of possible effect sizes are very similar between the two replications. They look almost identical in regression path and total times; they are not as similar in other eye-tracking measures but are still located and shaped quite similarly to each other.
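The bootstrapping procedure described above can be sketched as follows. This is an illustrative simplification rather than the code used for Fig. 4.15: it assumes that each subject is represented by a pair of mean RTs (lure match, lure mismatch) in the 2-feature target mismatch conditions for one eye-tracking measure, and the toy data, function names and seed are hypothetical.

```python
import random
from statistics import mean


def lure_match_effect(subjects):
    """Lure match minus lure mismatch mean RT; negative values mean facilitation.

    `subjects` is a list of (lure_match_rt, lure_mismatch_rt) pairs,
    one pair per subject.
    """
    match_mean = mean(match for match, _ in subjects)
    mismatch_mean = mean(mismatch for _, mismatch in subjects)
    return match_mean - mismatch_mean


def bootstrap_effects(subjects, n_subjects=48, n_samples=1000, seed=0):
    """Resample subjects with replacement and recompute the effect each time."""
    rng = random.Random(seed)
    return [
        lure_match_effect(rng.choices(subjects, k=n_subjects))
        for _ in range(n_samples)
    ]


# Toy per-subject means in ms, invented for illustration.
toy_subjects = [(410, 480), (455, 500), (380, 470), (520, 560), (430, 445)]
effects = bootstrap_effects(toy_subjects)
print(len(effects), min(effects), max(effects))
```

The resulting list of effects is what a density or violin plot would summarize; comparing such distributions across datasets is one informal way to judge whether datasets are similar enough to pool.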
On the other hand, the effect sizes one can obtain from the PP2017 data are bigger and practically do not overlap with the replications' distributions. I conclude that the data from the two replications are homogeneous enough to be pooled for an exploratory statistical analysis.

[23] Although not precisely, since it is not a histogram, but rather a smoothed version of it.

Fig. 4.16 shows the results of statistical modeling with the pooled dataset. The layout is similar to Fig. 4.3. The data only come from the critical region. As in Exp.6, for the sake of exploration I check whether the treatment of missing values affects my conclusions. We can see that grammaticality effects (main effects of target match) reach significance in all measures at the critical region; this is not surprising given that even in the individual experiments the evidence for the grammaticality effects was quite robust. On the other hand, evidence for lure-match effects still remains rather elusive. In the analysis with missing values removed, the target match x lure match interaction for target: 2-mismatch conditions reaches significance in re-read and total times. Neither of these is significant if I choose to replace missing values with zeros. This is probably due to a larger standard error of the estimate in this case, since the magnitude of the estimates is rather similar between the two analysis procedures. Overall, the analysis of the pooled data suggests that a moderate increase in sample size may boost the statistical power enough to make the lure-match effect more readily detectable.

4.5 General discussion

The goal of the two experiments reported in this chapter was two-fold. First, I wanted to determine how reliable my results from Exp.2-4 were. To remind, my question there was whether QP lures elicit lure-match effects in the same way as NP lures do. I failed to find statistical support for interference from either NP or QP lures.
Numerically the RTs did go in the right direction (lure match sentences were read faster within target-mismatch conditions only) when the QP lure c-commanded the pronoun, but the magnitude of the effect was reduced as compared to PP2017. This made me worry that the experimental manipulation did not work in some subtle way, and the results were completely uninterpretable. The experiments and analyses reported in the current chapter allowed me to partially assuage this worry.

In both replication experiments I found a lure-match effect which is numerically smaller than that reported in PP2017. As in Exp.2-4, this effect does not reach significance in the statistical analyses. However, the comparison of the effect sizes across multiple related studies suggests that it is PP2017 and not my results which are outliers: effects of about 70-100 ms (as in my studies) are more common than those of 200-250 ms (as reported by PP2017). Importantly, the magnitude of the lure-match effects in Exp.4 lies in the same range. This suggests that the experiment did, indeed, work as intended. Thus, I conclude that the interpretation of the data I tentatively suggested at the end of the previous chapter holds: c-commanding QP and NP lures elicit similar lure-match effects, suggesting that both types of lures influence the reflexive resolution process. This supports the conclusion that the differential patterns of RTs between NP and QP lures I observed in Experiment 2 are not due to QPs being inefficient lures, and may truly indicate the inaccessibility of non-c-commanding QPs to the parser.

Influence of the context. A second question I addressed in this chapter was whether experimental context (in terms of the stimuli set or experimental environment) has an effect on the magnitude of the lure-match effect.
I argued that if contextual factors did indeed affect the magnitude of the interference, it would provide support for the PP2017 account, which could explain such influence much more easily than S2017. However, the results suggest that the context (at least the two factors I explored) does not influence the size of the lure-match effect. The size of the lure-match effect I observed was consistent regardless of whether the experiment was run with a stimuli set containing QPs or with a set without them, and by a native or a non-native experimenter. Therefore, there is no direct evidence for preferring PP2017 over S2017 coming from this set of experiments.

Lack of statistical significance. While the comparison of effect sizes is informative, the six experiments I conducted so far provided extremely limited statistical evidence for the existence of lure-match effects. Given that lure-match effects do appear to be relatively consistent in direction and magnitude across my and some of the previous studies, I think that the lack of statistical support is to a large degree due to statistical reasons. As I have already mentioned, under-powered studies tend to produce over-estimates of significant effects (Gelman and Carlin, 2014). A recent report of a large reproducibility project by the Open Science Collaboration (2015) provides empirical evidence for this claim: out of 100 replication attempts, 97 original effects were significant at the 0.05 level, while only around 35% of the replication effects were so. Similarly, 82 out of 99 original studies for which effect size could be calculated showed a bigger effect size than the replication.

While this explanation is tempting, and, I think, has substance to it, it cannot be exactly right: S2017 used very similar sample sizes, but consistently found statistically robust effects. I return to this issue in the closing chapter of the thesis.

Figure 4.4: Predicted and observed RT values for PP2017 and my replication.
Columns correspond to eye-tracking measures: fp - first-pass, rb - right-bound times, rp - regression path, rr - re-read times (all RTs come from the critical region). Rows correspond to datasets (the upper one - current experiment; the lower one - original PP2017 data). Error bars for the observed data represent standard error of the estimate, adjusted for participant variability (Cousineau, 2005; Morey, 2008). Error bars for the predicted data represent standard deviation for the simulated means. Conditions: tm/tomm/ttmm - target match/one-mismatch/two-mismatch; lm/lmm - lure match/mismatch.

Figure 4.5: Sensitivity analysis: Intrusion effect (lure mismatch minus lure match) in 2-feature target mismatch conditions. Columns correspond to variation in extreme values treatment (“notrim” - include all values in the analysis, “trim” - trim values exceeding 2000ms in first-pass and 4000ms in total times); rows correspond to missing values treatment (remove or replace with zero); colors correspond to the combination of study and critical region markup (“nonext” - critical region includes only the reflexive itself; “ext” - critical region includes the reflexive and three characters to the left). Only the “nonext” variant is available for PP2017. Error bars represent standard error of the difference of the means.

Figure 4.6: Sensitivity analysis: Model coefficients in PP2017 Exp.3 and replication. Error bars represent standard deviation of the estimates. See text for further description of column and row labels. Coefficients for which |t| ≥ 2 are represented with red markers if they remain significant after a Bonferroni correction (with the corrected |t| = 3.34), and with yellow markers if they don’t.

Figure 4.7: Mean RTs in PP2017 (red), my previous replication (blue) and my current replication (green). Error bars represent standard error of the mean. Conditions: tm/tomm/ttmm - target match/one-mismatch/two-mismatch; lm/lmm - lure match/mismatch.
Figure 4.8: Interference effect in target: two mismatch conditions (lure match - lure mismatch) in PP2017 (red), my previous replication (blue) and my current replication (green). Error bars represent standard error of the difference of the means.

Figure 4.9: Model estimates in PP2017 (red) and my replication (blue). First five effects come from the full model; next three represent the simple main effects of lure match and the interaction from ungrammatical-only model. The order of the coefficients is the same as in PP2017, to make comparisons across papers easier. Error bars represent standard error of the estimate. Coefficients for which |t| ≥ 2 are represented with squares if they remain significant after a Bonferroni correction (with the corrected |t| = 3.34), and with triangles if they don’t.

Figure 4.10: Mean RTs in PP2017 (red), my previous replication (blue) and my current replication (green). Exploratory analysis: removing missing values instead of replacing them with zeros. Error bars represent standard error of the mean. Conditions: tm/tomm/ttmm - target match/one-mismatch/two-mismatch; lm/lmm - lure match/mismatch.

Figure 4.11: Interference effect in target: two mismatch conditions (lure match - lure mismatch) in PP2017 (red), my previous replication (blue) and my current replication (green). Exploratory analysis: removing missing values instead of replacing them with zeros. Error bars represent standard error of the difference of the means.

Figure 4.12: Model estimates in PP2017 (red) and my replication (blue). Exploratory analysis: removing missing values instead of replacing them with zeros. First five effects come from the full model; next three represent the simple main effects of lure match and the interaction from ungrammatical-only model. The order of the coefficients is the same as in PP2017, to make comparisons across papers easier. Error bars represent standard error of the estimate.
Coefficients for which |t| ≥ 2 are represented with squares if they remain significant after a Bonferroni correction (with the corrected |t| = 3.34), and with triangles if they don’t.

Figure 4.13: Intrusion effect in five studies with most similar design (PP2017 Exp.3, its two replications (this thesis Exp.5 and 6), S2017 Exp.1c, this thesis Exp.4). Error bars represent standard error of the difference of the means. Reaction times are shown for the reflexive region. Columns correspond to reading time measures: fp - first-pass, rb - right bound time, rp - regression path, rr - re-read time, tt - total time.

Figure 4.14: Intrusion effect in all studies investigating 2-feature target mismatch configurations (PP2017, S2017, the current study). Error bars represent standard error of the difference of the means. Reaction times are shown for the reflexive region. Columns correspond to reading time measures: fp - first-pass, rb - right bound time, rp - regression path, rr - re-read time, tt - total time.

Figure 4.15: Distribution of lure-match effect sizes in 2-feature target mismatch conditions obtained in a bootstrapping simulation. Sampling 48 participants with replacement. Columns correspond to reading time measures: fp - first-pass, rb - right bound time, rp - regression path, rr - re-read time, tt - total time.

Figure 4.16: Model estimates for the models fit to the pooled data from Exp.5 and 6. Error bars represent standard error of the estimate. Dot size indicates whether the estimate was statistically significant (|t| ≥ 2) (big dots) or not (small dots). Colors correspond to the pre-processing variant used (“parker” - critical region comprises the reflexive alone, missing values are replaced with zeros; “sloggett” - critical region includes the reflexive plus the last three characters of the preceding word, missing values are removed). Models are fit to the data for the critical region only.
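The participant-level bootstrap summarized in the caption of Figure 4.15 can be illustrated with a minimal sketch. It assumes that each participant contributes one lure-match effect (a difference of condition means, in ms) and that 48 participants are drawn with replacement on every iteration; the function name, the number of iterations, and the toy data are hypothetical and are not the thesis's actual procedure or data.

```python
import random
import statistics


def bootstrap_mean_effect(effects, n_participants=48, n_boot=2000, seed=0):
    """Resample per-participant effects with replacement and return
    the list of resampled mean effects (one per bootstrap iteration)."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(effects, k=n_participants))
        for _ in range(n_boot)
    ]


# Toy per-participant lure-match effects in ms (made-up values, NOT thesis data).
toy_rng = random.Random(1)
toy_effects = [toy_rng.gauss(80.0, 300.0) for _ in range(81)]

dist = sorted(bootstrap_mean_effect(toy_effects))
lower = dist[int(0.025 * len(dist))]       # approximate 2.5th percentile
upper = dist[int(0.975 * len(dist)) - 1]   # approximate 97.5th percentile
print(f"bootstrap mean effect: {statistics.fmean(dist):.1f} ms")
print(f"approximate 95% interval: [{lower:.1f}, {upper:.1f}] ms")
```

Comparing such distributions across studies shows how much of the range of plausible effect sizes is shared between them; a smoothed density of `dist` would correspond to one panel of the figure.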
Chapter 5: Conclusions

In this thesis I attempted to answer the question: does structural information constrain real-time sentence processing in a way consistent with grammatical generalizations? Prima facie, the question has already been answered: a wealth of experimental evidence suggests that sometimes structural information appears not to rule out structurally inappropriate antecedents. We know that measures of processing difficulty, such as reading times and acceptability judgments, are affected by structurally irrelevant material. Agreement attraction is the primary example, and similar behavior has been reported for NPI licensing and reflexive resolution. However, as I discussed, this evidence can be explained in a multitude of ways, only some of which - in particular, cue-based parsing - do indeed assume that structural information fails to uniquely direct dependency resolution. I have argued that cue-based models should be preferred both on theoretical and empirical grounds: they provide wide empirical coverage and serve as a solid working hypothesis, allowing one to easily formulate and test new predictions. In addition, they have been argued to be the only kind of models which explained the absence of ungrammaticality illusions in agreement attraction. If this preference is correct, we have quite good reasons to believe that structurally inappropriate phrases are considered by the human parser.

However, several recent studies have raised doubts about whether cue-based models do indeed constitute our best choice. Two papers attack what had been the empirical cornerstone of the argument for cue-based models: their ability to explain the evidence from grammatical agreement attraction sentences. A meta-analysis and computational modeling by Jäger et al. (2017) suggest that cue-based models do not in fact accommodate the reading time patterns in grammatical sentences with agreement attraction. Hammerly et al.
(draft.april.2018) go in a similar direction: they suggest that the absence of ungrammaticality illusions in the data is artifactual, and that the cause of agreement attraction lies in a badly constructed sentence representation which is accessed in full agreement with the grammar. Another work attacks the cue-based explanation for the reflexive data (Sloggett, 2017): instead of a processing error account, it proposes a model which explains empirical evidence from reflexive resolution in terms of a fully grammatical resolution strategy - interpreting reflexives as logophors.

In this thesis I was looking for evidence which could help to support or weaken these recent claims. To do so, I focused on recent findings by Parker and Phillips (2017) and Sloggett (2017). Both of them show that reflexive resolution can be affected by structurally inappropriate antecedents. But the explanations they give are diametrically opposite. Parker and Phillips (2017) suggest that the effects they observed are a result of a processing error - faulty memory access which fails to be uniquely constrained by the structural information. In contrast, Sloggett (2017) argues that these effects constitute evidence for a logophoric interpretation of the pronoun, which he hypothesizes to be a fully grammatical, although underutilized, strategy of anaphora resolution in English. This approach assumes that structural information always uniquely leads the parser to only consider grammatically appropriate antecedents. Neither of the accounts fully covers the available evidence.

Thus, the primary goal of this thesis was to investigate the predictions of these two accounts in order to choose between them (and as a consequence - between “structure-defeasible” and “structure-strict” models). I have reported six experiments: four original and two direct replications of the previous studies.
One experiment investigated interference from lures embedded inside relative clauses under an animate matrix subject - a configuration where the PP2017 and S2017 accounts make differential predictions. Three experiments investigated the behavior of QP lures, with the aim of better understanding the role of c-command in on-line reflexive resolution. Two experiments were direct replications of Parker and Phillips (2017) Experiment 3, investigating possible roles of extra-linguistic context in lure-match effects, as well as the reliability of the previously reported findings.

The experiments I report are quite intimately intertwined: most of them both provide a piece of information which is valuable on its own, and help to narrow down the interpretation of the previous results. In the following section, I summarize my main findings and discuss how they could help us choose between the two types of accounts.

5.1 Choosing between PP2017 and S2017

5.1.1 On sub-command binding

The first question I asked in my thesis was: do we have reasons to believe that lure-match effects from non-c-commanding lures reported by PP2017 are evidence for the parser accessing structurally illicit antecedents? If they are, it would constitute a problem for the S2017 logophoric account. To capture them without resorting to a processing problem explanation, S2017 suggested that they represent an instance of sub-command binding. This phenomenon occurs in Chinese; an anaphor can be bound by a non-c-commanding animate antecedent if it is contained within a c-commanding inanimate NP. If the S2017 explanation were true, the lure-match effects would only appear when the embedding NP was inanimate. The results of Experiment 1 speak against this explanation. I observed lure-match effects regardless of the animacy of the matrix subject. The only significant effect of lure-match was likely driven by the effect within “inanimate” conditions only.
However, at the critical region the magnitude of the effect in the “animate” conditions was mostly as big as or bigger than in the “inanimate” conditions, and approached the size of the effect from c-commanding NP lures in my other experiments.

Therefore, I tentatively conclude that lure-match effects from non-c-commanding lures are real and are not instances of sub-command binding. Taken in isolation, this finding would support a processing error explanation over the grammatical one, i.e. “structure-defeasible” models over “structure-strict” ones. Such a conclusion would weaken the S2017 logophoric account: if we have evidence that at least in some cases the parser can violate structural constraints (or maybe: cannot help but violate them), what forces it not to do so when resolving the very same dependency in similar configurations? However, as I discuss in the next section, these conclusions would be too strong. Instead, as I will argue, they rather support a model in which the parser faithfully and accurately follows constraints that imperfectly align with the grammar.

5.1.2 On the role of c-command

In my Experiments 2-4 I focused on QP lures, asking the question: are PP2017 correct in claiming that any kind of structural information can be outweighed by competing factors (such as morphology), or can some syntactic features categorically rule out inappropriate antecedents?

Experiment 2 investigated non-c-commanding QP lures. I hypothesized that only if morphology can completely outweigh structural information would we observe lure-match effects from non-c-commanding QP lures. The results do not appear to support this hypothesis. First of all, I did not find robust statistical support for lure-match effects from QP lures. This is not very telling, since in general my experiments failed to produce statistically significant results in most of the cases. However, QP lures appeared to behave differently than NP lures in numeric patterns.
QP lures which matched the reflexive either had no effect or caused a slow-down. In contrast, matching NP lures caused facilitation in both Experiment 1 and 2.

I have established that the lack of facilitation is not due to several possible confounds. First, it cannot be due to QPs being poor lures for some reason: c-commanding QP lures which matched the reflexive did cause a speed-up relative to non-matching ones. Second, Experiments 5 and 6 suggest that context factors such as the composition of the stimuli set and the experimenter’s language do not affect the magnitude of the lure-match effect. Third, small effect sizes cannot be the result of the lures being embedded under animate subjects, as S2017’s sub-command hypothesis would suggest, as shown by the Experiment 1 results. Finally, the fact that QP lures provoked inhibitory, rather than facilitatory, interference is reminiscent of the numerical trend observed in Kush et al. (2015) Exp.2. There, people were slower to read pronouns like “he” in sentences like “The troop leaders [that no boy/girl scout respected] scolded him” when the embedded QP matched the gender of the pronoun, suggesting that our observations are not spurious.

Experiments 3 and 4 showed that in contrast to non-c-commanding QP lures, the c-commanding ones do cause lure-match effects. The magnitude and the direction of the effects were mostly consistent for NP and QP lures, suggesting that the parser treats the two lure types similarly when they c-command the reflexive. While the magnitude of the effects was smaller than in the original PP2017 study, it was consistent with the two replications I ran in Exp.5 and 6.

This contrast between c-commanding and non-c-commanding lures makes me conclude that c-command information is able to successfully restrict the range of possible antecedents.
The fact that the parser appears to treat non-c-commanding NP and QP lures differently also suggests that c-command information is represented in an approximate way, perhaps using the accessible feature of Kush et al. (2015).

5.1.3 On the reliability of lure-match effect

5.1.3.1 Magnitude of the effect

Throughout the thesis I have observed lure-match effects which were going in the right direction numerically (faster RTs in conditions with matching lures) but were consistently smaller than those reported in the original study by Parker and Phillips (2017). In almost no case did these effects receive statistical support in my experiments. I have ruled out several potential explanations for the reduced magnitude. First, I made sure that the experiments worked on a basic level - as expected, people spent more time reading longer words and skipped them less frequently. Grammatical characteristics of the stimuli did affect people’s RTs - quite consistently, I observed grammaticality effects, with ungrammatical sentences read slower than grammatical ones. Moreover, lure-match effects were observed for subject-verb agreement, suggesting that I did not accidentally select participants immune to such effects. Second, I ruled out potential systematic confounds. Controlling for the composition of the stimuli set (in particular, the presence of QPs in it) and characteristics of the experimenter ((non-)nativeness) did not have an effect on the magnitude of lure-match effects.

It is tempting to conclude that the observed discrepancies between my and previous results are likely due to purely statistical reasons. As I have mentioned in Chapter 4, it is well known that in low-powered studies statistically significant effects are more likely to be overestimates (Gelman and Carlin, 2014; Open Science Collaboration, 2015). The original studies by PP2017 that I tried to replicate had sample sizes of 24 and 30 people, which is rather low even by psycholinguistics standards.
Even my studies, which had doubled sample sizes, still failed to provide robust statistical evidence. Only in the exploratory analysis on the pooled data, comprising the data from 81 participants, did some lure-match effects reach significance. On the other hand, S2017’s experiments may speak against such an interpretation: he used sample sizes comparable to mine and consistently found statistical support for lure-match effects. While in a low-powered setting this could be attributed to chance alone, the consistency with which lure-match effects receive statistical support in S2017 and fail to receive it in my experiments makes me think that additional factors may be at play. I could think of several directions which could be investigated, although I do not have strong hypotheses about how the differences in these parameters would have affected the inferences.

One potentially important factor is that both PP2017 and S2017¹ sentences contained gaps or pronominal elements in the spillover region: e.g. “The flustered nurse complained that the elderly veterans considered herself to be very attractive and made suggestive comments” or “The broken zipper that the skilled tailors tried to fix pinched themselves repeatedly on their fingers.” It is possible that people could preview the spillover regions with their parafoveal vision, or that the interpretation of the reflexive was still on-going while their gaze had already shifted to the spillover. In these cases the presence of an element depending on the reflexive’s interpretation could affect the processing. For example, null or overt pronouns might have provoked additional retrievals of the lures and artificially increased their accessibility in S2017’s experiments.

¹ To the degree that I can see: I only had access to the materials they used in their Experiments 1 and 3.
Alternatively, given that the interpretation of the reflexive would influence the interpretation of some other element, people might engage in additional checks of whether the antecedent they chose is permissible by the grammar. In this case, the intrusion effects might be smaller or absent altogether.

The first alternative might receive some support from the data. In my Experiment 2, 12 out of 24 items had an empty element in the spillover, while in Parker and Phillips (2017, Exp.2), which used similar stimuli, only 4 out of 36 items did. In my Exp.4 I borrowed a subset of NP items from Sloggett (2017, Exp.1). In that experiment, 14 out of 48 items (i.e. a little under a third) had an empty element in the spillover. It turns out that I borrowed 9 of such items into my experiment. Since I had fewer NP stimuli (24), those 9 items accounted for more than a third of them. It could be that the reduction of the intrusion effect in my case was strong enough to make the substantial evidence for it disappear, while Sloggett (2017, Exp.1) was affected less due to a higher proportion of stimuli without an empty element in the spillover. It would be trivial to check whether this is the case if I had the original data from Sloggett (2017): in that case, I could look at whether there are differences in the effect size between the stimuli I did and did not select. This account would fail to explain the absence of evidence for intrusion in Exp.3: in constructing the stimuli for it, I made sure that the spillover did not contain any empty elements. However, these results may be independently explained by the fact that I used a mix of communication and other types of verbs in the materials for Exp.3.

Another possibility could be that the gender stereotypes on the nouns I used were weaker than on the nouns in the PP2017 and S2017 experiments. It could lead to them being worse lures and thus producing on average smaller lure-match effects.
While possible in principle, I think this explanation is a non-starter: I failed to observe statistically robust effects in two direct replications of PP2017, so the strength of the gender stereotype alone cannot be the issue. In other experiments I either directly borrowed a subset of PP2017 and S2017 materials, or was heavily influenced by their materials in creating mine.

It is possible that subtle differences in the experimental procedure might have changed the results. As Hammerly et al. (draft.april.2018) show, the change in wording from “two thirds of the items are ungrammatical” to “the majority of the items are ungrammatical” apparently was enough to change the observed agreement attraction effects. Such biases do not have to be restricted to the instructions and may involve other factors: how often participants take breaks, how often recalibration is performed, how the experimenter interacts with the participants before the experiment starts etc., and may be hard to catch.

Overall, I conclude that while low statistical power may be an issue, unrecognized biases have likely affected my results. It is hard to make stronger conclusions in the absence of formal or simulation-based power analysis, but I suggest that since lure-match effects may apparently be hard to detect in smaller samples, sample sizes bigger than 40-50 participants are desirable in order to find reliable evidence for interference when looking at reflexive resolution.

5.1.3.2 Timing of the effect

The time course of the lure-match effects appears to be variable across the studies. Some experiments (Parker and Phillips (2017) Exp.1,3; Sloggett (2017) Exp.1b and 4b) show detectable lure-match effects as early as in first-pass and regression path at the critical region.
Some others do not show reliable effects until later - re-read and total times at the critical region (Parker and Phillips (2017) Exp.2, Sloggett (2017) Exp.2b), or even at spillover in regression path (Sloggett, 2017, Exp.3b). My experiments rather fall in the second group: numerically, the lure-match effects were biggest in re-read or total times at the critical region, and the only instance of a significant main effect of lure match (Exp.1) was detected in the total times at spillover. What is the reason behind such variability?

The first possibility is that it is due to some systematic differences in the stimuli set. However, the evidence does not support this hypothesis: in multiple cases the exact same or a very similar stimuli set produced variable results. The best example is PP2017 Exp.3, where lure-match effects were observed as early as in first pass at the critical region, and my two replications (Exp.5 and 6), where the magnitude of the effects was biggest in late eye-tracking measures. These studies used the exact same set of stimuli. PP2017 Exp.2 and my replication in Exp.1 had similar variability, with Exp.1 using a subset of the original stimuli. Finally, S2017 observed such variability in his Exp.3b and 4b, despite using the same critical stimuli.

It is possible that individual characteristics of the participants lead to the variability in the timing of the lure-match effects. Cunnings and Felser (2013) investigated the effect of working memory on lure-match effects in reflexive resolution. They only considered 1-target mismatch configurations, and had rather small sample sizes (32 participants per experiment, in each case split into two groups depending on their working memory capacity), so the results may not be very reliable. For what it’s worth, they report differential effects depending on the working memory span: the low-span group was faster to detect the mismatch with the target.
The effect first became significant in first fixation duration at the reflexive for the low-span group and in re-read times for the high-span group².

² The authors suggest that this may reflect the application of a least effort strategy by low-span readers, in which they simply select the closest noun as the antecedent instead of fully processing the dependency.

Finally, one could think that the lure-match effects observed in early measures are artifactual. As I have just discussed, for all experiments where lure-match effects were observed in early measures there is a related experiment where they were not. On the other hand, in the late measures lure-match effects are observed more consistently. Such a pattern may indicate that lure-match effects are, in fact, a reflection of a repair strategy. If this is the case, that would constitute evidence against the PP2017 account: if a repair is initiated, the original resolution attempt must have failed; it would only fail if the parser correctly accessed the target; therefore we have to assume that structural information is able to strictly rule out ungrammatical antecedents at least during initial resolution attempts (cf. the “defeasible filter” hypothesis of Sturt (2003)).

Several arguments can be made against such a possibility. As Vasishth et al. (2012) note, effects which become apparent in late measures do not necessarily stem from late processes; instead, they might stem from a process which starts early but unfolds slowly enough to only become apparent later. Thus, simply observing some effect emerging in late measures does not allow us to uniquely attribute it to repair processes. Further, S2017 provides three arguments against the repair account. First, he notes that some experiments do show early lure-match effects.
Second, S2017 discusses the results of his interpretation study, in which people appeared to choose the non-local interpretation as frequently as in 30% of the cases even when the sentences were grammatical (i.e. the target fully matched the reflexive in features). And third, he notes that reflexives with clearly non-local interpretations are sometimes observed in natural production³.

³ While these arguments would help the claim I am making by ruling out the repair explanation of lure-match effects, I tend to be somewhat skeptical about them. As I have just noted, lure-match effects in early measures appear to be less reliable than late effects and might be artifactual. I discuss my concerns about the interpretation of these results in section 5.2. Finally, the third argument is potentially the strongest one, but it assumes that lure-match effects in comprehension and logophoric use of reflexives in production rely on the same mechanisms, which is not a given.

Overall, the tendency of the lure-match effects to be more prominent in late eye-tracking measures is interesting, but the evidence is insufficient for firmer conclusions. While it could potentially be problematic for the PP2017 account, further studies are necessary to decide whether the tendency does, indeed, exist, and if so, what its causes are.

5.1.4 Choosing between the two accounts

Based on the evidence I have discussed above, I suggest that neither the PP2017 nor the S2017 account is completely supported. On the one hand, my findings from Experiment 1 suggest that sub-command binding may not be a good explanation for lure-match effects from non-c-commanding lures. On the other hand, the results from Experiments 2-4 suggest that the parser does faithfully follow an approximate version of c-command to categorically rule out inaccessible QP antecedents.

This interpretation goes against grammatical models, such as S2017: the parser does appear to perform operations which would not be licensed by the grammar.
However, it also goes against “structure-defeasible” memory models such as PP2017: structural information appears to accurately guide the retrieval at least in some cases.

I suggest that a memory type of account is still a better explanation for the available data. Two main empirical facts supporting this conclusion are a) the presence of lure-match effects from non-c-commanding lures embedded in an animate subject; b) differential lure-match effects from non-c-commanding QP and NP lures in the same configurations. The S2017 sub-command hypothesis would not be able to explain them, since it only predicts interference from the lures embedded inside an RC with an inanimate head. The same goes for the logophoric explanation of the sub-command facts by Charnavel and Huang (2018).

But let us imagine that one of those explanations could account for lure-match effects from RC-lures, perhaps because in English sub-command binding is possible regardless of the animacy of the (head of the) embedding NP. Even in this case we would not be able to explain why non-c-commanding QP lures do not cause lure-match effects: as I showed in Chapter 2, in Chinese sub-commanding QPs can serve as antecedents for “ziji”. We could build another layer of speculation: suppose that in English QPs are not good as non-local antecedents for reflexives, as Postal (2006) suggests. But now we would not be able to explain why c-commanding QPs do provoke lure-match effects.

Let us build yet another counterfactual. Suppose that the null operator mediating logophoric interpretation, OPlog, is located in the left periphery of all clauses, and not only complement ones, as S2017 suggests. I think it would still fail to capture interference from the lures embedded under an animate noun: arguably, the embedding noun is more prominent in the discourse and is more likely to control OPlog reference.
Thus, even if OPlog is occasionally misretrieved, as S2017 suggests, the resolution would still pick the matrix subject, and not the embedded lure, as the antecedent.

The OPlog account would also not be able to capture the contrast between c-commanding and non-c-commanding QP lures. Although S2017 is not very specific about the mechanism used to determine the reference of OPlog, he speculates that it should be similar to the one used by regular pronouns looking for their antecedents in the discourse (Sloggett, 2017, p. 140). This remark suggests that OPlog resolution relies on coreference. Since QPs may only bind, they would not be able to serve as antecedents for OPlog. This conclusion would be in line with the observations of Postal (2006).

Thus, it appears that it is hard to explain my findings in terms of some grammatical strategy, and a memory model might be preferred. But we would have to choose a model which is only partially “structure-defeasible”: instead of assuming that all structural information is weighted uniformly and can be out-competed, we would have to assume that the parser differentiates between different kinds of structural information, with some being able to act as gating features.

How could this suggestion be implemented in a memory model? Under the usual assumptions, cue-based models do not make it easy for a cue to categorically block access to some memory item: even if a feature on an item mismatches the retrieval cues, other features of the item may match them well enough to still make the item retrievable. Kush et al. (2015) discuss two ways in which a feature could act as a gating feature, categorically restricting access to matching items only. One could weight the gating feature extremely highly, so that the items that match it receive such a high activation boost that no other item can compete with them. Alternatively, one can change the way cues are combined to determine an item’s activation.
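The difference between these options can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the model evaluated in this thesis: the cue names, the match strengths, the weight of 5.0, and the functions `additive`, `weighted_product` and `gated_additive` are all hypothetical, noise is ignored, and "changing the way cues are combined" is read here as taking a weighted product. The third function shows a mixed scheme in which a binary gate multiplies an additive sum over the non-gating cues.

```python
import math

# Match strengths S_ji in [0, 1] for three hypothetical cues, in the order
# (gender, number, accessible). All values are made up for illustration.
TARGET = (0.1, 0.1, 1.0)  # structurally accessible, poor morphological match
LURE = (1.0, 1.0, 0.0)    # good morphological match, structurally inaccessible


def additive(match, weights, base=0.0):
    """Weighted sum of cue matches: B_i + sum_j w_j * S_ji."""
    return base + sum(w * s for w, s in zip(weights, match))


def weighted_product(match, weights, base=0.0):
    """Weighted product of cue matches: B_i + prod_j S_ji ** w_j."""
    return base + math.prod(s ** w for w, s in zip(weights, match))


def gated_additive(match, weights, gating, base=0.0):
    """Binary gate (1 only if every gating cue matches) times an additive
    sum over the remaining, non-gating cues."""
    gate = 1.0 if all(s > 0 for s, g in zip(match, gating) if g) else 0.0
    graded = sum(w * s for w, s, g in zip(weights, match, gating) if not g)
    return base + gate * graded


def winner(score):
    """Return which of the two items has the higher activation (no noise)."""
    return "target" if score(TARGET) > score(LURE) else "lure"


EQUAL = (1.0, 1.0, 1.0)        # all cues weighted uniformly
SUPER = (1.0, 1.0, 5.0)        # 'accessible' super-weighted (5.0 is arbitrary)
GATING = (False, False, True)  # only 'accessible' acts as a gate

results = {
    "additive, equal weights": winner(lambda m: additive(m, EQUAL)),
    "additive, super-weighted": winner(lambda m: additive(m, SUPER)),
    "weighted product": winner(lambda m: weighted_product(m, EQUAL)),
    "gated additive": winner(lambda m: gated_additive(m, EQUAL, GATING)),
}
for scheme, item in results.items():
    print(f"{scheme}: {item} has the higher activation")
```

With these toy numbers, only the uniformly weighted additive scheme lets the inaccessible lure out-compete the target. The super-weighted scheme differs from the other two in that a sufficiently strong match on the remaining cues could, in principle, still overturn it.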
Most often it is assumed that the activation is determined by a weighted sum of the activation values provided by the matching cues, with the weights corresponding to the cues' importance and the activation values to the strength of association between the item and the cue in question. However, one could assume that a weighted product is used instead. In this case, if a single feature mismatches the set of retrieval cues, it would make the activation of the item very small or eliminate it completely, thus preventing its retrieval in the majority of cases [4]. Kush et al. (2015) mention that it is impossible to tell these possibilities apart from their data.

[4] Presumably, such an item could still be retrieved occasionally, if the random noise happens to push its activation high enough to outcompete other items in memory.

I think that the data in the thesis may favor the first hypothesis: super-weighting of the gating cues. In order to explain the fact that a mismatch on locality information may be out-weighed by other factors, one would have to conclude that locality cue(s) contribute to an item's activation in an additive fashion. Now, the accessible cue has to be able to categorically rule some items out of consideration. In principle it is possible that it combines multiplicatively with other cues, in the manner described by Eq. 5.1, where $GCA_i$ [5] is the binary indicator of whether all gating cues [6] allow the $i$-th item to be retrieved (i.e. whether the $i$-th item matches these cues). However, this would create a situation where different cues rely on different combinatorics rules, and without strong evidence for such mixed schemas, it seems preferable to keep cue combinatorics rules uniform within a model. Since we have to assume an additive schema to capture locality facts, accessible would have to combine additively as well. In this case, super-weighting is the only way of making it act as the gating feature. This hypothesis could in principle be tested empirically: if super-weighting is, indeed, the underlying mechanism, we might find ways to boost the activations from other features highly enough to out-weigh even the super-weighted cues. E.g. we might try to make the match with the true target even stronger, and/or put the lure in a linguistically prominent position.

\[
A_i = B_i + GCA_i \times \Bigl( \sum_j w_j S_{ji} \Bigr) \tag{5.1}
\]

[5] Standing for “gating cues activation”.

[6] It seems possible that if the system relies on gating cues at all, there is more than one such cue. In this case, we need to know how these cues combine. The schema I suggest assumes that if even a single gating cue fails to match, the item is ruled out. This could be achieved if zero activation is spread from a mismatching gating cue, and the values of all gating cues are multiplied to produce the final $GCA_i$ value. More involved combinatorics schemas could in principle be produced, so that, say, an item has to mismatch two gating cues together in order to be ruled out. Also notice that when several items do match a gating cue, the activation it spreads to these items will be less than 1 due to the fan effect. If the system uses multiple gating cues at once, and it so happens that each gating cue is matched by multiple items, the items’ activation values can be driven very low, since the product of several values smaller than 1 may indeed be small.

Concluding, I have to note that several empirical points may be taken as speaking against my suggested account, although very speculatively. First, I observed the tendency for lure-match effects to be more readily detectable in late eye-tracking measures. So far it is simply a suggestive observation, but if confirmed experimentally, it could indicate that lure-match effects stem from repair processes. As discussed, this would imply that at least at some point structural information does categorically constrain the parser’s actions. Second, it appears that the magnitude of lure-match effects from NP lures is bigger when the lure c-commands the reflexive.
This is true both in PP2017 and in my replications: although the effect sizes we observe are overall reduced compared to PP2017, the difference between c-commanding and non-c-commanding lures holds. This may suggest that non-c-commanding lures are less accessible in general, with non-c-commanding QP lures being completely inaccessible and non-c-commanding NP lures being relatively inaccessible. If this intuition is correct, it will weaken my conclusions about the use of the accessible feature: accounts relying on it do not predict any access difficulty for +accessible phrases regardless of their c-command relations. Third, in both Experiment 1 and Experiment 2 I did observe an illusion of ungrammaticality in agreement attraction stimuli. This is problematic for cue-based models, of which the PP2017 account is an instance. This is also in line with Hammerly et al. (draft.april.2018), who suggest a “structure-strong” account for agreement attraction. As I have discussed in Chapter 1, so far their account is not fully developed and, for example, is not easily extendable to eye-tracking data. But such data points support their model, and if it is better developed, it could constitute a strong competitor to cue-based models. Finally, I note again that my conclusions necessarily remain tentative, being based almost entirely on numerical patterns in RTs.

5.2 Future directions

Given that most of my conclusions are based on numerical patterns in RTs, the most important next step is to conduct adequately powered confirmatory experiments. The numerical estimates of the lure-match effect magnitude I obtained in this study [7] could be used for power calculations. The most important thing is to further investigate the interference from non-c-commanding lures: this would improve the quality of evidence for the only pattern not covered by the S2017 account.
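As a rough illustration of what such a power calculation could look like, the sketch below estimates power by simulation. It is a deliberately simplified stand-in for the mixed-effects analyses used in this thesis: each participant contributes one lure-match difference score, the test is a normal-approximation test on the mean difference, and the effect size (30 ms) and standard deviation of the differences (150 ms) are hypothetical placeholders, not estimates from my data. The function name `simulated_power` is likewise an assumption of the sketch.

```python
import random
from statistics import NormalDist, mean, stdev


def simulated_power(effect_ms, sd_diff_ms, n_participants,
                    n_sims=2000, alpha=0.05, seed=1):
    """Proportion of simulated experiments in which the mean per-participant
    difference is significantly different from zero (two-sided test,
    normal approximation to the t statistic)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(n_sims):
        # One difference score (lure mismatch minus lure match) per participant.
        diffs = [rng.gauss(effect_ms, sd_diff_ms) for _ in range(n_participants)]
        se = stdev(diffs) / n_participants ** 0.5
        if abs(mean(diffs) / se) > z_crit:
            hits += 1
    return hits / n_sims


# Hypothetical inputs: a 30 ms effect with a 150 ms SD of difference scores.
for n in (20, 40, 80, 160):
    print(n, simulated_power(effect_ms=30, sd_diff_ms=150, n_participants=n))
```

A realistic analysis would instead simulate data from the fitted mixed-effects model, using the estimated random-effects structure for participants and items, and would refit that model to each simulated data set.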
[7] As well as other estimates, such as random effects and variance-covariance matrices for participants and items.

Another possible line of investigation is to look for lure-match effects from lures in a possessor position, e.g. “The librarian’s brothers praised herself during the meeting”. The S2017 account does not predict any interference: the subject NP is not embedded in a complement clause, so there is no null operator which could mediate the dependency between “librarian” and “herself”. Even if such an operator were present, we do not think “librarian” is prominent enough to control its reference. The sub-command explanation would not work either: the configuration is of the right type, but the embedding NP is animate and thus should block sub-command binding. The PP2017 account, on the other hand, would predict lure-match effects from “librarian” unless c-command information is accurately used to guide the retrieval. Thus, in the suggested experiment the presence of an effect would speak for the PP2017 model, but the absence of the effect would be hard to interpret, since it could be compatible with both the S2017 and PP2017 accounts.

A third possible line of investigation would be to consider whether the mechanism behind lure-match effects in reflexive resolution is purely formal (e.g. just checking the morphological features) or whether it affects interpretation. The PP2017 account does not make strong predictions either way. On the other hand, the S2017 account necessarily predicts that interpretation is affected, since S2017 argues for a grammatical option of reflexive resolution. S2017 supports this point of view with the results of an interpretation study: he shows that when presented with sentences like “The librarian said that the schoolgirl misrepresented herself...” and asked a question like “Who was misrepresented at the meeting?
– The schoolgirl / The librarian”, people choose the answer compatible with non-local binding of the reflexive (“librarian” in this example) as frequently as 30% of the time. S2017 uses this finding to support his argument about the logophoric nature of interference effects in reflexive resolution: if people can choose the non-local option even in grammatical sentences, this choice has to be a grammatical possibility.

We have doubts about this claim for several reasons. First, question answers do not necessarily stem from the same processes which underlie reflexive resolution during sentence processing. Instead, they may come from some extra-grammatical strategy (e.g. if, upon seeing the question, people choose the answer based not on how the reference was resolved but on some other factors, such as familiarity or the answer being a more sensible choice). Second, according to Sloggett’s hypothesis, non-local reference is mediated by a silent logophoric operator, which is hypothesized not to carry any φ-features. Thus, if the overt local noun matches the reflexive in features, it will be a better match on at least two features, gender and number (and potentially person and/or animacy). If the S2017 results do indeed indicate that people retrieve the silent operator, it would mean that even a target which is the best match along multiple dimensions can be ignored during memory access as frequently as 30% of the time. It seems odd to assume that memory access is so inefficient: if it performs that poorly in perfect conditions, what would its performance be in real-life situations with potentially less certainty about what the correct outcome is? Third, in all of his acceptability judgment experiments people give rather low ratings [8] for sentences with target mismatch: even when the lure matches the reflexive, the ratings are lower than for grammatical sentences. This fact is hard to explain if, as S2017 argues, lure-match effects arise from a completely grammatical anaphora resolution strategy [9].

[8] To give an idea, target mismatch conditions received ratings in the range of 3-4, while target match conditions were all above 5 on a Likert scale from 1 to 7.

[9] I would like to thank Colin Phillips for this observation.

I suggest several ways of checking whether S2017’s interpretation results do indeed reflect the outcome of reflexive resolution during sentence reading or whether they are due to some extra-grammatical strategy. Perhaps the best option would be to use an eye-tracking study with interpretation questions. After collecting the data, one could separately analyze the conditions depending on the antecedent choice people made. If S2017 is correct, and RT patterns and interpretations stem from the same mechanism, we will only observe facilitatory interference in conditions where people chose the non-local antecedent. Second, one could turn to a visual world paradigm: if lures are indeed chosen as antecedents in real time, we should observe an increased number of looks to the picture representing the lure. Finally, one could make people’s task as easy as possible [10]. In the S2017 experiment, people had to read the sentence in a self-paced fashion before answering the question, i.e. they never saw the sentence in its entirety and had limited time to process it. If S2017 is correct, and non-local interpretations do stem from a grammatical interpretation option, these factors should not matter much and people will produce roughly the same interpretation patterns in easier conditions, e.g. when having unlimited time to look at the sentence. Otherwise, we would see a reduction in the number of non-local interpretations.

[10] I would like to thank Colin Phillips for this idea.
5.3 Conclusion

In this thesis, I attempted to distinguish between two types of real-time long-distance dependency resolution models: those which assume that structural information categorically rules out structurally inappropriate elements and those which do not. Specifically, I contrasted two recent instantiations of these models explaining lure-match effects in reflexive resolution. Parker and Phillips (2017) attributed those effects to a processing error, while Sloggett (2017) attributed them to a grammatical strategy. The evidence I obtained in six experiments suggested that neither account is entirely correct. Contrary to what S2017 claims, there seem to be cases (namely, lure-match effects from non-c-commanding lures) which are best explained in terms of a processing error. Contrary to what PP2017 say, structural information appears to come in different flavors: some kinds of it (e.g. locality) may not always succeed in categorically ruling out inappropriate antecedents, while others (e.g. some approximation of c-command) may do so. I suggested that the best explanation for this evidence is a mixed memory model, in which some structural cues (e.g. locality) are weighted highly but can be outcompeted relatively easily, and other structural cues (e.g. c-command) are super-weighted so that it is hard or impossible for other factors to out-weigh them.

These studies have also shown that lure-match effects may not be as reliable and big as previous studies suggest. In fact, most of my conclusions are based on numerical patterns in the data. I failed to find statistical support for lure-match effects, despite having sample sizes comparable to or bigger than in the previous studies.
I do not think this indicates that the effects are bogus: they have been replicated multiple times, the numerical patterns I observe in my studies go in the expected direction, and an exploratory analysis of pooled data from about 80 participants suggests that such a sample size is enough for the statistics to support even the reduced effects. Rather, these results may indicate that the lack of statistically significant effects is due to low statistical power or unidentified subtle biases in experimental materials and procedures. I stress the necessity of conducting a priori power analyses; the data that I have collected can be used to set expectations for such analyses. I would certainly welcome higher-powered replications of my own studies which could verify the conclusions I make.

Appendices

Appendix A: Mean reaction times in PP2017 and replication (PP2017 analysis)

Notice that the standard errors reported here for the PP2017 data are slightly smaller than those reported in the original paper. The difference comes from the fact that we calculated them taking into account subject variability, by using the method suggested in Cousineau (2005) and Morey (2008) [1]. In contrast, as far as we can tell, PP2017 used the standard formula for calculating SEs: $SE = sd / \sqrt{n}$.

[1] The code implementing this procedure was taken from: http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/.
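To make the difference between the two kinds of standard error concrete, here is a minimal Python restatement of the normalization idea. It is a sketch, not the R code referenced above: the function names `conventional_se` and `within_subject_se` and the toy data are hypothetical, and it assumes one mean per participant per condition with no missing cells.

```python
from math import sqrt
from statistics import mean, stdev


def conventional_se(values):
    """Standard formula: sd / sqrt(n)."""
    return stdev(values) / sqrt(len(values))


def within_subject_se(data):
    """Within-subject SE per condition.

    data maps each participant to a list of condition means, with the
    same condition order for every participant.
    """
    subjects = list(data)
    n_cond = len(data[subjects[0]])
    grand_mean = mean(v for s in subjects for v in data[s])
    # Cousineau (2005): subtract each participant's own mean and add the
    # grand mean, which removes between-participant differences in level.
    normalized = {
        s: [v - mean(data[s]) + grand_mean for v in data[s]] for s in subjects
    }
    # Morey (2008): rescale to correct the bias the normalization introduces.
    correction = sqrt(n_cond / (n_cond - 1))
    return [
        correction * conventional_se([normalized[s][c] for s in subjects])
        for c in range(n_cond)
    ]


# Toy data: participants differ a lot in overall speed, but little in the
# size of the condition difference.
toy = {"s1": [300, 340], "s2": [500, 520], "s3": [700, 760]}
print("conventional:", conventional_se([v[0] for v in toy.values()]))
print("within-subject:", within_subject_se(toy)[0])
```

Because the between-participant differences in overall reading speed are removed before the SE is computed, the corrected values are typically smaller than the conventional ones, which is the pattern described above.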
Mean RTs in PP2017

Measure  Cond       Precrit         Crit            Spillover
fp       tm, lm     947.4 (46.68)   196.7 (10.05)   166.4 (12.74)
fp       tm, lmm    927.5 (39.73)   223 (10.93)     154.8 (15.16)
fp       tomm, lm   805 (33.65)     224.6 (11.29)   166.2 (22.25)
fp       tomm, lmm  903.1 (35.85)   222.5 (12.08)   165.2 (15.33)
fp       ttmm, lm   912.8 (33.82)   184.7 (10.91)   156.4 (14.06)
fp       ttmm, lmm  881.7 (33.92)   289.7 (16.44)   129.1 (13.42)
rb       tm, lm     1162 (42.82)    200.1 (10.3)    195.7 (15.84)
rb       tm, lmm    1108 (36.39)    228.3 (11.22)   184.5 (22.23)
rb       tomm, lm   1025 (33.64)    244.8 (12.42)   214.7 (29.35)
rb       tomm, lmm  1088 (33.93)    233.7 (12.46)   195.3 (19.23)
rb       ttmm, lm   1117 (35.72)    190.6 (11.76)   191.3 (18.19)
rb       ttmm, lmm  1062 (33.93)    329.3 (18.94)   178.1 (22.75)
rp       tm, lm     1295 (67.08)    221 (14.8)      311.6 (41.56)
rp       tm, lmm    1203 (50.55)    250.9 (13.29)   282.3 (43.47)
rp       tomm, lm   1199 (62.91)    303.8 (28.14)   338.7 (52.22)
rp       tomm, lmm  1206 (53.94)    316.5 (38.28)   341.5 (67.3)
rp       ttmm, lm   1188 (44.37)    229.2 (22.99)   318.1 (51.11)
rp       ttmm, lmm  1199 (53.05)    456.5 (64.29)   405.2 (91.27)
rr       tm, lm     872.5 (68.21)   223.1 (22.04)   265 (30.71)
rr       tm, lmm    798.7 (56.6)    224 (22.17)     253.4 (30.72)
rr       tomm, lm   880.6 (78.36)   318.1 (26.77)   265.8 (33.63)
rr       tomm, lmm  1044 (74.39)    338.7 (27.66)   233.6 (24.64)
rr       ttmm, lm   925.1 (79.73)   212.2 (22.67)   286.9 (33.07)
rr       ttmm, lmm  1210 (93.32)    445.5 (45.64)   288 (31.86)

Table A.1: Mean RTs in PP2017. Reading measures: fp - first pass, rb - right-bound, rp - regression path, rr - re-read.
Conditions: tm/tomm/ttmm - target match/one-mismatch/two-mismatch; lm/lmm - lure match/mismatch. Values in parentheses represent the standard error of the mean, corrected for within-subjects variability (Cousineau, 2005; Morey, 2008).

Mean RTs in the replication of PP2017

Measure  Cond       Precrit         Crit            Spillover
fp       tm, lm     946.7 (25.82)   205 (7.73)      226.6 (11.48)
fp       tm, lmm    1004 (28.14)    214.8 (8.47)    204.6 (10.43)
fp       tomm, lm   923.1 (24.3)    210.7 (8.8)     222.3 (12.01)
fp       tomm, lmm  918.1 (23.91)   237.8 (9.67)    229.1 (11.66)
fp       ttmm, lm   927.1 (26.78)   225.1 (8.43)    210.6 (11.13)
fp       ttmm, lmm  955.5 (24.76)   243.5 (10.07)   220.2 (10.98)
rb       tm, lm     1124 (24.23)    220.1 (8.56)    255 (13.65)
rb       tm, lmm    1152 (27.64)    229.4 (9.3)     231.8 (13.74)
rb       tomm, lm   1064 (23.33)    220.7 (9.5)     277.1 (16.97)
rb       tomm, lmm  1066 (22.51)    258.1 (11.03)   263.9 (14.48)
rb       ttmm, lm   1090 (25.18)    249.5 (10.2)    262.6 (15.02)
rb       ttmm, lmm  1089 (24.01)    275.7 (12.56)   281.7 (15.1)
rp       tm, lm     1203 (32.16)    262.4 (13.95)   412.8 (43.15)
rp       tm, lmm    1246 (35.75)    273.6 (17.24)   329.5 (32.07)
rp       tomm, lm   1136 (27.78)    262.5 (17.34)   429.1 (40.17)
rp       tomm, lmm  1143 (32.61)    348.3 (32.09)   442.8 (41.18)
rp       ttmm, lm   1169 (32.19)    353.2 (30.91)   441.6 (43.06)
rp       ttmm, lmm  1146 (28.6)     413 (36.68)     501.3 (42.68)
rr       tm, lm     810.1 (49.34)   211 (17.24)     241 (16.41)
rr       tm, lmm    779.9 (46.35)   205.8 (15.09)   257.9 (15.62)
rr       tomm, lm   754.3 (47.66)   260.1 (17.78)   315.2 (23.77)
rr       tomm, lmm  921.7 (51.72)   292.4 (18.11)   277.3 (18.65)
rr       ttmm, lm   818.9 (58.33)   274.4 (20.47)   280 (21.7)
rr       ttmm, lmm  995 (51.92)     366.1 (24.45)   312.2 (17.63)

Table A.2: Mean RTs in the replication of PP2017. Reading measures: fp - first pass, rb - right-bound, rp - regression path, rr - re-read.
Conditions: tm/tomm/ttmm - target match/one-mismatch/two-mismatch; lm/lmm - lure match/mismatch. Values in parentheses represent the standard error of the mean, corrected for within-subjects variability (Cousineau, 2005; Morey, 2008).

Appendix B: Mean RTs

Measure  Target match  Lure match  Critical   Spillover
fp       match         match       318 (20)   295 (14)
fp       match         mismatch    416 (41)   290 (16)
fp       mismatch      match       345 (18)   313 (16)
fp       mismatch      mismatch    395 (31)   324 (23)
rp       match         match       386 (31)   422 (53)
rp       match         mismatch    511 (53)   486 (60)
rp       mismatch      match       389 (26)   455 (47)
rp       mismatch      mismatch    639 (81)   510 (47)
tt       match         match       455 (38)   375 (23)
tt       match         mismatch    509 (48)   377 (30)
tt       mismatch      match       520 (48)   366 (28)
tt       mismatch      mismatch    642 (58)   404 (31)

Table B.1: Experiment 2 means for agreement stimuli. Numbers in parentheses are standard errors of the mean, corrected for within-subjects variability (Cousineau, 2005; Morey, 2008).

Measure  Lure type  Lure match  Critical   Spillover
fp       NP         match       297 (14)   300 (15)
fp       NP         mismatch    283 (14)   276 (13)
fp       QP         match       286 (12)   309 (15)
fp       QP         mismatch    262 (11)   295 (12)
rp       NP         match       425 (32)   752 (104)
rp       NP         mismatch    414 (36)   758 (91)
rp       QP         match       578 (64)   782 (91)
rp       QP         mismatch    509 (71)   683 (83)
tt       NP         match       533 (33)   512 (33)
tt       NP         mismatch    516 (30)   461 (26)
tt       QP         match       540 (29)   500 (25)
tt       QP         mismatch    524 (30)   489 (33)

Table B.2: Experiment 2 means for reflexives stimuli. Numbers in parentheses are standard errors of the mean, corrected for within-subjects variability (Cousineau, 2005; Morey, 2008).

Measure  Target match  Lure match  Critical   Spillover
fp       match         match       420 (12)   361 (15)
fp       match         mismatch    422 (14)   374 (16)
fp       mismatch      match       505 (20)   362 (14)
fp       mismatch      mismatch    486 (18)   368 (13)
rp       match         match       550 (35)   576 (52)
rp       match         mismatch    612 (38)   615 (43)
rp       mismatch      match       864 (50)   830 (67)
rp       mismatch      mismatch    640 (29)   668 (69)
tt       match         match       741 (34)   625 (31)
tt       match         mismatch    754 (31)   602 (26)
tt       mismatch      match       974 (39)   691 (32)
tt       mismatch      mismatch    839 (33)   595 (27)

Table B.3: Experiment 3 means for agreement stimuli.
Numbers in parentheses are standard errors of the mean, corrected for within-subjects variability (Cousineau, 2005; Morey, 2008).

Measure  Lure type  Lure match  Critical   Spillover
fp       NP         match       368 (13)   515 (21)
fp       NP         mismatch    361 (15)   485 (20)
fp       QP         match       353 (15)   535 (21)
fp       QP         mismatch    356 (15)   496 (21)
rp       NP         match       549 (34)   990 (66)
rp       NP         mismatch    597 (63)   1124 (92)
rp       QP         match       659 (55)   961 (72)
rp       QP         mismatch    616 (63)   1004 (81)
tt       NP         match       696 (30)   1002 (42)
tt       NP         mismatch    724 (35)   1056 (43)
tt       QP         match       683 (30)   1056 (46)
tt       QP         mismatch    726 (35)   1025 (41)

Table B.4: Experiment 3 means for reflexives stimuli. Numbers in parentheses are standard errors of the mean, corrected for within-subjects variability (Cousineau, 2005; Morey, 2008).

Measure  Target match  Lure match  Critical   Spillover
NP lures
fp       match         match       347 (13)   435 (16)
fp       match         mismatch    320 (11)   449 (20)
fp       mismatch      match       345 (14)   431 (15)
fp       mismatch      mismatch    371 (15)   450 (18)
rp       match         match       460 (28)   600 (39)
rp       match         mismatch    480 (30)   668 (46)
rp       mismatch      match       486 (27)   736 (48)
rp       mismatch      mismatch    576 (41)   892 (72)
tt       match         match       571 (22)   702 (28)
tt       match         mismatch    574 (23)   717 (27)
tt       mismatch      match       645 (28)   843 (36)
tt       mismatch      mismatch    694 (30)   847 (31)
QP lures
fp       match         match       291 (10)   438 (14)
fp       match         mismatch    311 (11)   439 (15)
fp       mismatch      match       327 (11)   435 (17)
fp       mismatch      mismatch    344 (13)   474 (18)
rp       match         match       430 (31)   647 (52)
rp       match         mismatch    420 (35)   644 (43)
rp       mismatch      match       474 (33)   921 (63)
rp       mismatch      mismatch    526 (36)   955 (70)
tt       match         match       515 (22)   767 (30)
tt       match         mismatch    500 (20)   719 (25)
tt       mismatch      match       624 (27)   829 (32)
tt       mismatch      mismatch    669 (27)   906 (35)

Table B.5: Experiment 4 mean reaction times. Reading measures: fp - first pass, rp - regression path, tt - total times. Numbers in parentheses are standard errors of the mean, corrected for within-subjects variability (Cousineau, 2005; Morey, 2008).

Appendix C: Model coefficients
                                      Critical               Spillover
Measure  Effect                       Estimate      t value  Estimate      t value
Agreement
fp       (Intercept)                  5.91 (0.05)   129.23   5.71 (0.04)   141.49
fp       Target match                 0.18 (0.04)   4.42     -0.06 (0.03)  -1.8
fp       Lure match                   0.03 (0.04)   0.72     -0.03 (0.04)  -0.9
fp       Target match x Lure match    0.02 (0.08)   0.28     -0.04 (0.06)  -0.57
rp       (Intercept)                  6.30 (0.06)   104.16   6.13 (0.05)   112.58
rp       Target match                 0.29 (0.05)   5.58     0.20 (0.06)   3.46
rp       Lure match                   0.13 (0.05)   2.7      0.11 (0.06)   1.86
rp       Target match x Lure match    0.10 (0.09)   1.13     0.32 (0.10)   3.22
tt       (Intercept)                  6.60 (0.07)   95.08    6.31 (0.07)   86.88
tt       Target match                 0.28 (0.05)   6.06     -0.06 (0.04)  -1.47
tt       Lure match                   0.13 (0.03)   3.93     0.05 (0.04)   1.21
tt       Target match x Lure match    0.01 (0.08)   0.18     0.04 (0.08)   0.47
Reflexives
fp       (Intercept)                  5.64 (0.03)   196.79   5.85 (0.05)   116.57
fp       Target animacy               -0.04 (0.04)  -0.82    -0.05 (0.06)  -0.8
fp       Lure match                   0.05 (0.04)   1.36     -0.01 (0.04)  -0.14
fp       Target animacy x Lure match  -0.09 (0.07)  -1.42    -0.00 (0.08)  -0.05
rp       (Intercept)                  6.00 (0.05)   122.01   6.44 (0.09)   71.99
rp       Target animacy               -0.06 (0.07)  -0.86    -0.14 (0.15)  -0.95
rp       Lure match                   0.05 (0.05)   0.93     0.07 (0.07)   1.01
rp       Target animacy x Lure match  -0.10 (0.10)  -0.91    -0.06 (0.15)  -0.39
tt       (Intercept)                  6.42 (0.05)   119.22   6.56 (0.07)   94.72
tt       Target animacy               -0.11 (0.06)  -1.76    -0.22 (0.08)  -2.81
tt       Lure match                   0.08 (0.04)   1.77     0.07 (0.04)   1.61
tt       Target animacy x Lure match  -0.00 (0.08)  -0.01    0.05 (0.07)   0.65

Table C.1: Model coefficients for Experiment 1. Reading measures: fp - first pass, rp - regression path, tt - total times. Contrasts coding: “animate” = -0.5, “inanimate” = 0.5; “match” = -0.5, “mismatch” = 0.5. Statistically significant comparisons are highlighted: yellow - significant only before a Bonferroni correction, red - significant before and after Bonferroni correction (see text). Values in parentheses represent the standard error of the estimate.
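For readers less familiar with this contrast coding, the following sketch shows how coefficients under a ±0.5 coding relate to the four cell means of a 2 x 2 design. It is an idealized illustration only: it assumes a perfectly balanced design with no random effects, the function name `contrast_estimates` and the coefficient values are hypothetical, and reading the intercept as a log-millisecond value is my assumption about the scale of the estimates.

```python
import math


def contrast_estimates(cell_means):
    """Recover coefficients of y = b0 + b1*A + b2*B + b3*A*B from the four
    cell means of a balanced 2 x 2 design with A, B coded as -0.5 / +0.5."""
    mm = cell_means[(-0.5, -0.5)]
    mp = cell_means[(-0.5, 0.5)]
    pm = cell_means[(0.5, -0.5)]
    pp = cell_means[(0.5, 0.5)]
    return {
        "intercept": (mm + mp + pm + pp) / 4,       # grand mean
        "A": (pm + pp) / 2 - (mm + mp) / 2,          # main effect of A
        "B": (mp + pp) / 2 - (mm + pm) / 2,          # main effect of B
        "A:B": (pp - pm) - (mp - mm),                # difference of differences
    }


# Hypothetical coefficients, used only to generate cell means for the demo.
b0, b1, b2, b3 = 6.0, 0.2, 0.1, 0.3
cells = {
    (a, b): b0 + b1 * a + b2 * b + b3 * a * b
    for a in (-0.5, 0.5)
    for b in (-0.5, 0.5)
}
est = contrast_estimates(cells)
print(est)
# If the estimates are on a log-ms scale (an assumption), the intercept can
# be back-transformed to milliseconds:
print(round(math.exp(est["intercept"])), "ms")
```

Under this coding, each main-effect estimate is the difference between the two levels of a factor averaged over the other factor, and the intercept is the grand mean; in real, unbalanced data with random effects these identities hold only approximately.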
                                      Critical               Spillover
Measure  Effect                       Estimate      t value  Estimate      t value
Agreement
fp       (Intercept)                  5.73 (0.09)   64.27    5.62 (0.05)   108.3
fp       Target match                 0.08 (0.09)   0.93     0.06 (0.06)   1.09
fp       Lure match                   0.09 (0.07)   1.23     0.01 (0.07)   0.17
fp       Target match x Lure match    -0.04 (0.13)  -0.27    0.08 (0.12)   0.65
rp       (Intercept)                  5.89 (0.10)   61.64    5.89 (0.06)   94.91
rp       Target match                 0.12 (0.11)   1.13     0.06 (0.08)   0.73
rp       Lure match                   0.21 (0.08)   2.77     0.09 (0.11)   0.87
rp       Target match x Lure match    0.11 (0.19)   0.6      0.03 (0.16)   0.2
tt       (Intercept)                  6.01 (0.11)   56.35    5.77 (0.07)   77.87
tt       Target match                 0.17 (0.09)   1.92     0.03 (0.07)   0.44
tt       Lure match                   0.11 (0.07)   1.63     0.04 (0.06)   0.66
tt       Target match x Lure match    0.10 (0.12)   0.79     0.10 (0.13)   0.8
Reflexives
fp       (Intercept)                  5.53 (0.04)   142.43   5.55 (0.05)   118.83
fp       Lure match                   -0.06 (0.04)  -1.64    -0.06 (0.05)  -1.22
fp       Lure type                    -0.04 (0.05)  -0.92    0.04 (0.04)   1.01
fp       Lure match x Lure type       -0.04 (0.08)  -0.47    0.03 (0.08)   0.32
rp       (Intercept)                  5.85 (0.05)   111.20   6.11 (0.07)   85.43
rp       Lure match                   -0.08 (0.06)  -1.30    -0.07 (0.09)  -0.74
rp       Lure type                    0.10 (0.07)   1.47     -0.02 (0.08)  -0.21
rp       Lure match x Lure type       -0.08 (0.12)  -0.63    -0.17 (0.16)  -1.07
tt       (Intercept)                  6.07 (0.05)   115.34   5.97 (0.07)   91.59
tt       Lure match                   -0.05 (0.06)  -0.85    -0.09 (0.06)  -1.52
tt       Lure type                    0.02 (0.06)   0.41     0.01 (0.05)   0.12
tt       Lure match x Lure type       -0.03 (0.10)  -0.29    -0.04 (0.11)  -0.36

Table C.2: Model coefficients for Experiment 2. Reading measures: fp - first pass, rp - regression path, tt - total times. Contrasts coding: “match” = -0.5, “mismatch” = 0.5. Statistically significant comparisons are highlighted: yellow - significant only before a Bonferroni correction, red - significant before and after Bonferroni correction (see text). Values in parentheses represent the standard error of the estimate.
                                      Critical               Spillover
Measure  Effect                       Estimate      t value  Estimate      t value
fp       (Intercept)                  5.91 (0.05)   129.23   5.71 (0.04)   141.49
fp       Target match                 0.18 (0.04)   4.42     -0.06 (0.03)  -1.8
fp       Lure match                   0.03 (0.04)   0.72     -0.03 (0.04)  -0.9
fp       Target match x Lure match    0.02 (0.08)   0.28     -0.04 (0.06)  -0.57
rp       (Intercept)                  6.30 (0.06)   104.16   6.13 (0.05)   112.58
rp       Target match                 0.29 (0.05)   5.58     0.20 (0.06)   3.46
rp       Lure match                   0.13 (0.05)   2.7      0.11 (0.06)   1.86
rp       Target match x Lure match    0.10 (0.09)   1.13     0.32 (0.10)   3.22
tt       (Intercept)                  6.60 (0.07)   95.08    6.31 (0.07)   86.88
tt       Target match                 0.28 (0.05)   6.06     -0.06 (0.04)  -1.47
tt       Lure match                   0.13 (0.03)   3.93     0.05 (0.04)   1.21
tt       Target match x Lure match    0.01 (0.08)   0.18     0.04 (0.08)   0.47

fp       (Intercept)                  5.64 (0.03)   196.79   5.85 (0.05)   116.57
fp       Target animacy               -0.04 (0.04)  -0.82    -0.05 (0.06)  -0.8
fp       Lure match                   0.05 (0.04)   1.36     -0.01 (0.04)  -0.14
fp       Target animacy x Lure match  -0.09 (0.07)  -1.42    -0.00 (0.08)  -0.05
rp       (Intercept)                  6.00 (0.05)   122.01   6.44 (0.09)   71.99
rp       Target animacy               -0.06 (0.07)  -0.86    -0.14 (0.15)  -0.95
rp       Lure match                   0.05 (0.05)   0.93     0.07 (0.07)   1.01
rp       Target animacy x Lure match  -0.10 (0.10)  -0.91    -0.06 (0.15)  -0.39
tt       (Intercept)                  6.42 (0.05)   119.22   6.56 (0.07)   94.72
tt       Target animacy               -0.11 (0.06)  -1.76    -0.22 (0.08)  -2.81
tt       Lure match                   0.08 (0.04)   1.77     0.07 (0.04)   1.61
tt       Target animacy x Lure match  -0.00 (0.08)  -0.01    0.05 (0.07)   0.65

Table C.3: Model coefficients for Experiment 3. Reading measures: fp - first pass, rp - regression path, tt - total times. Contrasts coding: “match” = -0.5, “mismatch” = 0.5. Statistically significant comparisons are highlighted: yellow - significant only before a Bonferroni correction, red - significant before and after Bonferroni correction (see text). Values in parentheses represent the standard error of the estimate.
                                      Critical               Spillover
Measure  Effect                       Estimate      t value  Estimate      t value
NP lures
fp       (Intercept)                  5.72 (0.04)   159.58   5.92 (0.06)   101.46
fp       Target match                 0.04 (0.04)   0.97     0.00 (0.04)   0.12
fp       Lure match                   -0.01 (0.03)  -0.25    0.01 (0.03)   0.38
fp       Target match x Lure match    0.16 (0.07)   2.38     0.00 (0.08)   0.06
rp       (Intercept)                  5.96 (0.05)   122.63   6.27 (0.07)   85.02
rp       Target match                 0.07 (0.05)   1.46     0.18 (0.05)   3.37
rp       Lure match                   0.04 (0.06)   0.68     0.08 (0.06)   1.36
rp       Target match x Lure match    0.08 (0.09)   0.86     -0.00 (0.12)  -0.02
tt       (Intercept)                  6.23 (0.06)   107.23   6.46 (0.07)   87.34
tt       Target match                 0.13 (0.06)   2.32     0.16 (0.04)   4.23
tt       Lure match                   0.02 (0.04)   0.54     0.04 (0.04)   1.13
tt       Target match x Lure match    0.04 (0.07)   0.59     0.03 (0.07)   0.47
QP lures
fp       (Intercept)                  5.64 (0.03)   162.94   5.94 (0.06)   98.23
fp       Target match                 0.09 (0.04)   2.46     0.01 (0.04)   0.34
fp       Lure match                   0.05 (0.04)   1.55     0.03 (0.04)   0.85
fp       Target match x Lure match    -0.01 (0.06)  -0.14    0.08 (0.07)   1.2
rp       (Intercept)                  5.86 (0.05)   116.63   6.32 (0.08)   78.04
rp       Target match                 0.14 (0.06)   2.29     0.30 (0.05)   6.23
rp       Lure match                   0.05 (0.05)   0.92     0.04 (0.06)   0.58
rp       Target match x Lure match    0.11 (0.08)   1.26     0.02 (0.09)   0.28
tt       (Intercept)                  6.15 (0.06)   108.98   6.49 (0.07)   88.28
tt       Target match                 0.23 (0.05)   4.36     0.14 (0.03)   4.28
tt       Lure match                   0.03 (0.04)   0.79     0.04 (0.05)   0.84
tt       Target match x Lure match    0.12 (0.08)   1.54     0.15 (0.08)   2

Table C.4: Model coefficients for Experiment 4. Reading measures: fp - first pass, rp - regression path, tt - total times. Contrasts coding: “match” = -0.5, “mismatch” = 0.5. Statistically significant comparisons are highlighted: yellow - significant only before a Bonferroni correction, red - significant before and after Bonferroni correction (see text). Values in parentheses represent the standard error of the estimate.

Appendix D: List of changes to the Parker and Phillips (2017) stimuli in the replication experiment

“→” indicates the changes made (the left-hand side: original; the right-hand side: replacement).
“Q” indicates that the changes were made not to the sentence itself, but to the accompanying comprehension question.

• Item 13, cond 2, Q: “spokeswoman” → “spokesman” (to align with the sentence)
• Item 13, cond 3, Q: “spokesman” → “spokeswoman” (to align with the sentence)
• Item 13, cond 5, Q: “spokesmen” → “spokeswomen” (to align with the sentence)
• Item 15, cond 2,4,6: “rock star” → “pop diva” (fixing factorial manipulation: matrix subject should vary in gender across sentences)
• Item 22, cond 1: “The friendly waitress mentioned that the outspoken hostess recommended herself for the new position” → “The clumsy waitress mentioned that the outspoken hostess criticized herself for the horrible mistake” (to make the sentence as close as possible to the other sentences in the set)
• Item 23, cond 6: “worried” → “said” (to align with the other sentences in the set)
• Item 25, cond 1: “flustered” → “noisy” (to align with the other sentences in the set)
• Item 28, cond 4 and 6: removed “pretty” (in the original materials, conds 1,2,4,6 had “pretty” modifying the embedded noun, and conds 3,5 didn’t. But the noun in conds 1,2 is “cheerleader”, and in 4-6 it is “football players”. I decided to keep the length of the NP to 3 words, thus removing “pretty” in conds 4,6)
• Item 28: added complementizers
• Item 51: “refunded” → “refund” (typo)
• Item 88, Q: “... become popular with the teenagers?” → “... become popular?” (the sentence doesn’t mention teenagers)

Appendix E: Experiment 2 filler types

The following types of fillers were used:

1. With reflexives and NP antecedents
• Sentences either have an embedded complement clause or no embedding at all;
• If there is a complement clause, the reflexive is inside it;
• The reflexive matches the grammatically accessible antecedent in some sentences and mismatches it in gender in others (see Table E.2 for counts); it always mismatches any other nouns in the sentences.
• Half of the items have masculine reflexives and half have feminine ones.
Example¹: Everybody in the receiving team could see that the woman could hardly stand by herself.

2. With reflexives and QP antecedents
Similar to the above, but with QP antecedents.
Example: Interestingly, every politician seemed to portray himself as a trustworthy and absolutely honest person, according to the journalist.

3. Other
Three-sentence stories with no special constraints on them. Some of the sentences included a mistake of one of the following types (see Table E.2 for counts):
• Wrong verb form: After having unexpectedly losing her husband in a car crash, Jane were utterly depressed.
• Verb number marking: Tracy were babysitting for a new family in the neighborhood who had a very large and old house.
• Noun number marking: By the end of his studies he already spoke three European language fluently and was taking lessons in Chinese.

¹ The examples in this section were embedded in larger contexts, so they may sound unnatural by themselves.

Context1 Context2 Critical Experimental sentences 24 24 24 6 6 6 Agreement attraction 6 6 6 6 6 6 3 3 3 Fillers, referential 3 3 3 6 6 6 6 6 6 3 3 3 Fillers, quantificational 3 3 3 6 6 6 18 18 18 Fillers, other 6 6 3 6 6 6

Table E.2: Experiment 2 stimuli types and counts. White cells correspond to grammatical sentences, pink - to ungrammatical.

Bibliography

Adesola, O. P. A-bar dependencies in the Yoruba reference-tracking system. Lingua, 116(12):2068–2106, December 2006. doi:10.1016/j.lingua.2005.06.001.

Alcocer, P. and Phillips, C. Using relational syntactic constraints in content-addressable memory architectures for sentence parsing. Master’s thesis, University of Maryland, 2012.

Anderson, J. R. Human symbol manipulation within an integrated cognitive architecture. Cognitive Science, 29(3):313–341, 2005. doi:10.1207/s15516709cog0000_22.

Anderson, J. R. and Reder, L. M. The fan effect: New results and new theories.
Journal of Experimental Psychology: General, 128(2):186, 1999. doi:10.1037//0096-3445.128.2.186.

Aoshima, S., Yoshida, M., and Phillips, C. Incremental processing of coreference and binding in Japanese. Syntax, 12(2):93–134, 2009. doi:10.1111/j.1467-9612.2009.00123.x.

Badecker, W. and Straub, K. The processing role of structural constraints on interpretation of pronouns and anaphors. Journal of Experimental Psychology: Learning, Memory, and Cognition, 28(4):748–769, 2002. doi:10.1037/0278-7393.28.4.748.

Barr, D. J., Levy, R., Scheepers, C., and Tily, H. J. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68(3):255–278, 2013. doi:10.1016/j.jml.2012.11.001.

Bates, D., Kliegl, R., Vasishth, S., and Baayen, H. Parsimonious mixed models. arXiv preprint arXiv:1506.04967, 2015a.

Bates, D., Maechler, M., Bolker, B., and Walker, S. lme4: Linear mixed-effects models using Eigen and S4, 2015b. URL http://CRAN.R-project.org/package=lme4. R package version 1.1-8.

Charnavel, I. and Huang, Y. Inanimate ziji and Condition A in Mandarin. In 35th West Coast Conference on Formal Linguistics, pages 132–141. Cascadilla Proceedings Project, 2018.

Charnavel, I. and Sportiche, D. Anaphor binding: What French inanimate anaphors show. Linguistic Inquiry, 2016. doi:10.1162/ling_a_00204.

Chow, W.-Y., Lewis, S., and Phillips, C. Immediate sensitivity to structural constraints in pronoun resolution. Frontiers in Psychology, 5, June 2014. doi:10.3389/fpsyg.2014.00630.

Clements, G. N. The logophoric pronoun in Ewe: Its role in discourse. Journal of West African Languages, 10:141–177, 1975.

Collaboration, O. S. Estimating the reproducibility of psychological science. Science, 349(6251):aac4716–aac4716, August 2015. doi:10.1126/science.aac4716.

Cousineau, D. Confidence intervals in within-subject designs: A simpler solution to Loftus and Masson’s method. Tutorials in Quantitative Methods for Psychology, 1(1):42–45, 2005.
Cowan, N. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24(1):87–185, November 2001.

Culy, C. Aspects of logophoric marking. Linguistics, 32(6):1055–1094, 1994. doi:10.1515/ling.1994.32.6.1055.

Cunnings, I. and Felser, C. The role of working memory in the processing of reflexives. Language and Cognitive Processes, 28(1-2):188–219, January 2013. doi:10.1080/01690965.2010.548391.

Cunnings, I. and Sturt, P. Coargumenthood and the processing of reflexives. Journal of Memory and Language, 75:117–139, August 2014. doi:10.1016/j.jml.2014.05.006.

Cunnings, I., Patterson, C., and Felser, C. Variable binding and coreference in sentence comprehension: Evidence from eye movements. Journal of Memory and Language, 71(1):39–56, February 2014. doi:10.1016/j.jml.2013.10.001.

Cunnings, I., Patterson, C., and Felser, C. Structural constraints on pronoun binding and coreference: Evidence from eye movements during reading. Frontiers in Psychology, 6, June 2015. doi:10.3389/fpsyg.2015.00840.

Daneman, M. and Carpenter, P. A. Individual differences in working memory and reading. Journal of Verbal Learning and Verbal Behavior, 19(4):450–466, 1980.

Dillon, B. Structured access in sentence comprehension. PhD thesis, University of Maryland, College Park, 2011.

Dillon, B. Syntactic memory in the comprehension of reflexive dependencies: an overview. Language and Linguistics Compass, 8(5):171–187, May 2014. doi:10.1111/lnc3.12075.

Dillon, B., Mishler, A., Sloggett, S., and Phillips, C. Contrasting intrusion profiles for agreement and anaphora: Experimental and modeling evidence. Journal of Memory and Language, 69(2):85–103, 2013. doi:10.1016/j.jml.2013.04.003.

Eberhard, K. M., Cutting, J. C., and Bock, K. Making syntax of sense: Number agreement in sentence production. Psychological Review, 112(3):531–559, 2005.

Engelmann, F., Jäger, L. A., and Vasishth, S.
The effect of prominence and cue association in retrieval processes: A computational account (former title: The determinants of retrieval interference in dependency resolution: Review and computational modeling) [jan 4 2018 draft]. Journal of Memory and Language, draft4.

Franck, J., Vigliocco, G., and Nicol, J. Subject-verb agreement errors in French and English: The role of syntactic hierarchy. Language and Cognitive Processes, 17(4):371–404, 2002.

Gelman, A. and Carlin, J. Beyond power calculations: Assessing Type S (sign) and Type M (magnitude) errors. Perspectives on Psychological Science, 9(6):641–651, November 2014. doi:10.1177/1745691614551642.

Gelman, A. and Hill, J. Data analysis using regression and multilevel/hierarchical models, volume 1. Cambridge University Press New York, NY, USA, 2007.

Hagège, C. Les pronoms logophoriques. Bulletin de la Société de Linguistique de Paris, 69(1):287–310, 1974.

Hammerly, C., Staub, A., and Dillon, B. The grammaticality asymmetry in agreement attraction reflects response bias: Experimental and modeling evidence. draft.april.2018.

Hammerly, C., Staub, A., and Dillon, B. Response bias modulates the grammaticality asymmetry: Evidence for a continuous valuation model of agreement attraction. Presented at CUNY 2018, 2018.

Hanulíková, A., Van Alphen, P. M., Van Goch, M. M., and Weber, A. When one person’s mistake is another’s standard usage: The effect of foreign accent on syntactic processing. Journal of Cognitive Neuroscience, 24(4):878–887, 2012. doi:10.1162/jocn_a_00103.

Huang, C.-T. J. and Liu, C.-S. L. Logophoricity, attitudes, and ziji at the interface. In Long-distance reflexives, volume 33, pages 141–195. 2001.

Jäger, L. A., Benz, L., Roeser, J., Dillon, B. W., and Vasishth, S. Teasing apart retrieval and encoding interference in the processing of anaphors. Frontiers in Psychology, 6, June 2015. doi:10.3389/fpsyg.2015.00506.

Jäger, L. A., Engelmann, F., and Vasishth, S.
Retrieval interference in reflexive processing: Experimental evidence from Mandarin, and computational modeling. Frontiers in Psychology, 6, May 2015. doi:10.3389/fpsyg.2015.00617.

Jäger, L. A., Engelmann, F., and Vasishth, S. Similarity-based interference in sentence comprehension: Literature review and Bayesian meta-analysis. Journal of Memory and Language, 94:316–339, June 2017. doi:10.1016/j.jml.2017.01.004.

Jonides, J., Lewis, R. L., Nee, D. E., Lustig, C. A., Berman, M. G., and Moore, K. S. The mind and brain of short-term memory. Annual Review of Psychology, 59(1):193–224, January 2008. doi:10.1146/annurev.psych.59.103006.093615.

Kazanina, N. and Phillips, C. Differential effects of constraints in the processing of Russian cataphora. The Quarterly Journal of Experimental Psychology, 63(2):371–400, 2010.

Kazanina, N., Lau, E. F., Lieberman, M., Yoshida, M., and Phillips, C. The effect of syntactic constraints on the processing of backwards anaphora. Journal of Memory and Language, 56(3):384–409, April 2007. doi:10.1016/j.jml.2006.09.003.

King, J., Andrews, C., and Wagers, M. Do reflexives always find a good antecedent for themselves? Presented at CUNY 2012, 2012a.

King, J., Andrews, C., and Wagers, M. Do reflexives always find a grammatical antecedent for themselves? In 25th annual CUNY conference on human sentence processing, page 67. The CUNY Graduate Center New York, NY, 2012b.

Kush, D. Respecting Relations: Memory Access and Antecedent Retrieval in Incremental Sentence Processing. PhD thesis, University of Maryland, College Park, 2013.

Kush, D., Lidz, J., and Phillips, C. Relation-sensitive retrieval: Evidence from bound variable pronouns. Journal of Memory and Language, 82(82):18–40, 2015. doi:10.1016/j.jml.2015.02.003.

Lago, S., Shalom, D. E., Sigman, M., Lau, E. F., and Phillips, C. Agreement attraction in Spanish comprehension. Journal of Memory and Language, 82:133–149, July 2015. doi:10.1016/j.jml.2015.02.002.

Lewis, R. L. and Vasishth, S.
An activation-based model of sentence processing as skilled memory retrieval. Cognitive Science, 29(3):375–419, 2005. doi:10.1207/s15516709cog0000_25.

Martin, A. E. and McElree, B. Memory operations that support language comprehension: Evidence from verb-phrase ellipsis. Journal of Experimental Psychology: Learning, Memory, and Cognition, 35(5):1231–1239, 2009. doi:10.1037/a0016271.

Martin, A. E. and McElree, B. Direct-access retrieval during sentence comprehension: evidence from sluicing. Journal of Memory and Language, 64(4):327–343, May 2011. doi:10.1016/j.jml.2010.12.006. URL https://doi.org/10.1016%2Fj.jml.2010.12.006.

McElree, B. Accessing recent events. Psychology of Learning and Motivation, 46:155–200, 2006. doi:10.1016/s0079-7421(06)46005-9.

McElree, B., Foraker, S., and Dyer, L. Memory structures that subserve sentence comprehension. Journal of Memory and Language, 48(1):67–91, 2003.

Morey, R. D. Confidence intervals from normalized data: A correction to Cousineau (2005). Tutorials in Quantitative Methods for Psychology, 4(2):61–64, 2008. doi:10.20982/tqmp.04.2.p061.

Moulton, K. and Han, C.-h. C-command vs. scope: An experimental assessment of bound variable pronouns. Language, toappear.

Nairne, J. S. A feature model of immediate memory. Memory & Cognition, 18(3):251–269, 1990.

Nicenboim, B., Vasishth, S., Engelmann, F., and Suckow, K. Exploratory and confirmatory analyses in sentence processing: A case study of number interference in German. Open Science Framework, 2017. doi:10.17605/OSF.IO/MMR7S.

Nicenboim, B. and Vasishth, S. Models of retrieval in sentence comprehension: A computational evaluation using Bayesian hierarchical modeling. Journal of Memory and Language, 99:1–34, April 2018. doi:10.1016/j.jml.2017.08.004.

Nicol, J. and Swinney, D. The role of structure in coreference assignment during sentence comprehension. Journal of Psycholinguistic Research, 18(1):5–19, 1989.

Nicol, J., Foster, K., and Veres, C.
Subject–verb agreement processes in comprehension. Journal of Memory and Language, 36:569–587, 1997.

Osterhout, L. and Mobley, L. A. Event-related brain potentials elicited by failure to agree. Journal of Memory and Language, 34(6):739–773, 1995.

Osterhout, L., Bersick, M., and McLaughlin, J. Brain potentials reflect violations of gender stereotypes. Memory & Cognition, 25(3):273–285, 1997.

Parker, D. and Phillips, C. Negative polarity illusions and the format of hierarchical encodings in memory. Cognition, 157:321–339, 2016. doi:10.1016/j.cognition.2016.08.016.

Parker, D. and Phillips, C. Reflexive attraction in comprehension is selective. Journal of Memory and Language, 94:272–290, 2017. doi:10.1016/j.jml.2017.01.002.

Parker, D., Shvartsman, M., and Van Dyke, J. A. The cue-based retrieval theory of sentence comprehension: New findings and new challenges. Language processing and disorders. Newcastle: Cambridge Scholars Publishing, 2017.

Parker, D. The Cognitive Basis for Encoding and Navigating Linguistic Structure. PhD thesis, University of Maryland: College Park, 2014.

Patil, U., Vasishth, S., and Lewis, R. L. Retrieval interference in syntactic processing: The case of reflexive binding in English. Frontiers in Psychology, 7, May 2016. doi:10.3389/fpsyg.2016.00329.

Postal, P. M. Remarks on English long-distance anaphora. Style, 40(1-2):7–18, 2006. doi:10.5325/style.40.1-2.7.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2014. URL http://www.R-project.org/.

Raab, D. H. Division of psychology: Statistical facilitation of simple reaction times. Transactions of the New York Academy of Sciences, 24(5 Series II):574–590, 1962.

Ratcliff, R. A theory of memory retrieval. Psychological Review, 85(2):59–108, 1978. doi:10.1037/0033-295x.85.2.59.

Rayner, K. Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3):372, 1998.
Reinhart, T. and Reuland, E. Reflexivity. Linguistic Inquiry, 24(4):657–720, 1993.

Reinhart, T. M. The syntactic domain of anaphora. PhD thesis, Massachusetts Institute of Technology, 1976.

Sells, P. Aspects of logophoricity. Linguistic Inquiry, 18(3):445–479, 1987.

Slioussar, N. and Malko, A. Gender agreement attraction in Russian: production and comprehension evidence. Frontiers in Psychology, 7, 2016. doi:10.3389/fpsyg.2016.01651.

Sloggett, S. When errors aren’t: how comprehenders selectively violate Binding Theory. PhD thesis, University of Massachusetts, Amherst, 2017.

Sloggett, S. and Dillon, B. Do comprehenders violate the Binding Theory? Depends on your point of view. in prep.

Sturt, P. The time-course of the application of binding constraints in reference resolution. Journal of Memory and Language, 48(3):542–562, April 2003. doi:10.1016/s0749-596x(02)00536-3.

Tang, C.-C. J. Chinese reflexives. Natural Language & Linguistic Theory, 7(1):93–121, 1989.

Tanner, D., Nicol, J., and Brehm, L. The time-course of feature interference in agreement comprehension: Multiple mechanisms and asymmetrical attraction. Journal of Memory and Language, 76:195–215. doi:10.1016/j.jml.2014.07.003.

Tucker, M. A., Idrissi, A., and Almeida, D. Attraction effects for verbal gender and number are similar but not identical: Self-paced reading evidence from modern standard Arabic. submitted. URL https://matthew-tucker.github.io/files/papers/gender-attraction-msa-comprehension.pdf.

Van Dyke, J. A. and Johns, C. L. Memory interference as a determinant of language comprehension. Language and Linguistics Compass, 6(4):193–211, 2012.

Van Dyke, J. A. and McElree, B. Retrieval interference in sentence comprehension. Journal of Memory and Language, 55(2):157–166, 2006.

Vasishth, S., Brussow, S., Lewis, R., and Drenhaus, H. Processing polarity: How the ungrammatical intrudes on the grammatical. Cognitive Science: A Multidisciplinary Journal, 32(4):685–712, June 2008.
doi:10.1080/03640210802066865.

Vasishth, S., von der Malsburg, T., and Engelmann, F. What eye movements can tell us about sentence comprehension. Wiley Interdisciplinary Reviews: Cognitive Science, 4(2):125–134, December 2012. doi:10.1002/wcs.1209.

Vasishth, S., Jaeger, L. A., and Nicenboim, B. Feature overwriting as a finite mixture process: Evidence from comprehension data. arXiv preprint arXiv:1703.04081, 2017. URL https://arxiv.org/pdf/1703.04081.pdf.

Vigliocco, G. and Nicol, J. Separating hierarchical relations and word order in language production: Is proximity concord syntactic or linear? Cognition, 68:B13–B29, 1998.

von der Malsburg, T. and Angele, B. False positives and other statistical errors in standard analyses of eye movements in reading. Journal of Memory and Language, 94:119–133, 2017. doi:10.1016/j.jml.2016.10.003.

Wagers, M. W., Lau, E. F., and Phillips, C. Agreement attraction in comprehension: Representations and processes. Journal of Memory and Language, 61:206–237, 2009.

Xiang, M., Dillon, B., and Phillips, C. Illusory licensing effects across dependency types: ERP evidence. Brain & Language, 108:40–55, 2008. doi:10.1016/j.bandl.2008.10.002.

Xiang, M., Grove, J., and Giannakidou, A. Dependency-dependent interference: NPI interference, agreement attraction, and global pragmatic inferences. Frontiers in Psychology, 4, 2013. doi:10.3389/fpsyg.2013.00708.